Subject: Re: [boost] [concurrent] Fast, lock-based queue implementation
From: David Stone (david_at_[hidden])
Date: 2017-07-29 00:02:48
Sorry for the late response, I was in the middle of a cross-country move
when I posted this.
It is probably easy to overlook the tests because they currently live in a
directory that does not have the word "test" in its path anywhere. I believe
the queue has a fairly strong suite of tests, but I would be happy to be
proven wrong so that I can add more.
I will be updating the documentation soon to include the results of all
benchmarks, but here are some overall performance numbers for context.
Hardware: Intel i7 8 core processor (sixth generation, Skylake)
Platform: gcc or clang on Linux or mingw, with -O3 -march=native -flto
I am able to process 1 billion int per second with 1 reader thread, 1
writer thread, and 1000 int added per batch. This throughput scales
approximately linearly with batch size up until this point, so a batch size
of 500 gives you approximately 500 million int per second. After that, it's
a little trickier to improve throughput, but as far as I can tell the ideal
number on this type of processor gets you up to 3 billion int per second
with three reader threads and 3-10 writer threads (all gave approximately
the same result) writing in batches of 2400 messages.
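To illustrate why throughput scales roughly linearly with batch size in a lock-based design like this one, here is a minimal sketch of the idea (my own illustration; `batched_queue`, `push`, and `append` are hypothetical names, not the library's actual API): appending a whole batch under a single lock acquisition amortizes the synchronization cost over the batch size.

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch of a lock-based queue's write path.
template<typename T>
class batched_queue {
public:
    // One lock acquisition per element: N elements cost N locks.
    void push(T value) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_container.push_back(std::move(value));
    }
    // One lock acquisition per batch: N elements cost 1 lock,
    // so per-element synchronization overhead shrinks as 1/N.
    template<typename Iterator>
    void append(Iterator first, Iterator last) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_container.insert(m_container.end(), first, last);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_container.size();
    }
private:
    mutable std::mutex m_mutex;
    std::vector<T> m_container;
};
```

With a batch of 1000, a writer pays for one lock round-trip per 1000 elements instead of per element, which is the effect the numbers above show.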
Using the queue in the worst possible way can degrade performance pretty
significantly from this point. The worst case for the queue is to have many
(10+) readers doing almost no work per item with a single writer adding 1
element per batch. This case gives you only 550,000 int per second.
To give a sense of the middle of the range: 1 reader and 4 writers writing
40 elements per batch gives about 100 million int per second through the
queue.
On Visual Studio 2015 (Release mode), these numbers are about half on the
high end and unchanged on the low end. I have not yet investigated why this
is or how to optimize the performance for this compiler further.
How performance changes with different element types depends in part on how
the queue is used. If you can reserve enough memory that the queue never
grows beyond that capacity, performance is more predictable. The cost of a
write is then determined by the cost of constructing the object via the
constructor you chose; in the limit, this is at least the 'size' of the
object. On the read side, however, no such
cost applies because the pop_all function is O(1): it's just a swap of the
underlying container. This means that large elements do not have the same
cost as they would in most queue designs.
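A minimal sketch of the pop_all idea (my own illustration, not the library's exact implementation): the reader swaps the entire underlying container out from under the lock, so the critical section is O(1) regardless of how many elements are queued or how large they are.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

template<typename T>
class swap_queue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_container.push_back(std::move(value));
    }
    // O(1) critical section: swap the whole container out.
    // No per-element work (and no per-element copies, even for
    // large T) happens while the lock is held.
    std::vector<T> pop_all() {
        std::vector<T> result;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            result.swap(m_container);
        }
        return result;
    }
private:
    std::mutex m_mutex;
    std::vector<T> m_container;
};
```

The reader then iterates over the returned batch outside the lock, which is why large element types do not penalize readers the way they would in a pop-one-at-a-time design.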
I will think a bit about supporting other methods of queue shutdown, but I
will only consider them if they do not add any per-element time or space
overhead.