Subject: Re: [boost] [lockfree] Faster ring-buffer queue implementation
From: Gavin Lambert (gavinl_at_[hidden])
Date: 2014-01-16 18:07:06
On 16/01/2014 22:06, Quoth Alexander Krizhanovsky:
> The queue requires thread local pointers to head and tail (the same
> approach is used in LMAX Disruptor). Since each thread can have access
> to many queues, then the local pointers are implemented as an array of
> ThrPos structures and the array is a member of particular queue
> instance. However, to quickly access the queue we need small consequent
> thread identifiers.
> For example if we have two threads with identifiers 0 and 1 and two
> queues Q0 and Q1, then the thread will access their local pointers
> through Q0.thr_p_ and Q0.thr_p_ for first queue and Q1.thr_p_
> and Q1.thr_p_ for the second queue. This can be continued to
> arbitrary number of threads and queues. We only need to know on the
> queue construction how many threads will be using it.
What I don't like is the idea that a given thread is always "thread 1".
Imagine a situation where you have an existing queue that's already
used by one producer and one consumer, each of which is already "thread
0". Now another producer thread is added; this of necessity must be
thread 1. But that thread is also a consumer of another SPSC queue --
and yet that second queue must be declared as having two consumers,
despite "consumer thread 0" not existing for it.
Granted it's only a small memory wastage in this case (trivial next to
the cache alignment), but keeping all the thread ids straight in a
complex web of queues could get messy, plus having to lie to the queue
about how many producers and consumers it actually has, making it do
I'd be happier if the thread id were a parameter to push/pop instead of
using thread local data for that -- even thread-local is *too* global
for this. (Since producer and consumer ids are independent, the id is
only needed in that one spot anyway.)
>> And interleaving the per-thread head/tail indexes seems vulnerable to
>> false sharing.)
> False sharing is bad due to costly requests for ownership message of
> cache coherence protocol. tail and head pointers are modified only by
> the owning thread which is preferably should be binded to particular CPU
> (however, OS scheduler doesn't reschedule a thread to other CPU without
> need). Other threads/CPUs are only reads tail/head pointers of the thread.
A write to any of the 64 bytes on the cache line will temporarily block
reads to any other byte on that line. Having one writer and the rest
readers just avoids write races; it has no bearing on false sharing.
> There is sense to align ThrPos to cache line, so different CPUs won't
> update the same cache lines. I tried this, but didn't get intelligible
> performance improvement on my laptop (2 cores), 12 cores/2 processors
> and 40 cores/4 processors servers. Probably, there is sense to make more
Have you tried splitting the head/tail apart? I'm not really sure why
they're in the same structure to begin with -- you're already passing
n_producers and n_consumers separately, so you should be able to have
separate arrays of the appropriate lengths. I guess it saves one
malloc/free, but I don't think that matters.
It might be useful to have a variant that has n_producers and
n_consumers as template parameters -- this would allow constant-size
arrays to be used, and would let the compiler apply unrolling
optimisations to the iteration loops. And it's reasonable to expect
that this will be known at compile time, given the way that thread ids
are being set up anyway.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk