Subject: Re: [boost] [lockfree::fifo] Review
From: Chris M. Thomasson (cristom_at_[hidden])
Date: 2009-12-21 08:44:26
"Tim Blechmann" <tim_at_[hidden]> wrote in message
On 12/20/2009 08:44 PM, Gottlob Frege wrote:
> > On Sun, Dec 20, 2009 at 11:17 AM, Tim Blechmann <tim_at_[hidden]> wrote:
> >> On 12/20/2009 04:57 PM, Chris M. Thomasson wrote:
> >>> "Tim Blechmann" <tim_at_[hidden]> wrote in message
> >>> news:4B2E4492.3050201_at_klingt.org...
> >>>>> Well, IMO, it should perform better because the producer and
> >>>>> consumer
> >>>>> are
> >>>>> not thrashing each other wrt the head and tail indexes.
> >>>> the performance difference is almost the same.
> >>> Interesting. Thanks for profiling it. Have you tried aligning
> >>> everything on
> >>> cache line boundaries? I would try to ensure that the buffer is
> >>> aligned and
> >>> padded along with the head and tail variables. For 64-byte cache line
> >>> and
> >>> 32-bit pointers you could do:
> > How about we go through the ring buffer by steps of 2^n - 1 such that
> > each next element is on a separate cache line? ie instead of
> > m_head = (m_head == T_depth - 1) ? 0 : (m_head + 1);
> > we do
> > m_head = (m_head + 7) % T_depth;
> > You still use each slot, just in a different order. You calculate 'n'
> > to be whatever you need based on the cell size. As long as the
> > resultant step size is prime mod T_depth.
> > I'm not sure if the false-sharing avoidance would be worth the cost of
> > using up more cache lines. Probably depends on how full the queue is,
> > etc.
I think it would be beneficial if the producer could push enough items to
fill up a cache line, and the consumer just pop's a cache line worth of
objects at a time.
> yes, i would guess, the performance characteristics differ depending on
> the number of elements in the buffer. also, the current implementation
> can easily be adapted to enqueue/dequeue multiple objects efficiently ...
For the single-producer/consumer queue you can treat the access like a
transaction. For instance, the producer can read/write to the buffer up to
the tail and commit the update just by incrementing the head. A consumer can
read/write to the produced portion of the buffer and commit by bumping the
tail. This is can good because you can avoid copying is come cases and allow
the thread to use the data in the buffer directly. Here is an example of
such an algorithm:
This version uses eventcounts to enforce boundary conditions because I did
not want to introduce any nasty spin-waiting. The behavior of this example
code is basically equal to a socket/pipe connection between two threads.