Subject: [boost] [lockfree::fifo] Review
From: Andrew Venikov (andrew.venikov_at_[hidden])
Date: 2009-12-04 12:01:14


I decided to start a new thread on this topic, since the last thread
is pretty old by now and not everyone will see it.

(you can disregard my reply to the previous thread as this one is more
complete)

First of all, I think that having a library of lock-free
algorithms/containers would be of great benefit to a large number of
boost users. On the other hand, having implemented several lock-free
algorithms, I know that they are notoriously hard to get right. We'll
need all the help we can get from knowledgeable people in the field. The
implementation needs to be thoroughly reviewed. Let me make a first
attempt at that.

In this e-mail I'll be primarily concentrating on the algorithm
correctness rather than the interface.

First, let me start with the choice of algorithm for the fifo
queue: Maged Michael and Michael Scott's "Simple, Fast, and Practical
Non-Blocking and Blocking Concurrent Queue Algorithms". I really think
it is the best choice. Having said that, however, I need to point out
that there are some inherent limitations in this algorithm.

The first limitation is that you can't peek at the head of the queue
without dequeueing the element first, hence you can't get an interface
that conforms to std::queue. But I truly believe that this will be the
case with any lock-free queue implementation, so we should abandon the
hope of having a lock-free queue that looks the same as std::queue.

Second, this algorithm makes certain assumptions about the ability to
access memory that has been freed: it can try to read memory that has
already been de-allocated (see further down for a description of how).
Several debug implementations of malloc/free place each allocated
element on its own page, and when the element is freed, the page is
marked protected for both reading and writing. That means this algorithm
can only work with a subset of custom allocators. Your "free_list", for
example, can't work with it, since a node may actually be freed when you
call "deallocate". The "caching_freelist" and "static_freelist", on the
other hand, work just fine because they never return the memory back to
the system.
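
To illustrate the property that makes those two allocators safe here,
a freelist along the following lines (my own sketch, written with C++11
atomics rather than the library's primitives; the names are illustrative,
not the library's API) recycles nodes through an internal stack and
never hands memory back to the OS, so a thread dereferencing a stale
node pointer still touches mapped, queue-owned memory:

     #include <atomic>
     #include <cstdlib>

     template <typename Node>    // assumes sizeof(Node) >= sizeof(void *)
     class recycling_freelist    // illustrative, not the library's class
     {
         struct slot { slot * next; };
         std::atomic<slot *> head_;
     public:
         recycling_freelist() : head_(0) {}

         Node * allocate()
         {
             // Pop a recycled node if one is available. (A production
             // pop needs its own ABA protection, e.g. a tag.)
             slot * s = head_.load();
             while (s && !head_.compare_exchange_weak(s, s->next))
                 ; // s is refreshed on failure
             if (s)
                 return reinterpret_cast<Node *>(s);
             // Fall back to the system allocator; the caller constructs
             // the node in place.
             return static_cast<Node *>(std::malloc(sizeof(Node)));
         }

         void deallocate(Node * n)
         {
             // Recycle instead of free(): the memory stays mapped for
             // the lifetime of the freelist, which is what the M&S
             // algorithm needs when it reads through a stale pointer.
             slot * s = reinterpret_cast<slot *>(n);
             s->next = head_.load();
             while (!head_.compare_exchange_weak(s->next, s))
                 ; // s->next is refreshed on failure
         }
     };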

Third, and the biggest limitation, is the algorithm's assumption that
you can store a pointer to an element plus an ABA tag in a contiguous
block of memory that can be passed to a CAS instruction. Not all
platforms provide a DCAS instruction, which is why you try to work
around the issue with "packed pointers", i.e. storing the ABA tag in the
part of the pointer that is presumably never used. The problem with
that, however, is that some libc implementations of malloc/free do use
all the bits in the pointer for their own needs, so you can't assume
that you can meddle with them.
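
For concreteness, the trick in question looks something like this on
x86-64, where user-space pointers currently use only the low 48 bits
(my own illustration, not your tagged_ptr; it encodes exactly the
assumption that a hostile malloc can break):

     #include <cstdint>

     struct node; // whatever the queue stores

     // Pack a 16-bit ABA tag into the unused top bits of the pointer.
     inline std::uint64_t pack(node * p, std::uint16_t tag)
     {
         return (reinterpret_cast<std::uint64_t>(p) & 0x0000FFFFFFFFFFFFull)
              | (static_cast<std::uint64_t>(tag) << 48);
     }

     inline node * unpack_ptr(std::uint64_t v)
     {
         // Assumes the original pointer had zero top bits; if the
         // allocator used them, this reconstructs the wrong address.
         return reinterpret_cast<node *>(v & 0x0000FFFFFFFFFFFFull);
     }

     inline std::uint16_t unpack_tag(std::uint64_t v)
     {
         return static_cast<std::uint16_t>(v >> 48);
     }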

My experience with this algorithm shows that the safest approach is to
use a pre-allocated array for the queue and replace pointers with
indexes into the array. Assuming that an index is sufficiently smaller
than a pointer, you can use the remaining bits to store the ABA tag.
This alleviates both points 2 and 3. It does mean, however, that the
queue can't have an unbounded size: the capacity has to be specified
when the queue is instantiated.
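
A sketch of what I mean (again my own illustration): with all nodes in
one fixed array, a single 32-bit word can hold a 16-bit index plus a
16-bit ABA tag and be swapped with an ordinary CAS, and even a stale
index still lands inside queue-owned memory:

     #include <cstdint>

     // 16-bit array index + 16-bit ABA tag in one CAS-able 32-bit word.
     inline std::uint32_t make_ref(std::uint16_t index, std::uint16_t tag)
     {
         return (static_cast<std::uint32_t>(tag) << 16) | index;
     }

     inline std::uint16_t ref_index(std::uint32_t r)
     {
         return static_cast<std::uint16_t>(r & 0xFFFFu);
     }

     inline std::uint16_t ref_tag(std::uint32_t r)
     {
         return static_cast<std::uint16_t>(r >> 16);
     }

     // Usage: nodes[ref_index(r)] replaces a pointer dereference; the
     // array is allocated once at queue construction, so no memory is
     // ever returned to the system while the queue lives.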

Fourth, on platforms where CAS is implemented using LL/SC (such as
PowerPC and MIPS), it is possible to livelock. The problem with LL/SC on
most platforms is that the reservation granularity for marking LL's
address as "dirty" isn't just the word addressed by LL, but the whole
cache line where it resides. On these platforms it is important to make
sure that each "node" of the queue is located on a separate cache line.
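
E.g. something along these lines (alignas is C++11; 64 bytes is a
typical line size, but the real line and reservation-granule sizes
should be queried per platform):

     template <typename T>
     struct alignas(64) padded_node // one node per cache line
     {
         padded_node * next;
         T data;
         // alignas(64) also pads sizeof to a multiple of 64, so stores
         // to a neighbouring node cannot invalidate this node's LL/SC
         // reservation.
     };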

Here's the code of your enqueue:

     bool enqueue(T const & t)
     {
         node * n = alloc_node(t);

         if (n == NULL)
             return false;

         for (;;)
         {
             atomic_node_ptr tail (tail_); //1
             read_memory_barrier(); //2
             atomic_node_ptr next (tail->next); //3

             if (likely(tail == tail_)) //4
             {
                 if (next.get_ptr() == 0) //5
                 {
                     if ( tail->next.cas(next, n) )//6
                     {
                         tail_.cas(tail, n);
                         return true;
                     }
                 }
                 else
                     tail_.cas(tail, next.get_ptr());
             }
         }
     }

1. You're missing a write barrier after the call to alloc_node().
Rationale: alloc_node sets the "next" pointer of the newly allocated
node to NULL. That is a store to shared memory. The store of NULL to the
next pointer has to be observable to other threads BEFORE they observe
the inserted node. If you don't insert a write barrier, it is possible
for other threads to see the new element in the list before the
element's next pointer is initialized with NULL.
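
A sketch of the fix (write_memory_barrier() is a name I'm making up
here, as the store-side counterpart of your read_memory_barrier()):

         node * n = alloc_node(t);   // stores n->next = NULL
         if (n == NULL)
             return false;
         write_memory_barrier();     // make the NULL store visible before
                                     // the CAS at line 6 can publish n
         for (;;)
             ...                     // loop body unchanged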

2. You've misplaced the read barrier.
Rationale: Line 4 checks the validity of the load done on line 3. That
means the read barrier should sit between lines 3 and 4. If you don't
have a read barrier after loading tail->next, then this load can happen
after the load of tail_ on line 4, which defeats the purpose of line 4.
Here's a scenario that makes it clear:

State: the queue is well-formed and tail_ points to the last element.
1. Thread A executes line 1.
State: the queue is unchanged and Thread A has a local copy of tail_.
2. Since there's no read barrier after line 3, the loads at lines 3 and
4 happen out-of-order: tail_ is pre-loaded for the comparison at line 4.
3. Thread B removes the element pointed to by tail_ and advances tail_
further.
State: tail_ is advanced; the old element is removed and given back to
the allocator. A's local copy of tail_ becomes stale.
4. Thread A executes line 3.
State: A's copy of tail is stale. This copy is used to load memory that
has been reclaimed by the allocator and possibly modified. The value of
that memory happens to be 0.
5. Thread A executes line 4. Since tail_ has been preloaded, line 4
succeeds.
6. Line 5 succeeds.
7. Line 6 succeeds. Boom: you've just modified a piece of memory not
controlled by the queue.

You still need a barrier between lines 1 and 3, but there only a data
dependency barrier will suffice, which is a weaker form of a read
barrier and is a no-op on almost all platforms, since data-dependent
loads are handled by processors automatically.
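
Concretely, I'd order the loads like this (a sketch;
data_dependency_barrier() is my made-up name for the weaker barrier):

         atomic_node_ptr tail (tail_);       //1
         data_dependency_barrier();          //2: orders the dependent
                                             //   load on line 3
         atomic_node_ptr next (tail->next);  //3
         read_memory_barrier();              //   keeps line 3's load ahead
                                             //   of the re-load of tail_
         if (likely(tail == tail_))          //4
             ...                             // rest unchanged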

3. Your tail_ and head_ should be volatile. Furthermore, the getters of
your tagged_ptr (like get_ptr, operator->() and operator*()) should
return pointers/references to volatile types.
Rationale: If your tail_ is not volatile, then the compiler is free to
load tail_ just once, and line 4 effectively becomes a no-op.
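
I.e. roughly (a sketch; the exact spelling depends on your tagged_ptr):

         tagged_ptr volatile tail_;   // every read is a fresh load
         tagged_ptr volatile head_;

         // and in tagged_ptr, along the lines of:
         // node volatile * get_ptr() const;
         // node volatile & operator* () const;
         // node volatile * operator->() const;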

Here's the code of your dequeue:

     bool dequeue (T * ret)
     {
         for (;;)
         {
             atomic_node_ptr head (head_); //1
             read_memory_barrier(); //2

             atomic_node_ptr tail(tail_); //3
             node * next = head->next.get_ptr(); //4

             if (likely(head == head_)) //5
             {
                 if (head.get_ptr() == tail.get_ptr()) //6
                 {
                     if (next == 0)
                         return false;

                     tail_.cas(tail, next);
                 }
                 else
                 {
                     *ret = next->data; //7
                     if (head_.cas(head, next)) //8
                     {
                         dealloc_node(head.get_ptr()); //9

                         return true;
                     }
                 }
             }
         }
     }

1. Again, you need to put the read barrier between lines 4 and 5; only a
data dependency barrier is needed at line 2.
2. You need to add a write barrier after line 7, for the same reason as
in the enqueue.
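
Putting both fixes together, the top of the loop would look like this
(same made-up barrier names as above):

         atomic_node_ptr head (head_);        //1
         data_dependency_barrier();           //2: the weaker barrier
                                              //   suffices here
         atomic_node_ptr tail (tail_);        //3
         node * next = head->next.get_ptr();  //4
         read_memory_barrier();               //   keeps line 4's load ahead
                                              //   of the re-load of head_
         if (likely(head == head_))           //5
             ...

and further down:

                 *ret = next->data;           //7
                 write_memory_barrier();      //   complete the store to
                                              //   *ret before the CAS
                 if (head_.cas(head, next))   //8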

Hope this was helpful,

Andy Venikov.

