
Boost Users :

Subject: Re: [Boost-users] lockfree::queue restrictions and performance on linux
From: Leon Mlakar (leon_at_[hidden])
Date: 2014-01-13 05:46:23


It would appear that it was something in my code that did not agree with
lockfree::queue on Linux after all. Triggered by your response I
recompiled and reran the code on Linux, and this time the results of the
lockfree::queue were on par with the locking alternative, which is
closer to the expected behavior. It remains superior on the other two
platforms, which again matches the expected outcome.

Unfortunately, in the meantime I made quite a few changes to other
parts of the code too, so I'll have to go back through the diffs and
hopefully find the root cause.

So please disregard the performance complaints below, and thank you for
the pointers.

Best regards,

Leon

> On 13/01/14 07:54, Gavin Lambert wrote:
>> On 28/12/2013 23:50, Quoth Leon Mlakar:
>>> One of the libraries that got my attention is lockfree, the
>>> lockfree::queue in particular, since it almost completely meets my
>>> needs. Except that I've been wondering why it places the
>>> requirement for a trivial destructor on the items it stores? I mean,
>>> that really reduces its usability. Is this something inherent to its
>>> design? I guess this was discussed during the review, but a quick
>>> dig through the archives didn't turn up anything on it. The
>>> reason for asking is that I have forced the queue to accept
>>> std::shared_ptr and this seems to work almost fine, except that there
>>> seems to be an extra reference left to the last popped item, but
>>> that's something I can live with.
>>
>> It's definitely unsafe to violate those requirements. I asked the
>> maintainer the same question and after he mentioned why it was there
>> I confirmed the issue when I looked at the code.
>>
>> Basically there are cases when the value can get copied multiple
>> times, or copied after the destructor is called, and cases when the
>> destructor is never called at all. This breaks pretty much anything
>> that isn't a simple POD type.
>> If you want to store a shared_ptr in there, probably the safest thing
>> to do is to store a shared_ptr* instead. Note that this does mean
>> your producer will probably have to make a new copy of the const
>> shared_ptr& that you give it, and the consumer and destructor will
>> have to delete these extra copies.
>>
> Ah, thanks for clearing this up. It was not my intention to actually
> violate those requirements, except in the experimental code. It just
> struck me as odd, that's all.
>
> I think I'll either be looking for alternatives or, if none are
> suitable, try to redesign the code (it's a back port of Java/C# code
> to C++) into a single producer/single consumer pattern and use
> spsc_queue instead. shared_ptr with its full semantics is something I
> don't think I'd want to go without.
>> (If you don't *really* need the memory to be shared ownership -- eg.
>> you know that the producer will create new items and they'll only be
>> potentially shared after the consumer pops them, then you'll get
>> better performance by queuing a T or T* and only constructing the
>> shared_ptr<T> in the consumer.)
>>
>>> Another thing was its performance - I ran a sort of quick benchmark
>>> (a queue of shared_ptr's to an item containing an atomic<int>
>>> counter; items were popped from the queue and immediately pushed
>>> back after decrementing the counter) and its performance was next
>>> to stellar on OSX (with multiple consumers and producers something
>>> like forty times faster than std::queue with a mutex) but very poor
>>> on Linux. While I can understand that locking runs fast on Linux
>>> due to its spinlock/futex implementation of mutexes, I have no
>>> explanation for the poor performance of lockfree::queue (with two
>>> producer/consumer threads, std::queue with a mutex was about 15
>>> times faster, and the run was about 20 times slower than on OSX on
>>> similar hardware). Did anybody else observe the same poor
>>> performance? I used Boost 1.54 and tested with both gcc 4.8 and
>>> clang 3.5, with similar results.
>>
>> Did you check sizeof(shared_ptr<T>) and .is_lock_free()? Depending on
>> memory alignment and the size of your queue payload objects it might
>> be falling back on a lock-based implementation. Also, try Boost 1.55.
>> There were quite a lot of changes between 1.54 and 1.55 to Boost.Atomic.
>
> No, I did not. I just ran the same code on three OSes (OS X Mavericks,
> Ubuntu 12.04 with the most recent kernel update, Win7) using the same
> hardware. And it surely cannot be a locking problem, since the
> locking alternative cleanly outperformed the lockfree::queue on Linux,
> and only on Linux. Just to reiterate: on Linux I was not so much
> surprised by the performance of the locking alternative as by the
> poor performance of the lockfree::queue. Even more so since in another
> experiment I replaced the std::queue of the locking alternative
> with the lockfree::queue and used it as if it were not thread-safe
> (using std::mutex to synchronize pushes and pops) and got similar
> results - that is, the locking alternative using lockfree::queue
> (granted, it sounds sort of weird :-)) won by a large margin against
> lockfree::queue without external locks.
>
> It is of course possible that there is something in my experimental
> code that collides with Linux. To be categorical about the
> performance I'd need to strip the code down to the bare bones, since
> the experimental code was a sort of abstraction of the envisioned
> deployment. At the moment I do not yet see the point of doing so,
> since at some point I'll have to use the queue in the real code after
> all, and the results of an academic experiment are not likely to be
> helpful.
>>
>> Also note that lock-free tends to be slower on average than
>> lock-based (particularly for the multi-producer/consumer-safe
>> implementations), unless there's a fair bit of contention. So it's
>> not always the best choice, depending on what you're doing with it.
>>
> It is a high-contention scenario I coded on purpose. A typical use
> case will not have more threads than there are cores, and they'll
> exchange a fair amount of data through the queues. Unfortunately the
> existing specs prohibit the use of more efficient approaches.
>
> Best regards,
>
> Leon
>


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net