Subject: Re: [boost] [fiber] new version in vault
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-11-30 18:22:51
Helge Bahmann wrote:
> Am Monday 30 November 2009 18:48:21 schrieb Phil Endecott:
>> Helge Bahmann wrote:
>> > why would you ever want to use a mutex that does not properly inform the
>> > system scheduler until when the calling thread should be suspended (in
>> > case of contention) for "true" inter-thread (or even inter-process)
>> > coordination?
>> When you believe that the probability of contention is very small, and
>> you care only about average and not worst-case performance, and you
>> wish to avoid the overhead (e.g. time or code or data size) of
>> "properly informing" the system scheduler.
> in the non-contention case, a properly implemented platform mutex will
> 1. compare_exchange with acquire semantic
> 2. detect contention with a single compare of the value obtained in step 1
> < your protected code here >
> 3. compare_exchange with release semantic
> 4. detect contention with a single compare of the value obtained in step 3 and
> wake suspended threads
> A clever implementation of course handles contention out-of-line.
> if you don't
> believe me, just disassemble pthread_mutex_lock/unlock on any linux system.
When I last looked at the glibc source, I believe I found quite a lot
of code that needed to test whether the mutex was e.g. recursive or
not, on top of that core functionality. On NPTL it was especially
slow, though perhaps we can now consider that historic. I was able to
get a very worthwhile performance improvement by implementing this
stuff myself using atomic ops and futex syscalls, rather than calling
glibc's pthread_* functions.
> FWIW, I just exercised a CAS-based mutex as you proposed (using the
> __sync_val_compare_and_exchange intrinsic), in a tight lock/unlock cycle on
> a Linux/PPC32 system and... it is 25% *slower* than the glibc
> pthread_mutex_lock/unlock based one! This is the "no contention case" you aim
> to optimize for... (a "proper" CAS-based mutex using inline assembly and with
> weaker memory barriers, is 10% faster, mainly because it eliminates the
> function call overhead). BTW we are talking about gains of ~5-10 clock cycles
> per operation here...
When I benchmarked this stuff a couple of years ago, I believe that I
found that gcc would often get its static branch prediction wrong for
the CAS loop. Using either branch hints in the source or
feedback-driven optimisation I could get the "expected" performance.
> As for the space overhead of a pthread_mutex_t... if you cannot pay that, just
> use hashed locks.
A hashed lock library would be welcome here, I'm sure.
> Last note: calling "sched_yield" on contention is about the *worst* thing you
> can do -- linux/glibc will call futex(..., FUTEX_WAIT, ...) instead on
> contention, which can properly suspend the thread exactly until the lock is
> released *and* is about 1/2 - 2/3 the cost of sched_yield in the case the
> lock was released before the thread could be put to sleep.
> Whatever gains you think you may achieve, you won't.
My work on this was backed up with extensive benchmarking, disassembly
of the generated code, and other evaluation. You can find some of
the results in the list archive from about two years ago. There are
many different types of system with different characteristics
(uniprocessor vs multiprocessor, two threads vs 10000 threads, etc
etc). Two particular cases that I'll mention are:
- An inline spin lock is the only thing that doesn't involve a function
call, so leaf functions remain leaf functions and are themselves more
likely to be inlined or otherwise optimised. On systems with small
caches or small flash chips where code size is important, this is a
- sched_yield is a great deal easier to port between operating systems
than anything else, other than a spin lock.