Subject: Re: [boost] [fiber] new version in vault
From: Helge Bahmann (hcb_at_[hidden])
Date: 2009-12-01 05:05:26
On Mon, 30 Nov 2009, Phil Endecott wrote:
> Helge Bahmann wrote:
>> if you don't believe me, just disassemble pthread_mutex_lock/unlock on any
>> linux system.
> When I last looked at the glibc source, I believe I found quite a lot of code
> that needed to test whether the mutex was e.g. recursive or not, on top of
> that core functionality.
Sure, there is a lot of code, but the fast-path check for a normal mutex
(properly marked with __builtin_expect) is right at the top of the
function, and it drops straight into CAS, so with the compiler doing its
job it *should* not matter.
(However, the fetch of the thread id from TLS should be moved out of the
fast path as well -- the compiler is probably not clever enough to move it
out on its own.)
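The shape being described can be sketched roughly as follows. This is an illustrative toy, not glibc's actual code: names like `toy_mutex` and the trivial spinning slow path are made up; the point is only the likely-branch plus single-CAS fast path.

```c
#include <stdint.h>

/* Hypothetical sketch of the fast path described above: a plain
   (non-recursive) mutex costs one branch plus one CAS when
   uncontended; everything else lives in an out-of-line slow path.
   Names are illustrative, not glibc's. */

typedef struct { volatile int state; int kind; } toy_mutex;
#define TOY_MUTEX_NORMAL 0

static void toy_mutex_lock_slow(toy_mutex *m)
{
    /* stand-in slow path: real code would futex-wait here */
    while (!__sync_bool_compare_and_swap(&m->state, 0, 1))
        ;
}

static inline void toy_mutex_lock(toy_mutex *m)
{
    /* fast-path check, marked likely as the post describes */
    if (__builtin_expect(m->kind == TOY_MUTEX_NORMAL, 1)) {
        /* straight into CAS: 0 = unlocked, 1 = locked */
        if (__builtin_expect(__sync_bool_compare_and_swap(&m->state, 0, 1), 1))
            return;                 /* uncontended: done */
    }
    toy_mutex_lock_slow(m);         /* contended or special mutex type */
}

static inline void toy_mutex_unlock(toy_mutex *m)
{
    __sync_lock_release(&m->state); /* release-store of 0 */
}
```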
>> As for the space overhead of a pthread_mutex_t... if you cannot pay that,
>> just use hashed locks.
> A hashed lock library would be welcome here, I'm sure.
Yes, this would be a really helpful addition to Boost.Thread --
implementing fallbacks for atomic operations is just not feasible without
it.
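For illustration, the hashed-lock idea might look something like this. All names are invented (Boost.Thread has no such API), and plain CAS spinlocks stand in for the table entries to keep the sketch self-contained; in a real fallback they could be full pthread mutexes shared among many objects.

```c
#include <stdint.h>

/* Sketch of hashed locks: instead of paying a lock object per
   protected variable, keep a small fixed table of locks and pick
   one by hashing the variable's address. */

#define LOCK_TABLE_SIZE 64                 /* power of two, assumed tunable */

static volatile int lock_table[LOCK_TABLE_SIZE];

static volatile int *lock_for(const void *addr)
{
    uintptr_t h = (uintptr_t)addr >> 4;    /* drop alignment bits */
    h ^= h >> 7;                           /* cheap mixing step */
    return &lock_table[h & (LOCK_TABLE_SIZE - 1)];
}

static void hashed_lock(const void *addr)
{
    volatile int *l = lock_for(addr);
    while (!__sync_bool_compare_and_swap(l, 0, 1))
        ;                                  /* spin until acquired */
}

static void hashed_unlock(const void *addr)
{
    __sync_lock_release(lock_for(addr));   /* release-store of 0 */
}
```

The trade-off is that unrelated objects hashing to the same slot contend with each other, which is why the table size would need tuning.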
>> Last note: calling "sched_yield" on contention is about the *worst* thing
>> you can do -- linux/glibc will call futex(..., FUTEX_WAIT, ...) instead on
>> contention, which can properly suspend the thread exactly until the lock is
>> released *and* is about 1/2 - 2/3 the cost of sched_yield in the case the
>> lock was released before the thread could be put to sleep.
>> Whatever gains you think you may achieve, you won't.
> My work on this was backed up with extensive benchmarking, disassembly of the
> generated code, and other evaluation. You can find some of the results in
> the list archive from about two years ago. There are many different types of
> system with different characteristics (uniprocessor vs multiprocessor, two
> threads vs 10000 threads, etc etc). Two particular cases that I'll mention
I guess this is the code you used for testing?
I would say that your conclusions are valid for ARM only (I don't know the
architecture or libc peculiarities), for x86 there are some subtleties
which IMHO invalidate the comparison.
Your spinlock implementation defers to __sync_lock_test_and_set, which in
turn generates an "xchgl" instruction, and NOT a "lock xchgl" instruction
(yes, these gcc primitives are tricky, which is why I avoid them). Section
8.2.2 of Intel's architecture manual (Volume 3B) states that only the
"lock" variants of the atomic instructions serialize reads with respect to
writes, therefore using a non-locked variant may allow reads to be
scheduled out of the critical section and observe invalid data.
Additionally, __sync_synchronize just does nothing on x86, so this does
not help either.
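One way to sidestep the subtlety being raised is to build the spinlock on __sync_val_compare_and_swap, which GCC documents as a full barrier (on x86 it compiles to a lock-prefixed cmpxchg). A minimal sketch, with the names my own:

```c
/* Spinlock built on a documented full-barrier primitive, avoiding
   the ordering question raised about __sync_lock_test_and_set. */

typedef struct { volatile int locked; } spinlock;

static void spin_lock(spinlock *s)
{
    /* 0 = free, 1 = held; CAS is a full barrier per GCC's docs */
    while (__sync_val_compare_and_swap(&s->locked, 0, 1) != 0) {
#if defined(__i386__) || defined(__x86_64__)
        __asm__ __volatile__("pause");   /* spin-wait hint on x86 */
#endif
    }
}

static void spin_unlock(spinlock *s)
{
    /* release-store of 0, so writes inside the critical section
       cannot move past the unlock */
    __sync_lock_release(&s->locked);
}
```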
Your VIA C3 will be classified as "i586" (not i686, although it lacks
only a few instructions), and will therefore not use NPTL/futex at all,
but the old LinuxThreads implementation. (Yes, this corner case is badly
supported by glibc.) I guess that if you force linking against the
"wrong" i686 pthread library, the pthread_mutex_* numbers will improve
drastically.
Your futex implementation is, however, correct, and your P4 numbers, for
example, match my numbers quite well -- and pthread_mutex_* is
> - An inline spin lock is the only thing that doesn't involve a function call,
> so leaf functions remain leaf functions and are themselves more likely to be
> inlined or otherwise optimised. On systems with small caches or small flash
> chips where code size is important, this is a significant benefit.
I'm not sure I follow here -- with small cache sizes, inlining is
*not* preferable, right?
> - sched_yield is a great deal easier to port between operating systems than
> anything else, other than a spin lock.
sched_yield has one very valid use case: SCHED_RR tasks of the same
priority yielding to each other. Using it on lock contention is just
calling into the scheduler saying "I'm waiting for something, but I won't
tell you what -- please have a guess at what is best to do next". And when
you have already gone to all the lengths of detecting contention, this is
a relatively poor thing to do. Sometimes you may resort to it as an ugly
hack, but I'm not sure it should be promoted as "good practice" by
codifying it in a library.
A home-grown futex-based implementation is of course valid and useful,
but on most architectures it will not be faster, and when it is not, I
fail to see why it would not be preferable to fix the problems at the
libc level.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk