From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-10-23 18:28:49
I have been benchmarking different mutex implementations on Linux. The
variations that I have considered are the following:
NO_LOCKING - no mutex at all, as a baseline for comparison.
SPINLOCK - uses an atomic operation and a tight loop.
YIELD - uses an atomic operation like SPINLOCK, and calls sched_yield()
if the lock is held by another thread.
PTHREADS - uses the pthread_mutex_*() functions; this is what the
current Boost mutexes do (as I understand it). The pthread_mutex
functions are implemented using the futex() system call.
FUTEX - this is my own implementation of the same algorithm that
PTHREADS uses internally, based on the futex() system call, but it uses
inlining to avoid a function call in the uncontended case.
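To make the SPINLOCK and YIELD variants concrete, here is a minimal
sketch along these lines using GCC's __sync builtins (available since
gcc 4.1); this is illustrative rather than the exact code I benchmarked:

    // Minimal sketch of the SPINLOCK idea using GCC's __sync builtins.
    class spinlock {
        volatile int locked_;   // 0 = free, 1 = held
    public:
        spinlock() : locked_(0) {}
        void lock() {
            // Atomically store 1 and fetch the old value; non-zero
            // means another thread already holds the lock.
            while (__sync_lock_test_and_set(&locked_, 1)) {
                // Busy-wait.  The YIELD variant calls sched_yield()
                // here instead of spinning.
            }
        }
        void unlock() {
            __sync_lock_release(&locked_);  // release barrier, store 0
        }
    };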
Apart from the PTHREADS method, these should all be considered unproven
code; if anyone would like to review them for any subtle concurrency
issues or other problems, you are very welcome to do so.
For background details of futexes, I suggest Ulrich Drepper's paper
"Futexes Are Tricky".
All of these results were obtained with g++ 4.1.2 with -O3.
Data Structure Sizes
The size of the mutex object, in bytes, is as follows:
SPINLOCK: 4 [*]
YIELD: 4 [*]
FUTEX: 4 [*]
PTHREADS: 24 (the size of pthread_mutex_t with 32-bit glibc/NPTL)
[*] in all of these cases I believe that a single byte, or perhaps even
a single bit in the case of SPINLOCK and YIELD, would be sufficient;
there may be some obscure alignment issue though.
I believe that the large size of the pthreads mutex object is because
it supports a number of variants such as recursive and checking
mutexes, and timeouts, etc. all in the same structure. My other
implementations are all very basic with none of these features.
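The pthreads figure is easy to check on whatever platform is at hand:

    #include <pthread.h>
    #include <cstdio>

    int main() {
        // With 32-bit glibc/NPTL this typically prints 24; 64-bit
        // builds are larger still.
        std::printf("sizeof(pthread_mutex_t) = %u\n",
                    (unsigned)sizeof(pthread_mutex_t));
        return 0;
    }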
Execution Time, Non-Contended Case
In many applications mutexes are rarely contended, so optimising the
speed of the uncontended locking and unlocking operations is a common
objective. I have measured this overhead on three machines:
egypt: 1GHz VIA C3 x86 uniprocessor linux-2.6.19
ghana: 266MHz Intel XScale ARM uniprocessor linux-2.6.19
atum: 3.6GHz Intel Pentium 4 x86 2-processor linux-2.6.17
The times in nanoseconds for one lock and one unlock are:
             egypt  ghana  atum
NO_LOCKING       0      0     0
SPINLOCK        17     30   5.7
YIELD           13     38   6.4
FUTEX           50     95    26
PTHREADS       115    986    39
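These numbers come from a tight lock/unlock loop; a self-contained
approximation for the PTHREADS case (a sketch, not my exact harness)
looks like this:

    #include <pthread.h>
    #include <sys/time.h>
    #include <cstdio>

    int main() {
        const long N = 100000000;  // 10^8 uncontended lock/unlock pairs
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        timeval t0, t1;
        gettimeofday(&t0, 0);
        for (long i = 0; i < N; ++i) {
            pthread_mutex_lock(&m);
            pthread_mutex_unlock(&m);
        }
        gettimeofday(&t1, 0);
        // Convert the elapsed time to nanoseconds per pair.
        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9
                     + (t1.tv_usec - t0.tv_usec) * 1e3) / N;
        std::printf("%.1f ns per lock/unlock pair\n", ns);
        return 0;
    }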
Execution Time, Contended Case
See my detailed results, linked below, for numbers (the runtimes given
are for 10^8 operations). The execution time in the contended case
includes both the execution time of the lock and unlock functions and
the time spent waiting for the lock to become available; it's not
straightforward to distinguish these and I haven't tried to. The broad
pattern is that SPINLOCK, naturally, is very slow, and the other
methods are relatively fast. In my benchmark, YIELD does better than
FUTEX, but this may well be a feature of the structure of my benchmark;
I would expect FUTEX to do better with large numbers of threads.
PTHREADS gets worse more quickly than the other methods as the number
of mutexes and threads increases, perhaps because the larger size of
its structures is less cache-friendly.
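For the contended case, the benchmark has roughly this shape (again a
simplified sketch using the pthreads mutex, with thread and iteration
counts chosen for illustration):

    #include <pthread.h>
    #include <cstdio>

    // Several threads hammer one mutex; total runtime divided by the
    // operation count gives a rough per-operation cost that includes
    // the time spent waiting for the lock.
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    static void* worker(void*) {
        for (long i = 0; i < 25000000; ++i) {  // 4 * 25M = 10^8 ops
            pthread_mutex_lock(&m);
            ++counter;                         // tiny critical section
            pthread_mutex_unlock(&m);
        }
        return 0;
    }

    int main() {
        pthread_t t[4];
        for (int i = 0; i < 4; ++i) pthread_create(&t[i], 0, worker, 0);
        for (int i = 0; i < 4; ++i) pthread_join(t[i], 0);
        std::printf("counter = %ld\n", counter);
        return 0;
    }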
Code Size
The code size overhead in bytes for each method is as follows.
(Note that this is with -O3, not -Os. My impression with -Os is that
it does even less inlining than I expected, e.g. even a static function
that's only called once is not inlined. I may investigate this further.)
             x86  ARM
NO_LOCKING     0    0
SPINLOCK      42   32
YIELD         31   56
FUTEX         66   96
PTHREADS      25   28
I suggest that we seriously consider using an inline futex
implementation of mutexes as the default choice on Linux, rather than
calling the pthread_mutex functions. This is about twice as fast (on
x86) and saves a lot of data size. On the other hand, it increases the
code size, as the table above shows.
The code size and speed can be improved further by using a simpler
method that just calls sched_yield(). However, this method is less
suitable for use as a default implementation because it will have worse
complexity in applications with contention: futex() knows which thread
to wake up next, while sched_yield() doesn't.
Most ARM processors have only one atomic instruction, swap. My
implementations of SPINLOCK, YIELD and FUTEX make use of this.
PTHREADS has abysmal performance on ARM; I suspect, but have not yet
checked, that it somehow emulates the atomic instructions that it
doesn't have by calling into the kernel.
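For what it's worth, the swap-based test-and-set is the classic
pre-ARMv6 pattern; a sketch (my exact inline assembly may differ) is:

    // Test-and-set built on the ARM SWP instruction: atomically
    // stores newval to *addr and returns the old value.
    // (SWP is deprecated from ARMv6 on, in favour of ldrex/strex.)
    static inline int arm_swap(volatile int* addr, int newval) {
        int oldval;
        __asm__ __volatile__("swp %0, %1, [%2]"
                             : "=&r" (oldval)
                             : "r" (newval), "r" (addr)
                             : "memory");
        return oldval;
    }

    // lock() is then: while (arm_swap(&locked_, 1) != 0) { /* spin */ }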
You can find all of the mutex implementation and benchmarking scripts
and the full results here:
I hope this is of interest.