
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2008-02-10 13:39:09


Howard Hinnant wrote:
> On Feb 8, 2008, at 6:19 PM, Phil Endecott wrote:
>> If you can share the code, I'll see how it performs on my Linux
>> systems.
>
> Thanks Phil! Sorry for the delay. I wanted to update my tests to use
> the latest std:: syntax and rerun them. First, here's the code:

OK, I have made some minimal modifications so that it works with my
atomic-ops-and-futex mutex, and I get the results below. All of these
results should be treated as provisional, since I'm not 100% confident
that my mutex implementation is actually correct.
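
For concreteness, here is a minimal sketch of the kind of mutex I mean,
roughly following the pattern in Ulrich Drepper's "Futexes Are Tricky"
(x86 Linux with GCC's __sync builtins assumed; my actual implementation
differs in its details):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class FutexMutex
{
      int state_;   // 0 = unlocked, 1 = locked, 2 = locked with waiters

      static long futex(int* addr, int op, int val)
      { return syscall(SYS_futex, addr, op, val, 0, 0, 0); }

public:
      FutexMutex() : state_(0) {}

      void lock()
      {
            int c = __sync_val_compare_and_swap(&state_, 0, 1);
            if (c != 0) {
                  // Contended: mark "locked with waiters" and sleep in
                  // the kernel until the holder wakes us.
                  if (c != 2)
                        c = __sync_lock_test_and_set(&state_, 2);
                  while (c != 0) {
                        futex(&state_, FUTEX_WAIT, 2);
                        c = __sync_lock_test_and_set(&state_, 2);
                  }
            }
      }

      void unlock()
      {
            if (__sync_fetch_and_sub(&state_, 1) != 1) {
                  // There may have been waiters: wake one of them.
                  state_ = 0;
                  futex(&state_, FUTEX_WAKE, 1);
            }
      }
};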

For my 1GHz uniprocessor VIA-C3 system:

Lock-in-order: 8 s
Retry: 7 s
Retry and yield: 7 s

For a 3.6 GHz Intel Pentium 4 dual-core system:

Lock-in-order: still running.....
Retry: 13 s
Retry and yield: 10 s
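
For reference, this is roughly what I understand the three strategies
to be (Howard's actual test code may differ in the details):
lock-in-order always takes the two chopsticks in a fixed global order;
retry locks one, try_locks the other, and starts again on failure;
retry-and-yield additionally yields before retrying. A sketch of the
last one, for two generic mutexes:

#include <thread>
#include <utility>

template <class Mutex>
void lock_both_retry_and_yield(Mutex& m1, Mutex& m2)
{
      Mutex* first = &m1;
      Mutex* second = &m2;
      for (;;) {
            first->lock();
            if (second->try_lock())
                  return;                    // got both chopsticks
            first->unlock();                 // contended: back off fully
            std::this_thread::yield();       // the "and yield" part
            std::swap(first, second);        // next time, start with the
                                             // one that was busy
      }
}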

So the most obvious initial observation is that you really don't want
to run this sort of code on a multiprocessor system, which agrees with
what you saw:

> Interesting side note: During the first run my disk backup utility
> kicked in. This extra competition for resources actually made the
> test case run faster!

I imagine that the allocation of threads to processors could also
explain things like:

> I have no explanation for the longer time in the third run.

I know that there is a certain amount of theoretical work about
affinity of threads to CPUs but I don't know what OSes actually do in
practice; do they migrate threads based on any run-time measures?

Something that you might find interesting is that I modified the
chopstick as follows:

struct Chopstick
{
      Chopstick() : random_next_(1) {}
      Mutex mut_;
      unsigned random_next_;
      char padding[8000];     // <<<============
};

My aim was to prevent the chopsticks from being close together in the
cache. This only works if your Mutex object actually contains the
shared state itself, rather than being a pointer or some other sort of
handle. By doing this, I brought the dual-core system's time with the
retry-and-yield locking strategy down to 6 seconds!
(Finally better than the system with a third of the clock speed (and a
tenth of the price)! Yay!)

I think that this sort of cache-friendly design is important; the
effect that I've measured here would occur in real programs, not just
in artificial benchmarks like this. Perhaps the Mutex needs its own
allocator, or something?
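
A less heavy-handed way to express the same idea is to pad each shared
object out to a whole number of cache lines; the 64-byte line size
below is an assumption, and 8000 bytes as above is overkill but also
defeats adjacent-line prefetching:

template <class T, unsigned LineSize = 64>
struct CachePadded
{
      T value;
      // Round the object up to a multiple of the (assumed) line size so
      // that two adjacent CachePadded objects never share a cache line.
      char padding[LineSize - (sizeof(T) % LineSize)];
};

// e.g. CachePadded<Chopstick> chopsticks[5] (with the 8000-byte member
// removed from Chopstick) keeps each mutex on its own line, whereas
// Chopstick chopsticks[5] can pack several mutexes into one line and
// make the cores ping-pong that line between their caches.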

I got a further improvement, down to 5 s on the dual-core system, by
using my yield-based mutex (i.e. no futex at all, just a
yield-in-a-loop to lock). So the scale of this benchmark (5 threads)
is sufficiently small that the benefits of the futex (i.e. the thread
is not woken prematurely) are not outweighed by the overhead of
recording which thread is waiting for what. My guess is that your
additional yield() is part of the same phenomenon: if the number of
philosophers were increased until my Mutex<Futex> were better than my
Mutex<Yield>, then the benefit of the additional yield in the multilock
would also go away.
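
For clarity, the yield-based mutex is essentially just the following (a
sketch, written here in the C++0x std:: syntax rather than with my own
atomic ops):

#include <atomic>
#include <thread>

class YieldMutex
{
      std::atomic<bool> locked_;
public:
      YieldMutex() : locked_(false) {}

      void lock()
      {
            // Spin, but give up the time slice on every failed attempt,
            // so whoever holds the lock gets scheduled and can release it.
            while (locked_.exchange(true, std::memory_order_acquire))
                  std::this_thread::yield();
      }

      void unlock()
      {
            locked_.store(false, std::memory_order_release);
      }
};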

I plan to do some more measurements with different systems and
different mutex implementations (e.g. the evil-spin-then-wait and
friends). But we should remember that this is only an artificial benchmark.

Something that might be interesting to measure is the total number of
context switches (per CPU) while the benchmark runs. I can see that
with /usr/bin/time -f '%c %w'. If you have any similar facilities you
might like to see how correlated the total number of context switches
is with the total runtime.
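
If your time(1) doesn't support -f, something like the following (using
getrusage(), which Linux and the BSDs provide) reports the same
counters from inside the process:

#include <sys/resource.h>
#include <cstdio>

void report_context_switches()
{
      rusage ru;
      if (getrusage(RUSAGE_SELF, &ru) == 0)
            std::printf("voluntary: %ld  involuntary: %ld\n",
                        ru.ru_nvcsw, ru.ru_nivcsw);
}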

Regards, Phil

