
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2008-02-29 14:06:55


Anthony Williams wrote:
> My spreadsheet with the results in can be downloaded from
> http://www.justsoftwaresolutions.co.uk/files/dining_results.ods

Excellent work!

> Not all of Andrey's runs are in there, but enough to get a reasonable mean and
> standard deviation for most cases.
>
> This shows some interesting observations. Firstly, the padding (8k between
> entries) is hugely beneficial for the small (8-byte) mutexes (boost 1.35, and my
> BTS-based variant), but the benefit is less for the CRITICAL_SECTION based
> mutex, which is 24 bytes. If you consider that the data is only 4 bytes, and
> each cache line is 64 bytes, this is unsurprising --- with a 24-byte mutex and
> 4-byte data, you can only fit 2.5 entries in a cache line, so it's only adjacent
> entries that clash, whereas with an 8-byte mutex and 4-byte data you get 5
> entries per cache line, so there is much more conflict. This makes me wonder if
> it's worth padding the mutex to 64 bytes, so it occupies an entire cache line,
> or maybe adding a 64-byte alignment requirement.

I think that the mutex should be grouped together with the data that
it's protecting, e.g.

// BAD:
mutex mut[64];
Thing things[64];

// BETTER:
pair<mutex,Thing> protected_things[64];

The presence of the "things" will separate out the mutexes, avoiding
the cache-thrashing that you'd get if they were packed together;
furthermore, if you're lucky, the "thing" will be pre-fetched into your
cache as soon as you lock its mutex. You could even do this:

template <typename T>
class Locked: public T {
   mutex mut;   // e.g. boost::mutex
public:
   void lock() { mut.lock(); }
   void unlock() { mut.unlock(); }
};

Locked<Thing> locked_things[64];

For small "things" it would still help to pad to the cache line size.
Various ways to do this come to mind: add a compiler-specific alignment
attribute (I think there's a proposal for a standard way to do this in
C++0x, right?), add an explicit padding field dependent on sizeof(T),
or for dynamically allocated objects write a Locked::operator new that
rounds up the size requested.
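
To make the first of those concrete, here is a rough sketch using the
compiler-specific attributes that exist today; the CACHE_ALIGNED macro is
mine, and the 64-byte figure is an assumption about the target CPU's cache
line size:

#if defined(__GNUC__)
#  define CACHE_ALIGNED __attribute__((aligned(64)))
#elif defined(_MSC_VER)
#  define CACHE_ALIGNED __declspec(align(64))
#else
#  define CACHE_ALIGNED   // unknown compiler: no extra alignment
#endif

template <typename T>
class CACHE_ALIGNED Locked: public T {
   mutex mut;   // e.g. boost::mutex
public:
   void lock() { mut.lock(); }
   void unlock() { mut.unlock(); }
};

With the attribute in place, sizeof(Locked<T>) gets rounded up to a multiple
of 64, so each element of an array like locked_things[64] starts on its own
cache line.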

Of course it's wasteful to pad for mutexes that are rarely contested,
and in many cases mutexes are rarely contested. Wouldn't it be great
if our compilers could do profile-driven memory layout optimisation?
Sounds like a good PhD topic to me. In the meantime, I wonder what
else could be done at the library level to help programs not thrash
their caches?

> Secondly, the yield is quite beneficial in some cases (e.g. 58% improvement),
> but often detrimental (up to 33% degradation). Overall, I think it is not worth
> adding.

I still find that a mutex that ONLY uses yield gives very good
performance. This is because I don't yet have a benchmark where there
are large numbers of _unrelated_ threads running on the system.
"Dining Philosophers" is not representative of many real-world
problems, unfortunately....
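
For concreteness, here is a minimal sketch of what I mean by a yield-only
mutex, assuming GCC's atomic builtins and sched_yield() (on Windows,
InterlockedExchange and Sleep(0) would play the same roles); it is an
illustration of the idea, nothing more:

#include <sched.h>

class yield_mutex {
   volatile long locked_;   // 0 = free, 1 = held
public:
   yield_mutex() : locked_(0) {}
   void lock() {
      // Spin, giving up the timeslice on every failed attempt;
      // never block in the kernel.
      while (__sync_lock_test_and_set(&locked_, 1))
         sched_yield();
   }
   void unlock() {
      __sync_lock_release(&locked_);
   }
};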

> The next thing I noticed was quite surprising. On the machines with 16 hardware
> threads, the 32-thread versions often ran faster than the 16-thread versions. I
> can only presume that this is because the increased thread count meant less
> contention, and thus fewer blocking waits.

Yes.

> Finally, though the BTS-based mutex is faster than the boost 1.35 one in most
> cases, the CRITICAL_SECTION based version is actually very competitive, and
> fastest in many cases. I found this surprising, because it didn't match my prior
> experience, but might be a consequence of the (generally undesirable) properties
> of CRITICAL_SECTIONS: some threads end up getting all the locks, and running to
> completion very quickly, thus reducing the running thread count for the
> remainder of the test.

I think fairness is over-rated. An unfair schedule will reduce the
mean time to completion and give better cache hit rates.

Cheers,

Phil.

