Boost logo

Boost :

From: Howard Hinnant (hinnant_at_[hidden])
Date: 2008-02-08 15:37:01


On Feb 8, 2008, at 2:42 PM, Anthony Williams wrote:

> Howard Hinnant <hinnant <at> twcny.rr.com> writes:
>
>> template <class L1, class L2>
>> void
>> lock(L1& l1, L2& l2)
>> {
>> while (true)
>> {
>> {
>> unique_lock<L1> __u1(l1);
>> if (l2.try_lock())
>> {
>> __u1.release();
>> break;
>> }
>> }
>> std::this_thread::yield();
>
>
> snipped rest of multi-lock code.
>
>> 3. The yields are absolutely necessary to get the high performance
>> mentioned in the previous note.
>
> I find this comment intriguing. Is this always the case, or does it
> depend on
> the specifics of how a mutex is implemented on a given platform?

That is a question I've been asking and myself for a long time,
unfortunately without a satisfactory answer. Here's my experience,
and a few guesses...

On 2 core Macs and 2 core Windows, I've measured improved speeds with
the yields, using a "dining philosophers" style simulation. I haven't
tested on Linux.

Here are results (dating around 3 years ago):

In order: 63 seconds.
try/back off/without yield: 47 seconds.
try/back off/with yield: 34 seconds.

These numbers come from 5 threads each wanting two mutexes each
(classic dining philosopher setup). The same runs on single core
machines looked like:

In order: 293 seconds.
try/back off/without yield: not tested.
try/back off/with yield: 278 seconds.

The single core results were measured on different machines than the
dual core results and so aren't comparable to the dual core results.

I set up an additional "dining philosophers" test where each
philosopher set at the corner of a cube (8) and attempted to attain a
mutex associated with the edge of the cube adjacent to the corner (12
mutexes). That is, each diner required "3 forks" to eat. The results
were:

In order: 226 seconds.
try/back off/without yield: not tested.
try/back off/with yield: 77 seconds.

Unfortunately I didn't run the last two tests without the yields.

I believe the try/back off algorithms are faster than the in order
algorithm because in the latter a thread sits on a mutex while blocked
for another. Thus a "fork" that somebody else could be using is
idled. But in the try/back off algorithms the diners are "more
polite". If they can not eat, they offer their fork up so that
someone else may be able to eat.

Now what about the yields... Blocking on a mutex should be very
similar to a yield. But the first test is showing a significant speed
advantage to the yields. The above isn't noise, the tests were run
many times and the averages are reported. On running the tests on a
Mac I viewed the CPU activity in the "Activity Monitor" application.
When running without the yields, the activity monitor looked like:


(I hope you can see the graphic). The red in the graphic indicates
"system time" while the green indicates "user time". In case you
can't see the graphic, it looks like around 20% to 25% of the time is
spent in "system", with the rest spent in "user". When running with
the yields, the monitor looks like:


Now the "system time" is almost completely gone (maybe 7% or 8% system
time). So more "user time" translates to faster application (not so
much "system time"). But this still doesn't answer your precise
question: Is this behavior really portable? I'm afraid I don't
really know for sure. It has just been my experience on Mac and
Windows. The Windows tests were done 3 years ago, and I asked someone
to do them for me. I didn't have the machine available to investigate
myself. The Mac tests were done at the same time, but I've replicated
the results more recently. I haven't replicated the Windows results
more recently.

As long as I'm posting graphics (I hope nobody minds), here's another
interesting snapshot of the 8 diner/ 12 fork / 3 forks at a time
case. Both snapshots are taken as the test is ending:

Lock in order:


try/back off/with yield:


The graphs show that the lock-in-order algorithm isn't making full use
of the cpu's in the final seconds of the test. The reason for this is
that some diners "hogged" the system and got thru eating while other
diners sat around not getting their work done. In the end there
weren't enough diners left to keep both cpu's busy (I don't recall if
we are looking at 1 or 2 diners at the end of this test).

try/back off/with yield graphic shows the same effect as described
above, but over a *much* shorter time period. Both cpu's are fully
busy in "user time" for nearly the entire test.

I hope this gives you some insights, despite the lack of a clear
answer to your question. Perhaps you or others can present a good
explanation for this behavior regarding the yields. :-)

-Howard






Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk