
Threads-Devel :

From: Dmitriy Vyukov (dvyukov_at_[hidden])
Date: 2008-05-04 15:20:41


> From: Anthony Williams <anthony_at_[hidden]>
>
>
>> Why does it always execute at least one
>> BOOST_INTERLOCKED_COMPARE_EXCHANGE? In the documentation I don't see any
>> requirement that try_lock() has to synchronize memory in the case of
>> failure. Why isn't it implemented as:
>>
>> bool try_lock()
>> {
>>     long old_count=active_count;
>>     while(!(old_count&lock_flag_value))
>>     {
>>         long const current_count=BOOST_INTERLOCKED_COMPARE_EXCHANGE(
>>             &active_count,(old_count+1)|lock_flag_value,old_count);
>>         if(current_count==old_count)
>>         {
>>             return true;
>>         }
>>         old_count=current_count;
>>     }
>>     return false;
>> }
>>
>
> No particular reason.
Then I think it's worth doing, because the whole point of try_lock() is
"lock the mutex or impose as little overhead as possible".

> Thanks for the suggestion. Actually, I've been
> intending to change the implementation to use a BitTestAndSet
> instruction where it can, so try_lock becomes:
>
> return !bit_test_and_set(&active_count,lock_flag_bit);
>
> But even then, it might be worth adding a simple non-interlocked read
> beforehand to check for the flag, and only do the BTS if it's not set.
>
>

Well, this complicates the situation a bit.

This version (1):
return !bit_test_and_set(&active_count,lock_flag_bit);
has the very good property that it requests the cache line in the
modified state right away.

And this version (2):
if (active_count & lock_flag_value)
    return false;
return !bit_test_and_set(&active_count,lock_flag_bit);
requests the cache line in the shared state first, and only after that in
the modified state.

If you are optimizing for try_lock() success, then version (1) is better.
If you are optimizing for try_lock() failure, then version (2) is better.
It's reasonable to optimize for success because that is the uncontended
case. But it's also reasonable to make a failed try_lock() as lightweight
as possible... I'm not sure which version is better in the end...
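
To make the comparison concrete, here is a minimal self-contained sketch
of the two variants (I'm using std::atomic purely as a stand-in for the
BOOST_INTERLOCKED_*/bit_test_and_set machinery; active_count and
lock_flag_bit are just the names from above, this is not the actual
Boost code):

#include <atomic>

std::atomic<long> active_count(0);
long const lock_flag_bit = 31;
long const lock_flag_value = 1L << lock_flag_bit;

// Version (1): a single interlocked RMW.
// The very first access already requests the cache line in modified state.
bool try_lock_v1()
{
    long old = active_count.fetch_or(lock_flag_value, std::memory_order_acquire);
    return !(old & lock_flag_value); // succeeded iff the flag was clear before
}

// Version (2): plain read first, interlocked RMW only if the flag looks clear.
// The read requests the line in shared state; a successful RMW then has to
// upgrade it to modified state, i.e. two coherence transactions on success.
bool try_lock_v2()
{
    if (active_count.load(std::memory_order_relaxed) & lock_flag_value)
        return false; // fail cheaply, without any interlocked operation
    long old = active_count.fetch_or(lock_flag_value, std::memory_order_acquire);
    return !(old & lock_flag_value);
}

Version (2) is basically the classic test-and-test-and-set pattern applied
to try_lock().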

I've seen a 3x scalability degradation under high load on a quad-core
between the following versions:

1:
return 0 == XCHG(&var, 1); // request cache-line in modified state

2:
int local = var;                            // request cache-line in shared state
return local == CAS(&var, local, local+1);  // request cache-line in modified state

My understanding is that the reason for the scalability degradation is
precisely the cache coherence traffic.
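
In case anyone wants to reproduce the effect, here is a rough sketch of
the kind of micro-benchmark I mean (again with std::atomic as a stand-in
for XCHG/CAS; the thread count and iteration count are arbitrary, and
this is not the exact code I measured):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> var(0);

// Version 1: unconditional exchange - the line goes straight to modified state.
void hammer_xchg(long iters)
{
    for (long i = 0; i < iters; ++i)
        var.exchange(1, std::memory_order_relaxed);
}

// Version 2: read (shared state) followed by CAS (upgrade to modified state).
void hammer_read_cas(long iters)
{
    for (long i = 0; i < iters; ++i)
    {
        long local = var.load(std::memory_order_relaxed);
        var.compare_exchange_strong(local, local + 1, std::memory_order_relaxed);
    }
}

template<class F>
double run(F f, int threads, long iters)
{
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i)
        pool.emplace_back(f, iters);
    for (auto& t : pool)
        t.join();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
}

int main()
{
    int const threads = 4;        // e.g. one thread per core on a quad-core
    long const iters = 10000000;
    std::printf("xchg:     %.3f s\n", run(hammer_xchg, threads, iters));
    std::printf("read+cas: %.3f s\n", run(hammer_read_cas, threads, iters));
}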

Dmitriy V'jukov

