Threads-Devel:
From: Dmitriy V'jukov (dvyukov_at_[hidden])
Date: 2008-05-04 19:31:21
> From: Anthony Williams <anthony <at> justsoftwaresolutions.co.uk>
>
>>> Thanks for the suggestion. Actually, I've been
>>> intending to change the implementation to use a BitTestAndSet
>>> instruction where it can, so try_lock becomes:
>>>
>>> return !bit_test_and_set(&active_count,lock_flag_bit);
>>>
>>> But even then, it might be worth adding a simple non-interlocked read
>>> beforehand to check the flag, and only doing the BTS if it's not set.
>>>
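For concreteness, here is a minimal sketch of such a BTS-based try_lock. This
is my illustration only, not the checked-in Boost code: bit_test_and_set is
emulated with the GCC __sync_fetch_and_or builtin (which a compiler can lower
to a LOCK BTS/OR on x86), the names active_count/lock_flag_bit are taken from
the quoted snippet, and the bit position 31 is arbitrary.

    // Sketch only: not the actual Boost.Thread implementation.
    inline bool bit_test_and_set(unsigned long* word, unsigned long bit)
    {
        unsigned long const mask = 1ul << bit;
        // atomic read-modify-write; returns the previous state of the bit
        return (__sync_fetch_and_or(word, mask) & mask) != 0;
    }

    unsigned long active_count = 0;
    unsigned long const lock_flag_bit = 31; // illustrative choice

    bool try_lock()
    {
        // succeeds only if the lock flag was previously clear
        return !bit_test_and_set(&active_count, lock_flag_bit);
    }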
>
> I've checked in the BTS-based code, and would be grateful if you could
> have a look.
Ummm... I'm not sure where I'm supposed to check out the latest version from...
I've tried this one:
http://boost.cvs.sourceforge.net/boost/boost/boost/thread/
but it seems to be the wrong place...
>
>> Well, this complicates the situation a bit.
>>
>> This version (1):
>> return !bit_test_and_set(&active_count,lock_flag_bit);
>> has the very nice property that it requests the cache line in the
>> modified state right away.
>>
>> And this version (2):
>> if (active_count & lock_flag_bit)
>>     return false;
>> return !bit_test_and_set(&active_count,lock_flag_bit);
>> first requests the cache line in the shared state, and only after that in
>> the modified state.
>>
>> If you are optimizing for try_lock() success, then version (1) is better.
>> If you are optimizing for try_lock() failure, then version (2) is better.
>> It's reasonable to optimize for success because that's the uncontended case.
>> But it's also reasonable to make a failed try_lock() as lightweight as
>> possible... I'm not sure which version is better in the end...
>>
>>
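(As an aside, this is the same tradeoff behind the classic
test-and-test-and-set spin loop. Below is a hedged sketch of a blocking
spin_lock() built on the bit_test_and_set/try_lock sketch above; the spin
policy and the PAUSE hint are my illustration, not code from the Boost tree,
and a production version would read the flag through a volatile/atomic load
rather than a plain one.)

    // Sketch only: test-and-test-and-set spin loop.
    // While the lock is held it spins on a plain read, which keeps the
    // cache line in the shared state; the interlocked BTS is issued only
    // once the flag appears clear.
    void spin_lock()
    {
        for (;;)
        {
            if (!(active_count & (1ul << lock_flag_bit))       // cheap read
                && !bit_test_and_set(&active_count, lock_flag_bit))
                return;
            __asm__ __volatile__("pause"); // x86 spin-wait hint
        }
    }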
>> I've seen a 3x scalability degradation under high load on a quad-core
>> between the following versions:
>>
>> 1:
>> return 0 == XCHG(&var, 1); // request cache-line in modified state
>>
>> 2:
>> int local = var; // request cache-line in shared state
>> return local == CAS(&var, local, local+1); // request cache-line in modified state
>>
>> My understanding is that the reason for the scalability degradation is
>> precisely the cache-coherence traffic.
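For reference, here is a self-contained sketch of the kind of micro-benchmark
that shows this. It is my reconstruction, not the original test: XCHG and CAS
are mapped to the GCC __sync_lock_test_and_set and __sync_val_compare_and_swap
builtins, and the thread and iteration counts are arbitrary.

    #include <pthread.h>
    #include <stdio.h>

    static long var = 0;
    enum { THREADS = 4, ITERS = 10000000 };

    // Version 1: a single interlocked exchange; the cache line is
    // requested in the modified state straight away.
    void* worker_xchg(void*)
    {
        for (long i = 0; i != ITERS; ++i)
            __sync_lock_test_and_set(&var, 1); // XCHG-style RMW
        return 0;
    }

    // Version 2: a plain read followed by a CAS; the line is pulled in
    // the shared state first, then upgraded to modified.
    void* worker_cas(void*)
    {
        for (long i = 0; i != ITERS; ++i)
        {
            long local = var;                  // shared-state read
            __sync_val_compare_and_swap(&var, local, local + 1);
        }
        return 0;
    }

    int main()
    {
        pthread_t th[THREADS];
        for (int i = 0; i != THREADS; ++i)
            pthread_create(&th[i], 0, worker_xchg, 0); // or worker_cas
        for (int i = 0; i != THREADS; ++i)
            pthread_join(th[i], 0);
        printf("var = %ld\n", var);
        return 0;
    }

Swapping worker_xchg for worker_cas in the pthread_create call is what lets
you compare the two versions under contention.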
>
> Yes: two accesses imply up to two cache-line transfers. If another
> CPU/core modified the value in between, the cache-line has to bounce
> to the other CPU and back.
>
> Have you got access to a quad-core or true multiprocessor machine for testing?
Yes, I occasionally have access to a quad-core Q6600 machine at home :)
It actually behaves more like a 2-core, 2-processor machine because of the
shared L2$ (each pair of cores shares one L2 cache). That makes the
situation even more interesting!
Dmitriy V'jukov