Boost :
Subject: Re: [boost] [lock-free] CDS -yet another lock-free library
From: Khiszinsky, Maxim (Maxim.Khiszinsky_at_[hidden])
Date: 2010-03-30 02:44:38
Phil Endecott wrote:
> You listed x86, amd64, ia64, sparc; Boost.Atomic currently supports (I
> think) x86, amd64, alpha, ppc and arm. Of course ia64 and sparc
> implementations for Boost.Atomic would be useful.
Yes, indeed.
>> The problem is that the compiler (in some cases) generates case-based
>> code when 'order' parameter is constant for caller:
>> store( &myAtomic, 10, memory_order_relaxed) ;
>> in this case instead of ONE assembler store instruction the compiler
>> may generate many branch instruction. It is not optimal :-(. And 99%
>> of code with atomic primitives has *constant* memory_order parameter.
> I would like to think that all modern compilers could get this right,
> at least if the right level of optimisation were enabled. Can you
> please tell us in what case you have observed this?
I observed this when I analyzed a performance decrease on x86 with MS Visual C++ 2008 (with full optimization, of course).
Code like this:
static inline atomic64_t load64( atomic64_t volatile const * pMem, memory_order order )
{
    atomic64_t v ;
    v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    if ( order == memory_order_seq_cst )
        fence( memory_order_seq_cst ) ;
    return v ;
}
and call:
atomic64_t n = load64( &myVar, memory_order_relaxed ) ;
produced asm code with a spurious compare and branch [for the if ( order == memory_order_seq_cst ) check].
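For anyone who wants to check the codegen themselves, here is a minimal self-contained repro in the same spirit (my own sketch - the enum, fence() and the names are placeholders, not the CDS sources). Compile with cl /O2 /FAs and inspect the .asm listing for the call in main():
#include <emmintrin.h>

typedef __int64 atomic64_t ;

// placeholder for the real memory_order enum
enum memory_order {
    memory_order_relaxed,
    memory_order_acquire,
    memory_order_release,
    memory_order_seq_cst
};

static inline void fence( memory_order order )
{
    if ( order == memory_order_seq_cst )
        _mm_mfence() ;      // full barrier only for sequential consistency
}

static inline atomic64_t load64( atomic64_t volatile const * pMem, memory_order order )
{
    atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    if ( order == memory_order_seq_cst )
        fence( memory_order_seq_cst ) ;
    return v ;
}

volatile atomic64_t myVar = 42 ;

int main()
{
    // a relaxed load should compile down to a single movq, with no compare/branch
    atomic64_t n = load64( &myVar, memory_order_relaxed ) ;
    return (int) n ;
}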
After that I reorganized the code using templates (the membar_xxx types are wrappers around the memory_order_xxx constants):
template <typename ORDER>
static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
    // Atomically loads 64bit value by SSE intrinsics
    atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    return v ;
}

template <>
static inline atomic64_t load64<membar_seq_cst>( atomic64_t volatile const * pMem )
{
    // Atomically loads 64bit value by SSE intrinsics
    atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    fence<membar_seq_cst>() ;
    return v ;
}
atomic64_t n = load64<membar_relaxed>( &myVar ) ;
And this implementation works well - no spurious branch instructions.
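For context, a minimal sketch of what the membar_xxx wrappers and the fence<>() helper could look like (my guess at the shape, not the actual CDS definitions). The specialization is chosen at compile time, so no comparison or branch is ever emitted:
#include <emmintrin.h>   // _mm_mfence

// Tag types wrapping the memory_order constants
struct membar_relaxed {} ;
struct membar_acquire {} ;
struct membar_release {} ;
struct membar_seq_cst {} ;

// Primary template: on x86, plain loads/stores already have acquire/release
// semantics, so no hardware fence is emitted for these orderings.
template <typename ORDER>
static inline void fence()
{}

// Sequential consistency needs a full memory barrier.
template <>
inline void fence<membar_seq_cst>()
{
    _mm_mfence() ;
}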
As a result, I reorganized the CDS library to use template atomic functions.
Note that this is not a global problem: for other calls of load64 I found that the generated code was fine. I cannot find any regularity :-(
Maybe this is an MSVC-only problem - poor optimization of SSE intrinsics.
Another example of an MSVC optimization error on x86 - see my class cds::lock::RecursiveSpinT:
void acquireLock()
{
    // TATAS algorithm
    while ( !tryAcquireLock() ) {
        while ( m_spin.template load<membar_relaxed>() ) {
            backoff() ;
            // VC++ 2008 bug: the compiler generates an infinite loop for an int64 m_spin.
            // It seems to be the result of aggressive optimization: the generated code reads
            // the value cached in a register instead of re-reading the volatile int64 atomic
            // in the loop.
            // The following compiler barrier prevents this bug:
            CDS_COMPILER_RW_BARRIER ;   // _ReadWriteBarrier() for MSVC
        }
    }
}
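For readers without the CDS sources at hand, a minimal self-contained TATAS spin lock in the same spirit might look like the following (my own illustration using raw MSVC intrinsics; tryAcquireLock, releaseLock and the member name are mine, not the RecursiveSpinT code):
#include <intrin.h>

class TatasSpinLock
{
    long volatile m_spin ;   // 0 = free, 1 = locked

public:
    TatasSpinLock() : m_spin( 0 ) {}

    bool tryAcquireLock()
    {
        // test-and-set: succeeds only if the lock was free
        return _InterlockedExchange( &m_spin, 1 ) == 0 ;
    }

    void acquireLock()
    {
        while ( !tryAcquireLock() ) {
            // spin on a plain read until the lock looks free, then retry the exchange
            while ( m_spin ) {
                _ReadWriteBarrier() ;   // compiler barrier, as in the workaround above
            }
        }
    }

    void releaseLock()
    {
        _ReadWriteBarrier() ;   // keep critical-section stores before the unlocking store
        m_spin = 0 ;            // a plain store is enough on x86 (release semantics)
    }
};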
> If it's true that compilers get this wrong then an approach like your
> suggestion should be considered. However it's not just Boost.Atomic
> that would need to be re-done but also the spec for the proposed c++0x
> features, which would be more difficult (!).
I think this is a task for the compiler developers; the template-based implementation is only a workaround.
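For comparison, the proposed C++0x interface looks roughly like this (a sketch based on the draft std::atomic API); the order really is an ordinary run-time argument, so avoiding the compare-and-branch relies entirely on inlining and constant propagation:
#include <atomic>
#include <cstdint>

std::atomic<std::int64_t> myVar( 0 ) ;

std::int64_t read_relaxed()
{
    // memory_order is passed as a normal function parameter with a default value
    return myVar.load( std::memory_order_relaxed ) ;
}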
However, I like templates :-)
Regards, Max