Subject: Re: [boost] [lock-free] CDS -yet another lock-free library
From: Khiszinsky, Maxim (Maxim.Khiszinsky_at_[hidden])
Date: 2010-03-30 02:44:38


Phil Endecott wrote:

> You listed x86, amd64, ia64, sparc; Boost.Atomic currently supports (I
> think) x86, amd64, alpha, ppc and arm. Of course ia64 and sparc
> implementations for Boost.Atomic would be useful.

Yes, indeed.

>> The problem is that the compiler (in some cases) generates case-based
>> code when 'order' parameter is constant for caller:
>> store( &myAtomic, 10, memory_order_relaxed) ;
>> in this case instead of ONE assembler store instruction the compiler
>> may generate many branch instruction. It is not optimal :-(. And 99%
>> of code with atomic primitives has *constant* memory_order parameter.

> I would like to think that all modern compilers could get this right,
> at least if the right level of optimisation were enabled. Can you
> please tell us in what case you have observed this?

I observed this when I analyzed a performance decrease on x86 with MS Visual C++ 2008 (with full optimization, of course).
Code like this:

static inline atomic64_t load64( atomic64_t volatile const * pMem, memory_order order )
{
   // Atomically loads a 64bit value by SSE intrinsics
   atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
   if ( order == memory_order_seq_cst )
      fence( memory_order_seq_cst ) ;
   return v ;
}

and call:

atomic64_t n = load64( &myVar, memory_order_relaxed ) ;

produced assembly code with a spurious compare and branch for [if ( order == memory_order_seq_cst )].
After that I reorganized the code using templates (the membar_xxx types are wrappers around the memory_order_xxx constants):

template <typename ORDER>
static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
   // Atomically loads 64bit value by SSE intrinsics
   atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
   return v ;
}
template <>
static inline atomic64_t load64<membar_seq_cst>( atomic64_t volatile const * pMem )
{
   // Atomically loads 64bit value by SSE intrinsics
   atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
   fence<membar_seq_cst>() ;
   return v ;
}

atomic64_t n = load64<membar_relaxed>( &myVar ) ;

And this implementation works well - no spurious branch instructions.
As a result, I reorganized the CDS library to use template atomic functions.
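
For reference, a minimal sketch of what the membar_xxx wrappers and the fence<> template could look like on MSVC/x86 (hypothetical and simplified; the real CDS definitions may differ):

#include <emmintrin.h>   // _mm_mfence (SSE2)

// Stand-in for the memory_order enumeration used in the snippets above
enum memory_order { memory_order_relaxed, memory_order_acquire,
                    memory_order_release, memory_order_seq_cst };

// Tag types that carry the ordering as a compile-time constant
struct membar_relaxed { static const memory_order order = memory_order_relaxed ; };
struct membar_seq_cst { static const memory_order order = memory_order_seq_cst ; };

template <typename ORDER>
inline void fence()
{
   // On x86 plain loads are acquire and plain stores are release,
   // so nothing is emitted for the weaker orderings
}

template <>
inline void fence<membar_seq_cst>()
{
   _mm_mfence() ;   // full hardware memory barrier
}

With the ordering encoded in the type, the dispatch is resolved by template specialization at compile time, so no runtime comparison can survive into the generated code.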

Note that it is not a global problem. In other calls of load64 I found that the generated code is fine. I cannot find any regularity :-(

Maybe this is an MSVC-only problem - poor optimization around SSE intrinsics.
Another example of an MSVC optimization error on x86 can be seen in my class cds::lock::RecursiveSpinT:
void acquireLock()
{
   // TATAS (test-and-test-and-set) algorithm
   while ( !tryAcquireLock() ) {
      while ( m_spin.template load<membar_relaxed>() ) {
         backoff() ;
         // VC++ 2008 bug: the compiler generates an infinite loop for an int64 m_spin.
         // It seems to be the result of aggressive optimization: the generated code reads
         // the value cached in a register instead of re-reading the volatile int64 atomic
         // in the loop. The following compiler barrier prevents this bug.
         CDS_COMPILER_RW_BARRIER ; // _ReadWriteBarrier() for MSVC
      }
   }
}
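
The tryAcquireLock() half of TATAS is just an atomic test-and-set. As a stand-alone illustration (not the CDS code), it can be written directly with MSVC intrinsics:

#include <intrin.h>

// Minimal TATAS-style lock sketch using raw MSVC intrinsics; hypothetical,
// shown only to illustrate the test-and-set step that acquireLock() relies on.
class SpinLockSketch
{
   volatile long m_spin ;   // 0 = free, 1 = held
public:
   SpinLockSketch() : m_spin( 0 ) {}

   bool tryAcquireLock()
   {
      // _InterlockedExchange is a full barrier on x86, so a successful
      // exchange also acts as the acquire fence for the critical section
      return _InterlockedExchange( &m_spin, 1 ) == 0 ;
   }

   void releaseLock()
   {
      // A plain store is enough for release on x86; the compiler barrier
      // keeps MSVC from moving it ahead of the critical section
      _ReadWriteBarrier() ;
      m_spin = 0 ;
   }
};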

> If it's true that compilers get this wrong then an approach like your
> suggestion should be considered. However it's not just Boost.Atomic
> that would need to be re-done but also the spec for the proposed c++0x
> features, which would be more difficult (!).

I think this is a task for compiler developers. The template-based implementation is only a workaround.
However, I like templates :-)
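
For comparison, the C++0x interface keeps the ordering as an ordinary runtime parameter, so eliminating the dispatch relies entirely on the optimizer seeing the constant at the call site (the sketch below uses the std::atomic spelling, not Boost.Atomic):

#include <atomic>
#include <cstdint>

std::atomic<std::int64_t> myVar ;

std::int64_t read_relaxed()
{
   // A good optimizer folds the constant order argument
   // and emits a single plain load here
   return myVar.load( std::memory_order_relaxed ) ;
}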

Regards, Max
