Subject: Re: [boost] [atomic, x86] need another pair of eyes
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2012-12-19 06:19:56
On Wed, Dec 19, 2012 at 2:36 PM, Tim Blechmann <tim_at_[hidden]> wrote:
> hi all,
> i need another pair of eyes regarding boost.atomic on x86:
> the implementation of memory barrier is merely a compiler barrier, but
> not a CPU barrier, as it is using code like:
> __asm__ __volatile__ ("" ::: "memory");
> afaict, one should use a `real' CPU barrier like "mfence" or "lock; addl
> $0,0(%%esp)". is this correct?
> apart from that, i've seen that compare_exchange is using explicit
> memory barriers before/after "cmpxchg" instructions. i somehow thought
> that cmpxchg and the 8b/16b variants implicitly issue a memory barrier,
> so the resulting code would generate multiple memory barriers.
> can someone with some insights in the x86 architecture confirm this?
I'm not claiming to be a specialist in IA32, but here's my understanding.
There are several groups of functions that Boost.Atomic uses to
implement operations. The platform_fence_before/after functions are
used to prevent the compiler from reordering the generated code across
the atomic op. They are also used to enforce hardware fences when
required by the memory order argument. The load- and store-specific
variants of these functions are used for load and store ops in a
similar way. There are also the
platform_cmpxchg32_strong/platform_cmpxchg32 functions that perform
CAS; these functions are only used by the generic CAS-based
implementations (which is not the case for x86 anymore, as I rewrote
the implementation for Windows).
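
To illustrate what "generic CAS-based implementation" means here, below
is a minimal sketch of fetch_add built from nothing but a
compare-and-swap primitive. It uses std::atomic rather than the
Boost.Atomic internals, and sketch_fetch_add is a hypothetical name
made up for this example; it is not how the library is actually
written.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Boost.Atomic function): fetch_add expressed
// purely in terms of a CAS primitive, the way a generic backend could do it.
inline std::uint32_t sketch_fetch_add(
    std::atomic<std::uint32_t>& target,
    std::uint32_t delta,
    std::memory_order order = std::memory_order_seq_cst)
{
    // The initial read can be relaxed: the CAS below validates it.
    std::uint32_t expected = target.load(std::memory_order_relaxed);

    // On failure compare_exchange_weak stores the current value into
    // `expected`, so the loop simply retries with fresh data.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         order, std::memory_order_relaxed))
    {
    }
    return expected; // value observed just before the successful update
}

// Usage: the helper returns the previous value, like std::atomic::fetch_add.
inline void sketch_fetch_add_usage()
{
    std::atomic<std::uint32_t> counter(5u);
    std::uint32_t previous = sketch_fetch_add(counter, 3u);
    assert(previous == 5u && counter.load() == 8u);
}
```
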
Now, the "before" functions need only implement write barriers to
prevent stores from traveling below the atomic op. Similarly, the
"after" functions need only implement read barriers. Of course, the
barriers are only required when the appropriate memory order is
requested by the user.
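
The before/after scheme described above can be sketched roughly as
follows. This is a portable approximation written with
std::atomic_thread_fence, not the actual Boost.Atomic code; every
sketch_* name is hypothetical. The assumption is that a release-type
order needs the barrier before the op and an acquire-type order needs
the barrier after it, with memory_order_seq_cst using a full fence on
both sides.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical predicates: which side of the atomic op needs a barrier.
constexpr bool sketch_needs_barrier_before(std::memory_order order)
{
    return order == std::memory_order_release
        || order == std::memory_order_acq_rel
        || order == std::memory_order_seq_cst;
}

constexpr bool sketch_needs_barrier_after(std::memory_order order)
{
    return order == std::memory_order_consume
        || order == std::memory_order_acquire
        || order == std::memory_order_acq_rel
        || order == std::memory_order_seq_cst;
}

// Keeps earlier accesses from moving below the atomic op.
inline void sketch_fence_before(std::memory_order order)
{
    if (sketch_needs_barrier_before(order))
        std::atomic_thread_fence(order == std::memory_order_seq_cst
                                     ? std::memory_order_seq_cst
                                     : std::memory_order_release);
}

// Keeps later accesses from moving above the atomic op.
inline void sketch_fence_after(std::memory_order order)
{
    if (sketch_needs_barrier_after(order))
        std::atomic_thread_fence(order == std::memory_order_seq_cst
                                     ? std::memory_order_seq_cst
                                     : std::memory_order_acquire);
}

// The op itself is relaxed; all ordering comes from the bracketing fences.
inline int sketch_exchange(std::atomic<int>& target, int desired,
                           std::memory_order order)
{
    sketch_fence_before(order);
    int previous = target.exchange(desired, std::memory_order_relaxed);
    sketch_fence_after(order);
    return previous;
}

// Usage: with relaxed order neither fence is issued.
inline void sketch_exchange_usage()
{
    std::atomic<int> value(1);
    int previous = sketch_exchange(value, 2, std::memory_order_relaxed);
    assert(previous == 1 && value.load() == 2);
}
```
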
On x86 the memory view is almost always synchronized (AFAIK, the only
exception is non-temporal stores, which are usually finalized with an
explicit mfence anyway), so unless the user requests
memory_order_seq_cst, a compiler barrier alone will suffice. As for
memory_order_seq_cst, it requires global sequencing, and here's the
part I'm not sure about. Lock-prefixed ops on x86 are full fences
themselves, so it looks like no special hardware fence is needed in
this case either. So unless I'm missing something, mfence could be
removed as well in this case. Could somebody confirm that?
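
The reasoning above can be summarized as a small decision function.
This is only a sketch under assumptions: the sketch_* names are
hypothetical, non-temporal stores are ignored, and read-modify-write
ops are assumed to compile to lock-prefixed (or implicitly locked)
instructions. The seq_cst plain-store case is an addition based on the
usual x86 mapping (mov followed by mfence, or xchg), not something
stated in this post.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical classification of atomic ops for this sketch.
enum class sketch_op { load, store, read_modify_write };

// What has to surround the op on x86 under the assumptions above.
enum class sketch_x86_fence { compiler_only, full_hardware };

constexpr sketch_x86_fence sketch_x86_fence_for(sketch_op op,
                                                std::memory_order order)
{
    // Read-modify-write ops are assumed to be lock-prefixed and thus full
    // fences on their own. Plain loads and stores already behave as
    // acquire/release. Only a seq_cst plain store is left needing a
    // hardware fence (or an xchg in place of the mov).
    return (op == sketch_op::store && order == std::memory_order_seq_cst)
               ? sketch_x86_fence::full_hardware
               : sketch_x86_fence::compiler_only;
}

// Usage: a seq_cst read-modify-write needs no extra mfence in this model.
inline void sketch_x86_usage()
{
    assert(sketch_x86_fence_for(sketch_op::read_modify_write,
                                std::memory_order_seq_cst)
           == sketch_x86_fence::compiler_only);
}
```
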