Subject: Re: [boost] [thread] Alternate future implementation and future islands.
From: Giovanni Piero Deretta (gpderetta_at_[hidden])
Date: 2015-03-21 14:36:19

On Sat, Mar 21, 2015 at 2:33 PM, Niall Douglas
<s_sourceforge_at_[hidden]> wrote:
> On 20 Mar 2015 at 19:32, Giovanni Piero Deretta wrote:
>> What's special about memory allocation here? Intuitively sharing futures on
>> the stack might actually be worse especially if you have multiple futures
>> being serviced by different threads.
> I think memory allocation is entirely wise for shared_future. I think
> auto allocation is wise for future, as only exactly one of those can
> exist at any time per promise.

I wasn't talking about shared future. Think about something like this,
assuming that the promise has a pointer to the future:

   future<X> a = async(...);
   future<X> b = async(...);
   ... do something which touches the stack...

a and b are served by two different threads. If sizeof(future) is less
than a cacheline, every time their corresponding threads move or
signal the promise, they will interfere with each other and with this
thread doing something. The cacheline starts bouncing around like
crazy. And remember that with non-allocating futures, promises and
futures are touched by the other end at every move, not just on
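
A minimal sketch of the layout concern, not taken from any of the
implementations under discussion. small_future, padded_future and
share_cache_line are hypothetical names, and the 64-byte line size is an
assumption that holds on typical current x86 parts. It only shows when
two neighbouring objects can land on the same cache line and how
alignment padding rules that out:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Hypothetical stand-in for a small non-allocating future: a pointer back
// to the promise plus a little state.
struct small_future {
    void* promise;
    int   state;
};

// Same payload, but aligned (and therefore padded) to a full cache line.
struct alignas(kCacheLine) padded_future {
    void* promise;
    int   state;
};

// True if the storage of x and y overlaps at least one common cache line.
template <class T>
bool share_cache_line(const T& x, const T& y) {
    auto first_line = [](const T& o) {
        return reinterpret_cast<std::uintptr_t>(&o) / kCacheLine;
    };
    auto last_line = [](const T& o) {
        return (reinterpret_cast<std::uintptr_t>(&o) + sizeof(T) - 1) / kCacheLine;
    };
    return first_line(x) <= last_line(y) && first_line(y) <= last_line(x);
}
```

Padding removes the sharing between the two futures at the price of a
full line of stack per future; it does nothing for the traffic between
each future and its own promise.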

>> > On Intel RMW is the same speed as non-atomic ops unless the cache
>> > line is Owned or Shared.
>> Yes if the thread does not own the cache line the communication cost dwarfs
>> everything else, but in the normal case of an exclusive cache line, mfence,
>> xchg, cmpxchg and friends cost 30-50 cycles and stall the CPU. Significantly
>> more than the cost of non-serialising instructions. Not something I want to
>> do in a move constructor.
> You're right and I'm wrong on this - I stated the claim above on
> empirical testing where I found no difference in the use of the LOCK
> prefix. It would appear I had an inefficiency in my testing code:
> Agner says that for Haswell:
> XADD: 5 uops, 7 latency
> LOCK XADD: 9 uops, 19 latency
> CMPXCHG: 6 uops, 8 latency
> LOCK CMPXCHG: 10 uops, 19 latency
> (Source:
> So approximately one halves the throughput and triples the latency
> with the LOCK prefix irrespective of the state of the cache line.

That's already pretty good actually. On Sandy Bridge and Ivy Bridge it was
above 20 clocks. Also the comparison shouldn't be with their non-locked
counterparts (which aren't really ever used and complete), but with
plain operations. Finally there is the hidden cost of preventing any
OoO execution which won't appear in a synthetic benchmark.
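
A rough sketch of the comparison I mean: a plain increment against an
uncontended atomic RMW. count_plain, count_locked and ns_per_op are names
made up for this example. On x86 the fetch_add would typically become a
LOCK-prefixed instruction, but that depends on the compiler. Being a
synthetic loop, it has exactly the limitation described above: there is
no surrounding work for the locked instruction to hold up.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Plain load/add/store on memory only this thread touches. volatile keeps
// the compiler from folding the loop into a single assignment.
std::uint64_t count_plain(std::uint64_t n) {
    volatile std::uint64_t c = 0;
    for (std::uint64_t i = 0; i < n; ++i) c = c + 1;
    return c;
}

// Sequentially consistent atomic RMW on an uncontended counter.
std::uint64_t count_locked(std::uint64_t n) {
    std::atomic<std::uint64_t> c{0};
    for (std::uint64_t i = 0; i < n; ++i)
        c.fetch_add(1, std::memory_order_seq_cst);
    return c.load();
}

// Average wall-clock nanoseconds per iteration of f(n); n must be non-zero.
template <class F>
double ns_per_op(F f, std::uint64_t n) {
    auto t0 = std::chrono::steady_clock::now();
    std::uint64_t r = f(n);
    auto t1 = std::chrono::steady_clock::now();
    assert(r == n);  // both counters must reach n
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(n);
}
```

Calling ns_per_op(count_plain, n) and ns_per_op(count_locked, n) with a
large n gives a per-operation figure for each; treat the ratio as a lower
bound on the real cost in code that has independent work in flight.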

> Additionally as I reported on this list maybe a year ago, first gen
> Intel TSX provides no benefits and indeed a hefty penalty over simple
> atomic RMW.

Hopefully it will get better in the future. I haven't had the chance to
try it yet, but TSX might help with implementing wait_all and wait_any,
whose setup and teardown require a large number of atomic ops.

> ARM and other CPUs provide load linked store conditional, so RMW with
> those is indeed close to penalty free if the cache line is exclusive
> to the CPU doing those ops. It's just Intel is still incapable of low
> latency lock gets, though it's enormously better than the Pentium 4.

The reason for the high cost is that RMWs have sequential consistency
semantics. On the other hand, on Intel plain loads and stores have the
desirable load_acquire and store_release semantics, so you do not need
extra membars.
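
To illustrate with standard C++ atomics (a sketch only;
publish_and_consume is a made-up name, not part of any future
implementation discussed here): a value is handed from one thread to
another using nothing but a release store and an acquire load. On x86
compilers normally emit plain MOVs for both, with no LOCK prefix and no
fence.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Publishes `value` to a second thread through a flag and returns what
// that thread observed.
int publish_and_consume(int value) {
    int data = 0;                    // plain, non-atomic payload
    std::atomic<bool> ready{false};  // publication flag
    int seen = -1;

    std::thread consumer([&] {
        // Acquire load: once this reads true, the write to data that
        // preceded the release store is guaranteed to be visible.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        seen = data;
    });

    data = value;
    // Release store: no RMW needed to publish the payload.
    ready.store(true, std::memory_order_release);

    consumer.join();
    return seen;
}
```

Replacing the store with ready.exchange(true) or with a
memory_order_seq_cst store would still be correct, but on x86 it
typically brings back an XCHG or an MFENCE, which is the cost under
discussion.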

-- gpd
