Subject: Re: [boost] [thread] Alternate future implementation and future islands.
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2015-03-23 17:55:04

On 21 Mar 2015 at 18:36, Giovanni Piero Deretta wrote:

> a and b are served by two different threads. if sizeof(future) is less
> than a cacheline, every time their corresponding threads move or
> signal the promise, they will interfere with each other and with this
> thread doing something. The cacheline starts bouncing around like
> crazy. And remember that with non allocating futures, promise and
> futures are touched by the other end at every move, not just on
> signal.

I hadn't even considered a non-allocating future which isn't aligned
to 64 byte multiples.
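
To make the alignment point concrete, here is a minimal sketch, not
taken from any actual future implementation. It assumes a 64 byte cache
line; padded_future, cache_line_of and neighbours_on_distinct_lines are
hypothetical names invented for illustration. It only shows that
over-aligning the object keeps two adjacent futures off the same line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64 byte cache lines, as on current x86 and many ARM cores.
constexpr std::size_t cache_line_size = 64;

// Hypothetical stand-in for a non-allocating future: the state lives
// inline and the paired promise writes to it from another thread.
struct alignas(cache_line_size) padded_future {
    std::atomic<int> state{0};  // 0 = not ready, 1 = ready
    int value{0};
    void* promise{nullptr};     // back-pointer, rewritten when either end moves
};

// alignas also rounds sizeof up to a multiple of the alignment, so two
// adjacent objects can never share a cache line.
static_assert(alignof(padded_future) == cache_line_size, "must be line aligned");
static_assert(sizeof(padded_future) % cache_line_size == 0, "must fill whole lines");

// Index of the cache line that holds the given address.
inline std::uintptr_t cache_line_of(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) / cache_line_size;
}

// True when two futures stored side by side land on different lines.
inline bool neighbours_on_distinct_lines() {
    padded_future pair[2];
    return cache_line_of(&pair[0]) != cache_line_of(&pair[1]);
}
```

The cost is that every future occupies at least one full cache line,
whether or not it is ever shared between threads.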

> > Additionally as I reported on this list maybe a year ago, first gen
> > Intel TSX provides no benefits and indeed a hefty penalty over simple
> > atomic RMW.
> hopefully will get better in the future. I haven't had the chance to
> try it yet, but TSX might help with implementing wait_all and wait_any
> whose setup and teardown require a large amount of atomic ops..

I posted comprehensive benchmarks on this list a year or so ago; try
5594.html. My conclusion was that first gen Intel TSX isn't ready for
real world usage: in real world code, the only benefits show up in
heavily multithreaded, 90-95% read-only scenarios. I tried a TSX based
hash table and was surprised at just how much slower the single and
dual threaded scenarios were, enough that I concluded it would be a
bad idea to have TSX turned on by default for almost all general
purpose algorithm implementations.

Where TSX does prove useful, though, is with transactional GCC, which
produces code that is nowhere near as penalised as it is on non-TSX
hardware.

> > ARM and other CPUs provide load linked store conditional, so RMW with
> > those is indeed close to penalty free if the cache line is exclusive
> > to the CPU doing those ops. It's just Intel is still incapable of low
> > latency lock gets, though it's enormously better than the Pentium 4.
> >
> The reason for the high cost is that RMWs have sequential consistency
> semantics. On the other hand, on Intel, plain loads and stores have
> desirable load_acquire and store_release semantics and you do not need
> extra membars.

Still, if Intel could do a no extra cost load linked/store
conditional, _especially_ if it could be nested to, say, two or four
levels, it would be a big win. That 19 cycle latency and pipeline
flush is expensive, as is the unnecessary cache coherency traffic when
you don't update the cache line. A two or four level nesting would
also allow the atomic update of a pair of pointers or two pairs of
pointers, which is 98% of the use case of Intel TSX anyway. That said,
if Intel TSX v2 didn't have the same overheads as a lock xchg, that
would be better still.
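
For reference, here is a rough sketch of how a pair update tends to be
done today without nested LL/SC or TSX: pack both halves into one word
and use a single compare-and-swap. It is an illustration only.
link_pair and update_both are hypothetical names, and 32 bit indices
stand in for real pointers so that the pair fits in one 64 bit word.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical pair of links held as 32 bit indices into some node pool
// rather than raw pointers, so both halves fit in a single 64 bit word.
struct link_pair {
    std::uint32_t prev;
    std::uint32_t next;
};

static_assert(sizeof(link_pair) == 8, "pair must fit in one 64 bit CAS");

// Atomically replaces {prev, next}, but only if both halves still hold
// the expected values. Returns true on success; on failure the stored
// pair is left unchanged.
inline bool update_both(std::atomic<link_pair>& links,
                        link_pair expected, link_pair desired) {
    return links.compare_exchange_strong(expected, desired,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire);
}
```

On x86-64 an 8 byte compare_exchange_strong like this normally compiles
to a lock cmpxchg, so it still pays the locked RMW cost discussed above
even when the comparison fails and nothing is updated.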


ned Productions Limited Consulting
