
Subject: Re: [boost] [thread] Alternate future implementation and future islands.
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2015-03-21 10:33:04


On 20 Mar 2015 at 19:32, Giovanni Piero Deretta wrote:

> What's special about memory allocation here? Intuitively sharing futures on
> the stack might actually be worse especially if you have multiple futures
> being serviced by different threads.

I think memory allocation is entirely wise for shared_future. I think
auto (stack) allocation is wise for future, as exactly one of those
can exist at any time per promise.
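
Roughly the distinction I mean, as an illustrative sketch only (the
names are invented and this is no real implementation):

  #include <memory>

  // future: at most one can exist per promise, so the state can live
  // inline in the future object itself and travel by move - no heap
  // allocation needed. The catch is that the promise must track where
  // the future's storage currently lives (see the move constructor
  // discussion below).
  template<class T>
  struct inline_future
  {
      bool ready = false;
      T value{};
  };

  // shared_future: many copies may observe one result, so the state
  // must be reference counted on the heap.
  template<class T>
  struct counted_shared_future
  {
      std::shared_ptr<const T> state;
  };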

> > On Intel RMW is the same speed as non-atomic ops unless the cache
> > line is Owned or Shared.
>
> Yes, if the thread does not own the cache line the communication cost
> dwarfs everything else, but in the normal case of an exclusive cache
> line, mfence, xchg, cmpxchg and friends cost 30-50 cycles and stall
> the CPU. Significantly more than the cost of non-serialising
> instructions. Not something I want to do in a move constructor.

You're right and I'm wrong on this - I based the claim above on
empirical testing in which I found no difference from use of the LOCK
prefix. It would appear I had an inefficiency in my testing code.
Agner says that for Haswell:

XADD: 5 uops, 7 cycles latency
LOCK XADD: 9 uops, 19 cycles latency

CMPXCHG: 6 uops, 8 cycles latency
LOCK CMPXCHG: 10 uops, 19 cycles latency

(Source: http://www.agner.org/optimize/instruction_tables.pdf)

So the LOCK prefix roughly halves the throughput (9 uops vs 5) and
nearly triples the latency (19 cycles vs 7) irrespective of the state
of the cache line. Additionally, as I reported on this list maybe a
year ago, first gen Intel TSX provides no benefit, and indeed a hefty
penalty, over simple atomic RMW.
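
If anyone wants to reproduce the comparison, here is a crude
single-threaded sketch (not a rigorous benchmark - the relaxed
fetch_add still compiles to LOCK XADD on x86, while the volatile
increment does not):

  #include <atomic>
  #include <chrono>
  #include <cstdio>

  int main()
  {
      using clock = std::chrono::high_resolution_clock;
      const int N = 100000000;
      std::atomic<long> a(0);
      volatile long p = 0;  // volatile so the loop isn't folded away

      auto t0 = clock::now();
      for (int i = 0; i < N; ++i)
          a.fetch_add(1, std::memory_order_relaxed);  // LOCK XADD
      auto t1 = clock::now();
      for (int i = 0; i < N; ++i)
          p = p + 1;                                  // plain ADD
      auto t2 = clock::now();

      using ns = std::chrono::nanoseconds;
      std::printf("atomic: %lld ns, plain: %lld ns\n",
          (long long)std::chrono::duration_cast<ns>(t1 - t0).count(),
          (long long)std::chrono::duration_cast<ns>(t2 - t1).count());
  }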

ARM and other CPUs provide load-linked/store-conditional, so RMW on
those is indeed close to penalty free if the cache line is exclusive
to the CPU doing the ops. It's just that Intel is still incapable of
low latency lock acquisition, though it's enormously better than the
Pentium 4 was.
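
This, incidentally, is why compare_exchange_weak exists in the
standard atomics: the weak form is allowed to fail spuriously
precisely so it can map onto a single LL/SC pair with no internal
retry loop. A minimal sketch:

  #include <atomic>

  // On ARM this compiles to roughly LDREX / compare / STREX (LDXR /
  // STXR on AArch64); spurious failure is permitted, so no retry
  // loop is needed inside the primitive itself.
  inline bool try_increment(std::atomic<int> &c)
  {
      int expected = c.load(std::memory_order_relaxed);
      return c.compare_exchange_weak(expected, expected + 1,
                                     std::memory_order_acq_rel,
                                     std::memory_order_relaxed);
  }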

All that said, I don't see a 50 cycle cost per move constructor as
being a problem at all. Compilers are also pretty good at applying
RVO (if you don't get in the way), and copy elision is permitted even
when the move has observable effects, as any move constructor using
atomics must. The total number of 50 cycle move constructors actually
executed is therefore usually minimal.
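
To illustrate (invented names, a sketch only):

  #include <atomic>

  // A future-like handle whose move constructor must atomically steal
  // the link back to the promise, so the promise always knows where
  // the future's storage currently lives.
  struct handle
  {
      std::atomic<void *> link;
      handle() : link(nullptr) {}
      handle(handle &&o) noexcept
          : link(o.link.exchange(nullptr, std::memory_order_acq_rel)) {}
  };

  handle make_handle()
  {
      handle h;
      // ... bind h to a promise here ...
      return h;  // NRVO: the ~50 cycle atomic move is usually elided
  }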
 
> [Snip]
> > > A couple of months ago I was arguing with Gor Nishanov
> > > (author of the MS resumable functions paper) that heap allocating
> > > the resumable function by default is unacceptable. And here I am
> > > arguing the other side :).
> >
> > It is unacceptable. Chris's big point in his Concurrency alternative
> > paper before WG21 is that future-promise is useless for 100k-1M
> > socket scalability because to reach that you must have much lower
> > latency than future-promise is capable of. ASIO's async_result system
> > can achieve 100k-1M scalability. Future promise (as currently
> > implemented) cannot.
> >
> > Thankfully WG21 appear to have accepted this about resumable
> > functions in so far as I am aware.
>
> I'm a big fan of Chris' proposal as well. I haven't seen any new
> papers on resumable functions; I would love to know where the
> committee is heading.

The argument between those two camps is essentially concurrency vs
parallelism. I had thought the latter had won?

My urging, by private email to those involved, is that a substantial
reconciliation needs to happen between the Concurrency TS and the
Networking TS such that they are far more tightly integrated, rather
than separate opposing paradigm islands. A future-promise as
efficient as async_result would be an excellent first step along such
a reconciliation.
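
To put the efficiency gap concretely, here is the shape of the two
delivery mechanisms in ASIO today (a sketch only - the socket is
never actually connected, this just shows the API):

  #include <boost/asio.hpp>
  #include <boost/asio/use_future.hpp>
  #include <future>
  #include <iostream>

  int main()
  {
      namespace asio = boost::asio;
      asio::io_service ios;
      asio::ip::tcp::socket sock(ios);
      char buf[1024];

      // Callback delivery: the handler is invoked directly, with no
      // shared state and no synchronisation beyond what the I/O
      // itself needs.
      sock.async_read_some(asio::buffer(buf),
          [](boost::system::error_code ec, std::size_t n)
          { if (!ec) std::cout << "read " << n << " bytes\n"; });

      // Future delivery via the use_future completion token: the
      // library must heap allocate shared state and synchronise on
      // it, which is where the extra latency comes from.
      std::future<std::size_t> f =
          sock.async_read_some(asio::buffer(buf), asio::use_future);

      ios.run();
  }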

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/


