Boost logo

Boost :

Subject: Re: [boost] [thread] Alternate future implementation and future islands.
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2015-03-20 09:03:34


On 20 Mar 2015 at 9:19, Giovanni Piero Deretta wrote:

> > > > Your future still allocates memory, and is therefore costing about
> > > > 1000 CPU cycles.
> > >
> > > 1000 clock cycles seems excessive with a good malloc implementation.
> >
> > Going to main memory due to a cache line miss costs 250 clock cycles,
> > so no it isn't. Obviously slower processors spin less cycles for a
> > cache line miss.
>
> Why would a memory allocation necessarily imply a cache miss. Eh you are
> even assuming an L3 miss, that must be a poor allocator!

LL caching is irrelevant, the problem here is the MOESI protocol
which renders caching void.

What you ideally want is unidirectional forward only cache line
movement between threads. So thread A modifies a line (M), signals
thread B who then shares that line (S) and thread A's line becomes
owned (O). If thread A nor thread B never modifies that line after
thread B has shared it, performance will be excellent, even under
very heavy load, because main memory is never involved.

If on the other hand thread A or thread B does modify a shared or
owned line, it gets more complicated. Newer architectures can be
quite good at exchanging the S and O states if exactly the right two
caches have that line (e.g. two CPUs on the same die). As soon as you
go past two caches sharing a line though, all bets are off and
everything gets synced through main memory.

All the above is why memory allocation is bad here because any shared
state can usually ends up being worst case performance for MOESI.
Modern memory allocators are very good at handling the same thread
freeing a block it itself allocated when only that thread saw that
allocation. They are poor at handling when a different thread frees,
or even has visibility of, a block allocated by another thread.

One brutally effective way of making memory allocation MOESI friendly
is forcing 64 byte alignment on all blocks. So you're right that a
custom allocator could be useful here, though the problem with shared
states is that they very easily leak visibility to more than two
threads. However because future-promise is extremely likely to have a
different thread free a block than the thread which allocated it,
you're going to have to introduce quite a bit of latency inducing
machinery in your custom allocator to handle that. Even if you get
the average low, you're going to see big latency spikes from time to
time.

One technique is to have a dedicated worker thread which never
sleeps, only spins, whose sole purpose is to free allocations made by
all the other threads. You can really get memory allocation latencies
down with that technique, but it's unacceptable on anything mobile or
consumer focused.

> I am aware of that solution My issue with that design is that it require an
> expensive rmw for every move. Do a few moves and it will quickly dwarf the
> cost of an allocation, especially considering that an OoO will happily
> overlap computation with a cache miss, while the required membar will stall
> the pipeline in current CPUs (I'm thinking of x86 of course). That might
> change in the near future though.

On Intel RMW is the same speed as non-atomic ops unless the cache
line is Owned or Shared. There is a small penalty on ARM admittedly,
but on the Cortex A15 I have here it is not much.

> > > I understand what you are aiming at, but I think that the elidability is
> > > orthogonal. Right now I'm focusing on making the actual synchronisation
> > > fast and composable in the scenario where the program has committed to
> make
> > > a computation async.
> >
> > This is fine until your compiler supports resumable functions.
>
> This is funny :). A couple of months ago I was arguing with Gor Nishanov
> (author of MS resumable functions paper), that heap allocating the
> resumable function by default is unacceptable. And here I am arguing the
> other side :).

It is unacceptable. Chris's big point in his Concurrency alternative
paper before WG21 is that future-promise is useless for 100k-1M
socket scalability because to reach that you must have much lower
latency than future-promise is capable of. ASIO's async_result system
can achieve 100k-1M scalability. Future promise (as currently
implemented) cannot.

Thankfully WG21 appear to have accepted this about resumable
functions in so far as I am aware. My hope is we can demonstrate that
future-promise is easily 1M scalable using C++ 14 constexpr and lazy
or no shared state allocation.

> OK my compromise is to not allocate while the async operation is merely
> deferred but can still be executed synchronously. Lazily convert to heap
> allocation only when the operation needs to be executed truly
> asynchronously, basically until you actually create the promise (at that
> point the cost of the async setup will provably dwarf the allocation; and
> even in this case the allocation can be skipped if we know we will sync
> before a move, then it us safe to allocate the shared state on the stack).
> This should allow to compiler to remove the abstraction completely if it
> can prove it safe. Still working on it, should have something in a few
> days.
>
> I guess I'm converging to your design.

I think the lazy conversion to shared state is a very wise design
goal. Simple single future waits shouldn't need it, but for
simplicity of implementation it's probably easier, and you're
deallocating in the same thread as allocation, so on that you're
good.

> > It provides a hook API with filter C functions which can, to a
> > limited extent, provide some custom functionality. Indeed the file
> > descriptor/HANDLE toggling is implemented that way. There is only so
> > much genericity which can be done with C.
>
> I believe my design is much simpler and flexible; then again is trying to
> solve a different and narrower problem than full scale synchronization of
> arbitrary threads.

I think compatibility with C is highly important. It is after all the
most popular systems programming language, with excellent support
from almost all other programming languages.

As AFIO is expected to come with fully featured C bindings in the
future, my need for a C compatible event/result object is obviously
unavoidable. If C had one of course we'd be good, but it does not and
POSIX did not want to add the one I proposed.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk