Subject: Re: [boost] [afio] Formal review of Boost.AFIO
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2015-08-30 16:06:36


On 30 Aug 2015 at 15:05, Agustín K-ballo Bergé wrote:

> On 8/30/2015 1:01 PM, Niall Douglas wrote:
> > I appreciate that from your perspective, it's a question of good
> > design principles, and splashing shared_ptr all over the place is not
> > considered good design. For the record, I *agree* where the overhead
> > of a shared_ptr *could* be important - an *excellent* example of that
> > case is std::future<T>, for which it is just plain stupid to use
> > memory allocation at all, and I have a non-memory-allocating
> > implementation which proves it in Boost.Outcome. But for AFIO, where
> > the cost of a shared_ptr will always be utterly irrelevant compared
> > to the operation cost, this isn't an issue.
>
> Let's get this memory allocation concern out of the way. One just can't
> have a conforming implementation of `std::future` that does not allocate
> memory. Assume that you could, by embedding the storage for the result
> (value-or-exception) inside either of the `future/promise`:

Firstly, I just wanted to say this is a really comprehensive and
well-written summary of the issues involved. One wouldn't have
thought future<T> to be such a large tapestry, but as you demonstrate
very well, it is.

I'll limit my comments on your text to what my Boost.Outcome library
does, if that's okay. I should stress before I begin that I would not
expect my non-allocating futures to be a total replacement for STL
futures, but rather a complement to them (they are in fact dependent
on STL futures because they use them internally) which might be
useful as a future quality-of-implementation optimisation if and only
if certain constraints are satisfied. My non-allocating futures are
only useful in these circumstances (a minimal storage sketch follows
the list):

1. You are not using an allocator for the T in future<T>.

2. Your type T has either a move or copy constructor or both.

3. The cost of T's move (or copy) constructor is low.

4. Your type T is neither the error_type (typically std::error_code)
nor the exception_type (typically std::exception_ptr).

5. sizeof(T) is small.

6. If you want your futures to have noexcept move constructors and
assignment, your T needs the same.

7. future.wait() very rarely blocks in your use scenario, i.e. most
of the time the future is already ready. If you are blocking, the
cost of any thread sleep will always dwarf the cost of any future.
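
For concreteness, here is a minimal sketch (my own illustrative
names, not Boost.Outcome's actual layout) of what embedded,
non-allocating storage looks like, and where constraints 2-5 above
come from: the value, error or exception lives in a union inside the
future/promise pair itself rather than in a heap-allocated shared
state.

#include <exception>      // std::exception_ptr
#include <new>            // placement new
#include <system_error>   // std::error_code
#include <utility>        // std::move

template <class T> class value_storage  // illustrative only
{
  enum class state_t { empty, value, error, excepted };
  state_t state = state_t::empty;
  union
  {
    T value;                        // constraints 2, 3 and 5
    std::error_code error;          // constraint 4: T must not be this ...
    std::exception_ptr exception;   // ... nor this
  };
public:
  value_storage() {}                // union members are constructed lazily
  ~value_storage() { clear(); }
  void set_value(T v) { new(&value) T(std::move(v)); state = state_t::value; }
  void set_error(std::error_code e) { new(&error) std::error_code(e); state = state_t::error; }
  void set_exception(std::exception_ptr e) { new(&exception) std::exception_ptr(std::move(e)); state = state_t::excepted; }
  void clear()
  {
    switch(state)
    {
    case state_t::value:    value.~T(); break;
    case state_t::error:    error.~error_code(); break;
    case state_t::excepted: exception.~exception_ptr(); break;
    default:                break;
    }
    state = state_t::empty;
  }
};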

These circumstances are common enough in low latency applications
such as those built on ASIO, and using these futures there is a big
win over STL futures. These circumstances are not common in general
purpose C++ code, where they probably deliver little benefit except
maybe a portable continuations implementation on an older STL.

All the above is in my documentation to warn people away from using
them with the wrong expectations.

> The reason `std::shared_future` cannot make use of embedded storage,
> thus necessarily requiring allocation, has to do with lifetime and
> thread-safety. `std::shared_future::get` returns a reference to the
> resulting value, which is guaranteed to be valid for as long as there is
> at least one instance of `std::shared_future` around. If embedded
> storage were to be used, it would imply moving the location of the
> resulting value when the instance holding it goes away. This can happen
> in a separate thread, as `std::shared_future` and
> `std::shared_future::get` are thread-safe. All in all it would lead to
> the following scenario:
>
> std::shared_future<T> s = get_a_shared_future_somehow();
> T const& r = s.get();
> std::cout << r; // potentially UB, potentially race

Boost.Outcome implements its std::shared_future equivalent by
wrapping std::shared_ptr, for all the reasons you just outlined. For
shared futures as defined by the standard it cannot be avoided,
particularly given that get() must behave a certain way.
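
Roughly, the shape is this (illustrative names only; readiness and
waiting machinery omitted):

#include <memory>

template <class T> class shared_lightweight_future  // illustrative
{
  std::shared_ptr<T> state;  // refcounted, address-stable storage
public:
  const T &get() const       // reference remains valid while any copy
  { return *state; }         // of this shared future is alive
};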

You could implement a non-consuming future without unique storage
using Boost.Outcome's framework, i.e. future.get() returns a value,
not a const lvalue ref, and you can call future.get() as many times
as you like. This is how I was planning to implement
afio::future<>::get_handle(): as the handle is a shared_ptr, its
storage moving around is not a problem (sketched below).
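
As a rough sketch of that planned shape (hypothetical names, not
AFIO's actual API, and a plain mutex/condvar standing in for the real
lightweight machinery), the point is purely the signature of get():

#include <condition_variable>
#include <mutex>
#include <utility>

template <class T> class nonconsuming_future  // hypothetical
{
  std::mutex lock;
  std::condition_variable ready_changed;
  bool ready = false;
  T value;
public:
  void set_value(T v)
  {
    { std::lock_guard<std::mutex> g(lock); value = std::move(v); ready = true; }
    ready_changed.notify_all();
  }
  T get()  // returns a copy, NOT a const T &
  {
    std::unique_lock<std::mutex> g(lock);
    ready_changed.wait(g, [this]{ return ready; });
    return value;  // safe to call repeatedly; storage may relocate
  }                // between calls without dangling anything
};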

> Such an implementation would use embedded storage under those
> partly-runtime conditions, which is quite a restricted population but
> still promising as it covers the basic `std::future<int>` scenario. But
> as it usually happens, it is a tradeoff, as such an implementation would
> have to incur synchronization overhead every time either of the
> `std::future/promise` is moved for the case where the `std::future` is
> retrieved before the value is ready, which in my experience comprises
> the majority of the use cases.

I've not found this in my synthetic benchmarks. In fact, because the
entirety of a future<T> or a promise<T> fits into a single cache line
(where sizeof(T) is small), performance under contention (which only
occurs from promise::get_future() until promise::set_value(), which
detaches the pair) is excellent.
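
That single-cache-line property can be checked at compile time. A
sketch, assuming a 64 byte cache line and a deliberately simplified
layout (not my actual one):

#include <cstddef>

template <class T> struct lightweight_future_layout  // simplified
{
  unsigned state;  // empty / pending / ready / detached
  void *other;     // link to the paired promise, if still attached
  T value;         // embedded storage (constraint 5)
};

static_assert(sizeof(lightweight_future_layout<int>) <= 64,
              "promise/future pair should fit in one cache line");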

As for real-world benchmarks, I haven't tried those yet; I'll find
out soon. It could be that they show a penalty.

> Finally, in Lenexa the SG1 decided to accept as a defect LWG2412, which
> allows for (I) and (II) to happen concurrently (previously undefined
> behavior). This appears not to have been moved forward by LWG yet. It
> represents the following scenario:
>
> std::promise<int> p;
> std::thread t([&] { p.set_value(42); });
> std::future<int> f = p.get_future();
>
> which is in reality no different than the previous scenario, but which
> an embedded storage `std::promise` implementation needs to address with
> more synchronization.

My implementation implements this defect resolution.
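
A sketch of why that resolution costs an embedded-storage
implementation extra synchronisation: get_future() and set_value()
may now run concurrently, so claiming the promise/future link has to
be atomic (illustrative only, not my actual layout):

#include <atomic>

struct pair_link  // illustrative
{
  std::atomic<bool> locked{false};  // tiny spinlock guarding the link
  void *other = nullptr;            // paired future/promise, if attached

  void lock()   { while(locked.exchange(true, std::memory_order_acquire)); }
  void unlock() { locked.store(false, std::memory_order_release); }
};

// get_future() and set_value() both take the lock before touching
// `other`, so the now-legal concurrent (I) and (II) cannot race even
// though the storage is embedded and may relocate on a move.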

> Why is this synchronization worth mention at all? Because it hurts
> concurrency. Unless you are in complete control of every piece of code
> that touches them and devise it so that no moves happen, you are going
> to see the effects of threads accessing memory of other threads with all
> what it implies. But today's `std::future` and `std::promise` are
> assumed to be cheaply movable (just a pointer swap). You could try to
> protect from it by making `std::future` and `std::promise` as long as a
> cache line, and even by simply using dynamic memory allocation for them
> together with an appropriate allocator specifically designed to aid
> whatever use case you could have where allocation time is a constraint.
>
> And finally, let's not forget that the Concurrency TS (or actually the
> futures continuation section of it) complicates matters even more. The
> addition of `.then` requires implementations to store an arbitrary
> Callable around until the future to which it was attached becomes ready.
> Arguably, this Callable has to be stored regardless of whether the
> future is already ready, but I'm checking the final wording and it
> appears that you can as-if run the continuation in the calling thread
> despite not being required (and at least discouraged in an initial
> phase).

I read the postconditions as meaning:

if(future.is_ready())
  callable(future);
else
  store_in_promise_for_later(callable);

... which is what I've implemented.

I *do* allocate memory for continuations, one malloc per continuation
added.

> Similar to the earlier allocator case, this Callable can be
> whatever so it involves type-erasure in some way or another, which will
> require memory allocation whenever it doesn't fit within a dedicated
> small buffer object.

Correct. In my case, I have sixteen bytes available, which isn't
enough for a small buffer object, hence the always-allocate. A sketch
of the overall policy follows.
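
Putting those two replies together, a sketch of the
run-now-or-store-for-later policy with the type erasure done by
std::function (one allocation per stored continuation; the member
names are hypothetical):

#include <functional>
#include <utility>

template <class Future, class Callable>
void add_continuation(Future &f, Callable &&c)  // illustrative helper
{
  if(f.is_ready())
    c(f);  // run inline in the calling thread, no allocation
  else
    // one malloc per continuation: std::function type-erases the
    // callable, allocating when it exceeds any small buffer
    f.stored_continuation = std::function<void (Future &)>(std::forward<Callable>(c));
}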

> To sum things up (based on my experience and that of others which I had
> a chance to discuss the subject), a non-allocating quasi-conformant
> `std::future/promise` implementation would cater only to a very limited
> set of types in highly constrained scenarios where synchronization
> overhead is not a concern. In real world scenarios, and especially those
> that rely heavily on futures due to the use of continuations, time is
> better spent focusing on memory allocation schemes (the actual real
> concern after all) by using the standard mechanism devised to tend to
> exactly those needs: allocators.

I concur.

> I'll be interested in hearing your findings during your work on the
> subject. And would you want me to have a look at your implementation and
> come up with ways to "break it" (which is what I do best), you have just
> to contact me.

A number of people have complained by private email, asking why I
don't ship lightweight futures in the next few weeks as "they look
ready". You've just summarised perfectly why not.

In my case, my gut instinct is that these lightweight futures will be
a major boon to the AFIO engine. My first step is a direct
replacement of all the STL futures with lightweight futures, touching
nothing else; my gut instinct is that I'll gain maybe 5% on maximum
dispatch.

But equally I might see a regression, and it could even turn out to
be a terrible idea, in which case I'll return to the drawing board.
Until the numbers are in, I won't decide one way or another.

BTW, I welcome any help in breaking them once they are ready for
that, which I currently expect will be early 2016, assuming I don't
find they are a terrible idea; I need to take a few months off after
the pace of the last seven months. My thanks in advance for the offer
of help - I'll definitely take you up on it when the time comes.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/


