Subject: Re: [boost] Futures (was: Re: [compute] Some remarks)
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2015-01-12 10:30:29
On 7 Jan 2015 at 12:40, Thomas Heller wrote:
> > What is missing on POSIX is a portable universal kernel wait object
> > used by everything in the system. It is correct to claim you can
> > easily roll your own with a condition variable and an atomic, the
> > problem comes in when one library (e.g. OpenCL) has one kernel wait
> > object and another library has a slightly different one, and the two
> > cannot be readily composed into a single wait_for_all() or
> > wait_for_any() which accepts all wait object types, including
> > non-kernel wait object types.
> Exactly, this could be easily achieved by defining an appropriate API for the
> shared state of asynchronous operations, the wait functions would then just
> use the async result objects, which in turn use the wait functionality as
> implemented in the shared state.
You still seem to be assuming the existence of a shared state in wait
I suppose it depends on how you define a shared state, but for that
non-allocating design of mine the (a better name) "notification
target" is the promise if get_future() has never been called, and the
future if get_future() has ever been called. The notification target
is kept by an atomic pointer: if it is set, it points at a future
somewhere; if it is null, then either the promise is broken or the
target is the promise.
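
To make the "notification target" idea concrete, here is a rough,
single-threaded sketch. It is not the actual design: toy_promise,
toy_future and attach() are hypothetical names invented for this
illustration, attach() merely stands in for get_future(), and the
broken-promise case and all real synchronisation are left out.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical toy types, for illustration only.
struct toy_future {
  bool has_value = false;
  int value = 0;
};

struct toy_promise {
  // Non-null: a future somewhere is the notification target.
  // Null: the promise itself is the target (broken promises not modelled).
  std::atomic<toy_future*> target{nullptr};
  bool has_value = false;  // only used while the promise is the target
  int value = 0;

  void set_value(int v) {
    if (toy_future* f = target.load(std::memory_order_acquire)) {
      f->value = v;        // deliver straight into the future
      f->has_value = true;
    } else {
      value = v;           // no future yet: keep the value in the promise
      has_value = true;
    }
  }
};

// Stand-in for get_future(): from here on the future is the target.
inline void attach(toy_promise& p, toy_future& f) {
  if (p.has_value) {       // a value set earlier moves across to the future
    f.value = p.value;
    f.has_value = true;
  }
  p.target.store(&f, std::memory_order_release);
}
```

Note there is no separately allocated shared state anywhere in the
sketch: the value lives in either the promise or the future.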
> A portable, universal kernel wait object is
> not really necessary for that.
I think a portable, universal C API kernel wait object is very
necessary if C++ is to style itself as a first tier systems
We keep trivialising C compatibility, and we should not.
> Not everyone wants to pay for the cost of a
> kernel transition.
You appear to assume a kernel transition is required.
My POSIX permit object can spin on a CAS lock up to a certain limit
before even considering acquiring a kernel wait object at all, which I
might add preferentially comes from a user side recycle list where
possible. So if the wait period is very short, no kernel transition
is required. Indeed, you don't even call malloc.
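
As a minimal sketch of the spin-then-block idea only (not the permit
object itself): spin_then_block_permit, its members and the default
spin limit are all hypothetical, and a std::condition_variable stands
in for the recycled kernel wait object, so the recycle list and the
fairness logic are not modelled.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical illustration; names and the spin limit are assumptions.
class spin_then_block_permit {
  std::atomic<bool> granted_{false};
  std::mutex m_;
  std::condition_variable cv_;

  // Try to consume the permit with a single CAS.
  bool try_consume() {
    bool expected = true;
    return granted_.compare_exchange_strong(expected, false,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed);
  }

public:
  // Returns true if the permit was consumed while spinning, i.e. the
  // blocking path was never entered; false if we had to fall back to it.
  bool wait(unsigned spin_limit = 1000) {
    for (unsigned i = 0; i < spin_limit; ++i)
      if (try_consume())
        return true;
    // Spin budget exhausted: only now touch the blocking primitive.
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return try_consume(); });
    return false;
  }

  void grant() {
    {
      // Store under the mutex so a waiter cannot miss the notification.
      std::lock_guard<std::mutex> lock(m_);
      granted_.store(true, std::memory_order_release);
    }
    cv_.notify_one();
  }
};
```

If grant() has already happened by the time wait() is called, the
first CAS succeeds and neither the mutex nor the condition variable is
touched by the waiter.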
That said, its design is highly limited to doing what it does because
it has to make hard coded conservative assumptions about its
surrounding environment. It can't support coroutines for example, and
the fairness implementation does make it quite slow compared to a CAS
lock because it can't know if fairness is important or not, so it
must assume it is. Still, this is a price you need to pay if you want
a C API which cannot take template specialisations.
> This is an implementation detail of a specific future
> island, IMHO. Aside from that, i don't want to limit myself to POSIX.
My POSIX permit object also works perfectly on Windows using the
Windows condition variable API. And on Boost.Thread incidentally, I
patch in the Boost.Thread condition_variable implementation. That
gains me the thread cancellation emulation support in Boost.Thread
and makes the boost::permit<> class fairly trivial to implement.
> > > Ok. Hands down: What's the associated overhead you are talking
> > > about? Do you have exact numbers?
> > I gave you exact numbers: a 13% overhead for a SHA256 round.
> To quote your earlier mail:
> "The best I could get it to is 17 cycles a byte, with the scheduling
> (mostly future setup and teardown) consuming 2 cycles a byte, or a
> 13% overhead which I feel is unacceptable."
> So which of these "mostly future setup and teardown" is related to exception
> handling? Please read http://www.open-std.org/Jtc1/sc22/wg21/docs/TR18015.pdf
> from page 32 onwards.
> I was under the impression that we left the "exceptions are slow" discussion
> way behind us :/
I didn't claim that. I claimed that the compiler can't optimise out
the generation of exception handling boilerplate in the present
design of futures, and I personally find that unfortunate. The CPU
will end up skipping over most of the generated opcodes, and without
much overhead if it has a branch predictor, but it is still an
unfortunate outcome when futures could be capable of noexcept. Then
the compiler could generate just a few opcodes in an ideal case when
compiling a use of a future.
With regard to the 13% overhead above, almost all of that overhead
was the mandatory malloc/free cycle in present future
> > 1. Release BindLib based AFIO to stable branch (ETA: end of January).
> > 2. Get BindLib up to Boost quality, and submit for Boost review (ETA:
> > March/April).
> Just a minor very unrelated remark. I find the name "BindLib" very confusing.
The library is a toolkit for locally binding libraries into
namespaces :). It means that library A can be strongly bound to vX of
library B, while library C can be strongly bound to vY of library B,
all in the same translation unit. This was hard to do in C++ until
C++11, and it's still a non-trivial effort, though BindLib takes away
a lot of the manual labour.
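
A toy illustration of the binding idea, using nothing but plain
namespace aliases. This is not BindLib's actual mechanism or API, and
every name in it (lib_a, lib_c, lib_b_v1, lib_b_v2, version(),
describe()) is made up for the example.

```cpp
#include <cassert>
#include <string>

// Two versions of a hypothetical library B, each in its own namespace.
namespace lib_b_v1 { inline std::string version() { return "B v1"; } }
namespace lib_b_v2 { inline std::string version() { return "B v2"; } }

// Library A locally binds the name "b" to version 1 of B.
namespace lib_a {
  namespace b = lib_b_v1;
  inline std::string describe() { return "A uses " + b::version(); }
}

// Library C locally binds the same name "b" to version 2 of B.
namespace lib_c {
  namespace b = lib_b_v2;
  inline std::string describe() { return "C uses " + b::version(); }
}
```

Both bindings coexist in one translation unit because each alias is
only visible inside its own enclosing namespace.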
> > I might add that BindLib lets the library end user choose what kind
> > of future the external API of the library uses. Indeed BindLib based
> > AFIO lets you choose between std::future and boost::future, and
> > moreover you can use both configurations of AFIO in the same
> > translation unit and it "just works". I could very easily - almost
> > trivially - add support for a hpx::future in there, though AFIO by
> > design needs kernel threads because it's the only way of generating
> > parallelism in non-microkernel operating system kernels (indeed, the
> > whole point of AFIO is to abstract that detail away for end users).
> *shiver* I wouldn't want to maintain such a library. This sounds very
> dangerous and limiting. Note that both boost::future and hpx::future are far
> more capable than the current std::future with different performance
A lot of people expressed that opinion before I started BindLib -
they said the result would be unmaintainable, unstable and by
implication, the whole idea was unwise.
I thought they were wrong, and now I know they are wrong. Future
implementations, indeed entire threading implementations, are quite
substitutable for one another when they share a common API, and can
even coexist in the same translation unit surprisingly well. One of
the unit tests for AFIO compiles a monster executable consisting of
five separate builds of the full test suite each with a differing
threading, filesystem and networking library, all compiled in a
single repeatedly reincluded all header translation unit. It takes
some minutes for the compiler to generate a binary. The unit tests,
effectively looped five times but with totally different underlying
library dependency implementations, all pass all green.
You might think it took me a herculean effort to implement that. It
actually took me about fifteen hours. People underestimate how
substitutable STL threading implementations are. If your code can
already accept any of the Dinkumware, SGI or Apple STL implementations,
it's a very small additional step past that.
-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/