Subject: Re: [boost] Synchronization (RE: [compute] review)
From: Kyle Lutz (kyle.r.lutz_at_[hidden])
Date: 2014-12-29 16:51:10

On Mon, Dec 29, 2014 at 1:19 PM, Thomas M <firespot71_at_[hidden]> wrote:
> On 29/12/2014 04:40, Gruenke,Matt wrote:
>> -----Original Message-----
>> From: Boost [mailto:boost-bounces_at_[hidden]] On Behalf Of Kyle Lutz
>> Sent: Sunday, December 28, 2014 21:24
>> To: boost_at_[hidden] List
>> Subject: Re: [boost] Synchronization (RE: [compute] review)
>>> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
>>>> Why block when only the source is on the host? Are you worried it might
>>>> go out of scope?
>>>> If so, that's actually not a bad point. I was just pondering how to
>>>> write exception-safe
>>>> code using local host datastructures. I guess blocking on all
>>>> operations involving them
>>>> is a simple way to ensure nothing is read or written after it's out of
>>>> scope. Not the
>>>> only way that comes to mind (nor the most efficient), but it does the
>>>> job.
>>> Yes, that is one of the major motivations. Avoiding potential
>>> race-conditions with host
>>> code accessing the memory at the same time is another. I'd be very open
>>> to other solutions.
> I find it truly confusing that an algorithm can run either synchronously or
> asynchronously without its signature clearly and loudly indicating so. In
> template code (or in general) it can easily be unknown (or non-trivial to
> find out) whether the input/output ranges refer to the host or the device,
> and thus whether the algorithm will execute in synchronous or asynchronous
> mode - and what that implies for the rest of the code around the algorithm.
> If I understood the behavior of transform correctly (and assuming that for
> device_vector the input/output ranges count as device side [?]), am I
> correct that the following can easily fail?:
> compute::command_queue queue1(context, device);
> compute::command_queue queue2(context, device);
> compute::vector<float> device_vector(n, context);
> // copy some data to device_vector
> // use queue1
> boost::compute::transform(device_vector.begin(), device_vector.end(),
>                           device_vector.begin(),
>                           compute::sqrt<float>(),
>                           queue1);
> // use queue2
> compute::copy(device_vector.begin(), device_vector.end(),
>               some_host_vector.begin(), queue2);
> And currently the way to make this behave properly would be to force queue1
> to wait for completion of any enqueued job (note: it may be an out-of-order
> queue!) after transform has been called?

Well, this is essentially equivalent to having two separate
host threads both reading and writing the same region of memory
at the same time; of course you need to synchronize them.

For this specific case you could just enqueue a barrier to ensure
queue2 doesn't begin its operation before queue1 completes:

// before calling copy() on queue2:
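A minimal sketch of what that barrier could look like (assuming Boost.Compute's
`enqueue_marker()`/`enqueue_barrier()` wrappers; the wait-list form requires
OpenCL 1.2):

```cpp
// Sketch, not a definitive recipe: make queue2 wait for queue1.

// mark the end of the work already enqueued on queue1 ...
compute::event marker = queue1.enqueue_marker();

// ... and make queue2 wait for that marker before running anything further
queue2.enqueue_barrier(compute::wait_list(marker));

// the copy() subsequently enqueued on queue2 cannot start before the
// transform() on queue1 has completed
```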

> One way could be to always treat algorithms as asynchronous at the API
> level (even if internally they may run synchronously) and always associate
> them with an event. Another is providing synchronous and asynchronous
> overloads. I'd certainly prefer to know whether it runs synchronously or
> asynchronously just by looking at the transform invocation itself.

Well, let me make this clearer: transform() always runs
asynchronously. The only algorithm you really have to worry about is
copy(), as it is responsible for moving data between the host and
device and will do this synchronously. If you want an asynchronous
copy, use copy_async(), which returns a future that can be used
to wait for the copy operation to complete.
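For instance, an asynchronous device-to-host copy could look like this (a
sketch; it assumes an existing context, a `queue`, and a filled
`device_vector`):

```cpp
// Sketch: overlapping host work with a device-to-host transfer.
std::vector<float> host_vector(device_vector.size());

// copy_async() returns immediately with a future tracking the transfer
auto f = compute::copy_async(device_vector.begin(), device_vector.end(),
                             host_vector.begin(), queue);

// ... do other host work here while the transfer is in flight ...

f.wait(); // host_vector is safe to read only after this returns
```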

> With respect to exception safety, is there any proper behavior defined by
> your library if transform has been enqueued to run in asynchronous mode, but
> before it has completed device_vector goes out of scope (e.g. due to an
> exception thrown in the host code following the transform)? Or is it the
> user's responsibility to ensure that, whatever happens, device_vector must
> live until the transform has completed?

The user must ensure that the memory being written to remains valid
until the operation completes. Simply imagine you are calling
std::transform() on a std::vector<> from a separate std::thread: you
must wait for that thread to complete its work before destroying the
memory it is writing to. Operations on the compute device can be
reasoned about in a similar manner.
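The same contract can be shown with plain standard C++, no Boost.Compute
involved (the `square_on_thread` helper is an illustration, not library API):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Squares every element on a separate thread. The vector must stay
// alive and untouched until join() returns -- the same lifetime
// guarantee Boost.Compute requires for in-flight async operations.
std::vector<int> square_on_thread(std::vector<int> data)
{
    std::thread worker([&data] {
        std::transform(data.begin(), data.end(), data.begin(),
                       [](int x) { return x * x; });
    });
    worker.join(); // block before `data` can be read or destroyed
    return data;
}
```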

>> I have some rough ideas, but they'd probably have a deeper impact on your
>> API than you want, at this stage.
>> Instead, I'm thinking mostly about how to make exception-safe use of the
>> async copy commands to/from host memory. I think async copies will quickly
>> gain popularity with advanced users, and will probably be one of the top
>> optimization tips.
>> I guess it'd be nice to have a scope guard that blocks on
>> boost::compute::event.
> Here's another sketch, also considering the points above - though I
> obviously don't know if it's doable given the implementation and other
> design considerations I might be missing, so apologies if it's nonsense.
> If input/output ranges generally refer to iterators from the boost::compute
> library, then:
> -) an iterator can store the container (or other data structure it belongs
> to, if any)
> -) an algorithm can, via the iterators, "communicate" with the container(s)
> For an input operation the data must be available, and unaltered, from the
> time the input operation is enqueued until its completion. So when
> transform (as an example) is launched, it can inform the input data
> container that any subsequent modification of it (including destruction or
> setting new values through iterators) must wait until that input operation
> has completed - i.e. the first modifying operation blocks until it has
> finished. Similarly for the output range, except that there any read
> operation must also block until the output data from the transform has been
> written to it. So:
> -) no matter what causes the destruction of containers (e.g. end of block
> reached normally, an exception, etc.), the lifetime of the
> container/iterators extends until the asynchronous operation on it has
> finished; thus exceptions thrown are implicitly handled.
> -) to the user the code appears synchronous with respect to visible
> behavior, but can run asynchronously in the background.
> Obviously a full-fledged version is neither trivial nor cheap with respect
> to performance (e.g. checking any reads/writes to containers if it must
> block), let alone threading aspects. But maybe just parts of it are useful,
> e.g. deferring container destruction until no OpenCL operation is enqueued
> to work on the container (-> handling exceptions).
> I think there's a wide range for balances between what the implementation
> does automagically and what constraints are placed on the user to not do
> "stupid" things.

While these are interesting ideas, I feel like this sort of
behavior is more high-level/advanced than what the Boost.Compute
algorithms are meant to do. I have tried to mimic as closely as
possible the "iterators and algorithms" paradigm from the STL, as I
consider the design quite elegant.

However, these sorts of techniques could definitely be implemented on
top of Boost.Compute. I would be very interested to see a
proof-of-concept demonstrating these ideas; would you be interested in
working on this?
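The scope-guard idea from the quoted message could be prototyped on top of
the existing API along these lines (a sketch; the `wait_guard` name and the
std::function-based design are illustrative, not part of Boost.Compute):

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard: runs a wait callable on destruction, e.g.
// [&]{ event.wait(); } for a boost::compute::event, so stack unwinding
// from an exception still blocks until the async operation finishes.
class wait_guard
{
public:
    explicit wait_guard(std::function<void()> wait)
        : wait_(std::move(wait)) {}

    ~wait_guard()
    {
        if (wait_) wait_(); // runs even during exception unwinding
    }

    // non-copyable: exactly one owner waits
    wait_guard(const wait_guard&) = delete;
    wait_guard& operator=(const wait_guard&) = delete;

private:
    std::function<void()> wait_;
};
```

Usage would be something like `wait_guard g([&]{ ev.wait(); });` immediately
after enqueuing an async copy into a local buffer, so the buffer cannot be
destroyed while the copy is still in flight.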


Boost list run by bdawes at, gregod at, cpdaniel at, john at