Subject: Re: [boost] Synchronization (RE: [compute] review)
From: Thomas M (firespot71_at_[hidden])
Date: 2014-12-29 16:19:38

On 29/12/2014 04:40, Gruenke,Matt wrote:
> -----Original Message-----
> From: Boost [mailto:boost-bounces_at_[hidden]] On Behalf Of Kyle Lutz
> Sent: Sunday, December 28, 2014 21:24
> To: boost_at_[hidden] List
> Subject: Re: [boost] Synchronization (RE: [compute] review)
>> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
>>> Why block when only the source is on the host? Are you worried it might go out of scope?
>>> If so, that's actually not a bad point. I was just pondering how to write exception-safe
>>> code using local host datastructures. I guess blocking on all operations involving them
>>> is a simple way to ensure nothing is read or written after it's out of scope. Not the
>>> only way that comes to mind (nor the most efficient), but it does the job.
>> Yes, that is one of the major motivations. Avoiding potential race-conditions with host
>> code accessing the memory at the same time is another. I'd be very open to other solutions.

I find it truly confusing that an algorithm can run either synchronously
or asynchronously without its signature clearly and loudly indicating so.
In template code (or in general) it can easily be unknown, or
non-trivial to find out, whether the input/output ranges refer to the
host or the device, and thus whether the algorithm will execute
synchronously or asynchronously, and what that implies for the rest of
the code around the algorithm.

If I understood the behavior of transform correctly (and assuming that
for device_vector the input/output ranges count as device side [?]), am
I correct that the following can easily fail?:

compute::command_queue queue1(context, device);
compute::command_queue queue2(context, device);

compute::vector<float> device_vector(n, context);
// copy some data to device_vector

// use queue1
boost::compute::transform(device_vector.begin(), device_vector.end(),

// use queue2
compute::copy(device_vector.begin(), device_vector.end(),
               some_host_vector.begin(), queue2);

And currently the way to make this behave properly would be to force
queue1 to wait for completion of any enqueued job (note: it may be an
out-of-order queue!) after transform has been called?

One way could be to always treat algorithms as asynchronous at the API
level (even if internally they may run synchronously) and always
associate them with an event. Another is to provide both a synchronous
and an asynchronous overload. I'd certainly prefer to know whether it
runs synchronously or asynchronously just by looking at the transform
invocation itself.
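
To make the second option concrete, here is a minimal sketch using only
the standard library. The names transform_async and transform_sync are
hypothetical (they are not Boost.Compute API), std::vector stands in for
the device container, and std::future stands in for an event tied to the
enqueued operation. The only point is that the call site itself tells
you which mode you get:

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Hypothetical asynchronous variant: returns immediately with a handle
// (std::future here, standing in for an event).
// The caller must keep `in` and `out` alive until the handle is ready.
template <class T, class F>
std::future<void> transform_async(const std::vector<T>& in,
                                  std::vector<T>& out, F f)
{
    return std::async(std::launch::async, [&in, &out, f] {
        std::transform(in.begin(), in.end(), out.begin(), f);
    });
}

// Hypothetical synchronous variant: same work, but it only returns
// once the result has been written to `out`.
template <class T, class F>
void transform_sync(const std::vector<T>& in, std::vector<T>& out, F f)
{
    transform_async(in, out, f).get();
}
```

With this split, a reader of `transform_sync(a, b, f);` knows b can be
used on the next line, while `auto ev = transform_async(a, b, f);` makes
it obvious that ev has to be waited on first.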

With respect to exception safety, is there any proper behavior defined
by your library if transform has been enqueued to run in asynchronous
mode, but before it has completed device_vector goes out of scope (e.g.
due to an exception thrown in the host code following the transform)? Or
is it the user's responsibility to ensure that, whatever happens,
device_vector must live until the transform has completed?

> I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage.
> Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users, and will probably be one of the top optimization tips.
> I guess it'd be nice to have a scope guard that blocks on boost::compute::event.
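
A scope guard of that kind could look roughly like the sketch below. It
is an assumption on my side, not existing library code: wait_guard is a
hypothetical name, it works with any type that has a wait() member, and
std::shared_future<void> is used as a stand-in for an event. A plain
thread plays the role of the asynchronous copy (C++14 for the lambda
capture).

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical guard: blocks in its destructor until the event is done.
template <class Event>
class wait_guard {
public:
    explicit wait_guard(Event e) : event_(std::move(e)) {}
    ~wait_guard() { event_.wait(); }
    wait_guard(const wait_guard&) = delete;
    wait_guard& operator=(const wait_guard&) = delete;
private:
    Event event_;
};

// Simulates an async copy from host memory that is still in flight
// when the host code throws. Returns true if the copy completed.
inline bool copy_outlives_exception()
{
    std::vector<float> device_side;  // stand-in for device memory
    std::thread worker;
    try {
        std::vector<float> host_data(1024, 1.0f);
        std::promise<void> done;
        // Declared after host_data, so it is destroyed (and waits) first.
        wait_guard<std::shared_future<void>> guard(done.get_future().share());
        worker = std::thread(
            [&host_data, &device_side, p = std::move(done)]() mutable {
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
                device_side = host_data;  // the "copy" reads host memory
                p.set_value();            // signal completion
            });
        throw std::runtime_error("host-side failure");
    } catch (const std::runtime_error&) {
        // host_data is gone here, but only after the copy has finished
    }
    worker.join();
    return device_side.size() == 1024;
}
```

The declaration order matters: the guard has to be declared after the
host buffer so that stack unwinding runs the guard's destructor before
the buffer's.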

Here's another sketch, also considering the points above, though I
obviously don't know if it's doable given the implementation and other
design considerations I might be missing, so apologies if it's nonsense.

If input/output ranges generally refer to iterators from the
boost::compute library, then:
-) an iterator can store the container (or other data structure it
belongs to, if any)
-) an algorithm can, via the iterators, "communicate" with the container(s)

For an input operation the data must remain available and unaltered
from the time the operation is enqueued until it completes. So when
transform (as an example) is launched, it can inform the input data
container that any subsequent modification of it (including
destruction or setting new values through iterators) must wait until
that input operation has completed, i.e. the first modifying operation
blocks until then. Similarly for the output range, except that there
any read operation must also block until the output data from the
transform has been written to it. So:
-) no matter what causes the destruction of containers (e.g. reaching
the end of a block normally, an exception, etc.) the lifetime of the
container/iterators extends until the asynchronous operation on it has
finished; thus thrown exceptions are implicitly handled.
-) to the user the code appears as synchronous with respect to visible
behavior, but can run as asynchronous in the background.
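
To illustrate, here is a rough standard-library sketch of the
bookkeeping. Everything in it is hypothetical: tracked_vector and
transform_tracked are made-up names, std::async stands in for enqueuing
on a command queue, and the algorithm talks to the container directly
instead of going through iterators. It is also not thread-safe on the
host side.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical container that remembers operations still using it.
template <class T>
class tracked_vector {
public:
    explicit tracked_vector(std::size_t n, T value = T()) : data_(n, value) {}
    ~tracked_vector() { wait_readers(); wait_writers(); }  // deferred destruction
    tracked_vector(const tracked_vector&) = delete;
    tracked_vector& operator=(const tracked_vector&) = delete;

    // Called by an algorithm that reads from / writes to this container.
    void attach_reader(std::shared_future<void> op) { readers_.push_back(std::move(op)); }
    void attach_writer(std::shared_future<void> op) { writers_.push_back(std::move(op)); }

    // A host read only has to wait for pending writes.
    T get(std::size_t i) { wait_writers(); return data_[i]; }
    // A host write has to wait for pending reads and writes.
    void set(std::size_t i, T v) { wait_readers(); wait_writers(); data_[i] = v; }

    std::vector<T>& raw() { return data_; }  // for the algorithm only

private:
    void wait_readers() { for (auto& op : readers_) op.wait(); readers_.clear(); }
    void wait_writers() { for (auto& op : writers_) op.wait(); writers_.clear(); }

    std::vector<T> data_;
    std::vector<std::shared_future<void>> readers_, writers_;
};

// Hypothetical algorithm: starts the work in the background and
// registers it with both containers before returning.
template <class T, class F>
void transform_tracked(tracked_vector<T>& in, tracked_vector<T>& out, F f)
{
    std::shared_future<void> op = std::async(std::launch::async, [&in, &out, f] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        std::transform(in.raw().begin(), in.raw().end(), out.raw().begin(), f);
    }).share();
    in.attach_reader(op);
    out.attach_writer(op);
}
```

A call such as `transform_tracked(in, out, f); float x = out.get(0);`
then looks synchronous to the user, because get() blocks until the
pending write has finished, and both destructors block in the same way
if an exception unwinds the scope in between.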

Obviously a full-fledged version is neither trivial nor cheap with
respect to performance (e.g. checking on every read/write to a
container whether it must block), let alone threading aspects. But
maybe just parts of it are useful, e.g. deferring container destruction
until no OpenCL operation enqueued to work on the container remains
outstanding (which would handle exceptions).
I think there's a wide range for balances between what the
implementation does automagically and what constraints are placed on the
user to not do "stupid" things.

