Subject: Re: [boost] Synchronization (RE: [compute] review)
From: Thomas M (firespot71_at_[hidden])
Date: 2014-12-29 17:58:42


On 29/12/2014 22:51, Kyle Lutz wrote:
> On Mon, Dec 29, 2014 at 1:19 PM, Thomas M <firespot71_at_[hidden]> wrote:
>> On 29/12/2014 04:40, Gruenke,Matt wrote:
>>>
>>> -----Original Message-----
>>> From: Boost [mailto:boost-bounces_at_[hidden]] On Behalf Of Kyle Lutz
>>> Sent: Sunday, December 28, 2014 21:24
>>> To: boost_at_[hidden] List
>>> Subject: Re: [boost] Synchronization (RE: [compute] review)
>>>
>>>> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
>>>
>>>
>>
>> If I understood the behavior of transform correctly (and assuming that for
>> device_vector the input/output ranges count as device side [?]), am I
>> correct that the following can easily fail?:
>>
>> compute::command_queue queue1(context, device);
>> compute::command_queue queue2(context, device);
>>
>> compute::vector<float> device_vector(n, context);
>> // copy some data to device_vector
>>
>> // use queue1
>> boost::compute::transform(device_vector.begin(), device_vector.end(),
>>                           device_vector.begin(),
>>                           compute::sqrt<float>(),
>>                           queue1);
>>
>> // use queue2
>> compute::copy(device_vector.begin(), device_vector.end(),
>>               some_host_vector.begin(), queue2);
>>
>> And currently the way to make this behave properly would be to force queue1
>> to wait for completion of any enqueued job (note: it may be an out-of-order
>> queue!) after transform has been called?
>
> Well this is essentially equivalent to having two separate
> host-threads both reading and writing from the same region of memory
> at the same time, of course you need to synchronize them.
>
> For this specific case you could just enqueue a barrier to ensure
> queue2 doesn't begin its operation before queue1 completes:
>
> // before calling copy() on queue2:
> queue2.enqueue_barrier(queue1.enqueue_marker());

Sorry, I haven't expressed myself well. Yes, of course I must synchronize
it; it's just that with the OpenCL API itself I can normally pass a
(pointer to a) cl_event when calling clEnqueue... functions, which can
subsequently be used quite flexibly to coordinate other operations (i.e.
without going to the command queue level).
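
To illustrate the event-based coordination described above, here is a minimal sketch in plain standard C++. It is an analogy only and makes no OpenCL or Boost.Compute calls: `event`, `enqueue_sqrt`, `enqueue_copy` and `sqrt_then_copy` are hypothetical names, with a `std::shared_future<void>` standing in for a cl_event and `std::async` standing in for enqueuing on two different queues.

```cpp
#include <cassert>
#include <cmath>
#include <future>
#include <vector>

// Hypothetical stand-in for an OpenCL event: it becomes ready once the
// command that produced it has finished.
using event = std::shared_future<void>;

// Models "enqueue transform on queue1": runs asynchronously and hands
// back an event that other commands can wait on.
event enqueue_sqrt(std::vector<float>& data) {
    return std::async(std::launch::async, [&data] {
        for (float& x : data) x = std::sqrt(x);
    }).share();
}

// Models "enqueue copy on queue2" with an explicit wait list of one event.
std::future<std::vector<float>> enqueue_copy(const std::vector<float>& data,
                                             event wait_for) {
    return std::async(std::launch::async, [&data, wait_for] {
        wait_for.wait();  // do not read until the producer has finished
        return data;      // the "device to host" copy
    });
}

// The dependency is expressed per operation, not per queue.
std::vector<float> sqrt_then_copy(std::vector<float> data) {
    event done = enqueue_sqrt(data);
    std::future<std::vector<float>> copied = enqueue_copy(data, done);
    return copied.get();
}
```

The point of the sketch is that the second operation names the one event it depends on, so nothing else enqueued on the first "queue" has to be waited for.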

>
>> One way could be to make algorithms simply always treated as asynchronous at
>> API level (even if internally they may run synchronous) and get always
>> associated with an event. Another is providing a synchronous and
>> asynchronous overload. I'd certainly prefer to know if it runs synchronous
>> or asynchronous just by looking at the transform invocation itself.
>
> Well let me make this more clear: transform() always runs
> asynchronously. The only algorithm you really have to worry about is
> copy() as it is responsible for moving data between the host and
> device and will do this synchronously. If you want an asynchronous
> copy then use copy_async() which will return a future that can be used
> to wait for the copy operation to complete.

Now I am really confused :) In this thread I have read, quoting:

[Gruenke, Matt]:
"My understanding, based on comments you've made to other reviewers, is
that functions like boost::compute::transform() are asynchronous when
the result is on the device, but block when the result is on the host.
This is what I'm concerned about. Is it true?"

[Kyle Lutz]
"Yes this is correct. In general, algorithms like transform() are
asynchronous when the input/output ranges are both on the device and
synchronous when one of the ranges is on the host."

This made me believe that some iterators in your library turn
boost::compute::transform into a synchronous operation, and some into an
asynchronous one. So I now take it that this is not the case?
Compared to the OpenCL runtime execution model, can I consider all
your algorithms, except copy, to act basically like
clEnqueueNDRangeKernel calls, that is, always asynchronous?

>>
>> Here's another sketch, also considering the points above.
>
> While these are interesting ideas, I feel like this sort of
> behavior is more high-level/advanced than what the Boost.Compute
> algorithms are meant to do. I have tried to mimic as close as possible
> the "iterators and algorithms" paradigm from the STL as I consider the
> design quite elegant.
>
> However, these sorts of techniques could definitely be implemented on
> top of Boost.Compute. I would be very interested to see a
> proof-of-concept demonstrating these ideas, would you be interested in
> working on this?

Interested yes, but time is currently a problem; I'd need to
familiarize myself much more deeply with your implementation.
At this stage my main concern is exception safety: how one could relieve
users, in a simple manner, of the need to manually ensure that
objects do not go out of scope (due to a thrown exception) while an
asynchronous OpenCL operation still needs them. Note that because your
API can throw, I consider exception effects to be of much greater concern
than with the (implicitly non-throwing) Khronos API; and just enqueuing
a couple of commands can easily make a proper, manual cleanup
non-trivial.
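
A minimal sketch of one way such a helper could look, written in plain standard C++ rather than against Boost.Compute. All names here (`wait_guard`, `enqueue_fill`, `guard_waits_on_exception`) are hypothetical and exist only for illustration; a detached thread stands in for an asynchronous OpenCL command, and a `std::shared_future<void>` stands in for its event.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Boost.Compute: collects handles to
// pending asynchronous work and waits for all of them on scope exit.
class wait_guard {
public:
    wait_guard() = default;
    wait_guard(const wait_guard&) = delete;
    wait_guard& operator=(const wait_guard&) = delete;

    void add(std::shared_future<void> pending) {
        pending_.push_back(std::move(pending));
    }

    ~wait_guard() {
        // wait() never rethrows, so the destructor stays non-throwing.
        for (auto& p : pending_) {
            if (p.valid()) p.wait();
        }
    }

private:
    std::vector<std::shared_future<void>> pending_;
};

// Models an asynchronous command that writes into `buffer` a little later.
std::shared_future<void> enqueue_fill(std::vector<int>& buffer, int value,
                                      std::atomic<bool>& finished) {
    std::packaged_task<void()> task([&buffer, value, &finished] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        for (int& x : buffer) x = value;
        finished = true;
    });
    std::shared_future<void> done = task.get_future().share();
    std::thread(std::move(task)).detach();
    return done;
}

// Returns true if the asynchronous write had completed before the buffer
// was destroyed during stack unwinding.
bool guard_waits_on_exception() {
    std::atomic<bool> finished{false};
    try {
        std::vector<int> buffer(1024, 0);
        wait_guard guard;  // declared after buffer, so destroyed before it
        guard.add(enqueue_fill(buffer, 7, finished));
        throw std::runtime_error("simulated failure after enqueueing");
    } catch (const std::runtime_error&) {
        return finished.load();
    }
    return false;
}
```

The design relies only on declaration order: because the guard is declared after the objects the pending work uses, unwinding destroys the guard first, and its destructor blocks until that work has finished.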

cheers,
Thomas

