Subject: Re: [boost] Synchronization (RE: [compute] review)
From: Kyle Lutz (kyle.r.lutz_at_[hidden])
Date: 2014-12-29 18:49:35

Next message: Gruenke,Matt: "Re: [boost] Synchronization (RE: [compute] review)"
Previous message: Kyle Lutz: "[boost] [config] Macro for detecting thread_local storage support"
In reply to: Thomas M: "Re: [boost] Synchronization (RE: [compute] review)"
Next in thread: Thomas M: "Re: [boost] Synchronization (RE: [compute] review)"
Reply: Thomas M: "Re: [boost] Synchronization (RE: [compute] review)"
Reply: Rob Stewart: "Re: [boost] Synchronization (RE: [compute] review)"

On Mon, Dec 29, 2014 at 2:58 PM, Thomas M <firespot71_at_[hidden]> wrote:
> On 29/12/2014 22:51, Kyle Lutz wrote:
>>
>> On Mon, Dec 29, 2014 at 1:19 PM, Thomas M <firespot71_at_[hidden]> wrote:
>>>
>>> On 29/12/2014 04:40, Gruenke,Matt wrote:
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Boost [mailto:boost-bounces_at_[hidden]] On Behalf Of Kyle
>>>> Lutz
>>>> Sent: Sunday, December 28, 2014 21:24
>>>> To: boost_at_[hidden] List
>>>> Subject: Re: [boost] Synchronization (RE: [compute] review)
>>>>
>>>>> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
>>>>
>>>>
>>>>
>>>
>>> If I understood the behavior of transform correctly (and assuming that for
>>> device_vector the input/output ranges count as device side [?]), am I
>>> correct that the following can easily fail?:
>>>
>>> compute::command_queue queue1(context, device);
>>> compute::command_queue queue2(context, device);
>>>
>>> compute::vector<float> device_vector(n, context);
>>> // copy some data to device_vector
>>>
>>> // use queue1
>>> boost::compute::transform(device_vector.begin(), device_vector.end(),
>>>                           device_vector.begin(),
>>>                           compute::sqrt<float>(),
>>>                           queue1);
>>>
>>> // use queue2
>>> compute::copy(device_vector.begin(), device_vector.end(),
>>>               some_host_vector.begin(), queue2);
>>>
>>> And currently the way to make this behave properly would be to force queue1
>>> to wait for completion of any enqueued job (note: it may be an out-of-order
>>> queue!) after transform has been called?
>>
>>
>> Well this is essentially equivalent to having two separate
>> host-threads both reading and writing from the same region of memory
>> at the same time, of course you need to synchronize them.
>>
>> For this specific case you could just enqueue a barrier to ensure
>> queue2 doesn't begin its operation before queue1 completes:
>>
>> // before calling copy() on queue2:
>> queue2.enqueue_barrier(queue1.enqueue_marker());
>
>
> Sorry I haven't expressed myself well. Yes, surely I must synchronize it;
> just with the OpenCL API itself I can normally provide a (pointer to an)
> cl_event when calling clEnqueue... functions, which can subsequently
> be used quite flexibly to coordinate other operations (i.e. not going to the
> command queue level).

Yes, currently this functionality is not built-in to the algorithms
API. If you want to synchronize between multiple command queues you
must currently do so explicitly.

With the enqueue_marker() method it is quite easy to get an event
object at any point in time which can be used to synchronize with the
host or with other command queues.

>>
>>> One way could be to make algorithms simply always treated as asynchronous at
>>> API level (even if internally they may run synchronous) and get always
>>> associated with an event. Another is providing a synchronous and
>>> asynchronous overload. I'd certainly prefer to know if it runs synchronous
>>> or asynchronous just by looking at the transform invocation itself.
>>
>>
>> Well let me make this more clear: transform() always runs
>> asynchronously. The only algorithm you really have to worry about is
>> copy() as it is responsible for moving data between the host and
>> device and will do this synchronously. If you want an asynchronous
>> copy then use copy_async() which will return a future that can be used
>> to wait for the copy operation to complete.
>
>
> Now I am really confused :) In this thread I have read, quoting:
>
> [Gruenke, Matt]:
> "My understanding, based on comments you've made to other reviewers, is that
> functions like boost::compute::transform() are asynchronous when the result
> is on the device, but block when the result is on the host. This is what I'm
> concerned about. Is it true?"
>
> [Kyle Lutz]
> "Yes this is correct. In general, algorithms like transform() are
> asynchronous when the input/output ranges are both on the device and
> synchronous when one of the ranges is on the host."
>
> This made me believe that some iterators in your library turn
> boost::compute::transform into a synchronous operation, and some into an
> asynchronous. So now I suppose that this does not seem to be the case ?
> In comparison to the OpenCL runtime execution model can I consider all your
> algorithms, except copy, basically acting like clEnqueueNDRangeKernel calls,
> that is always asynchronous?

Sorry, I should have prefaced that statement with "For algorithms which
accept both host and device iterators, ...". Currently there are very
few of those: copy(), sort(), and reduce() (for its output) can use host
iterators. In general, to maximize efficiency, the user should deal
almost entirely with device iterators and only synchronize with the
host at the beginning to transfer input and at the end to transfer
back the output.

And you're correct, nearly all calls to Boost.Compute algorithms
result in creating a kernel, setting its arguments, and then calling
clEnqueueNDRangeKernel() to execute it asynchronously.

>>>
>>> Here's another sketch, also considering the points above.
>>
>>
>> While these are interesting ideas, I feel like this sort of
>> behavior is more high-level/advanced than what the Boost.Compute
>> algorithms are meant to do. I have tried to mimic as closely as possible
>> the "iterators and algorithms" paradigm from the STL as I consider the
>> design quite elegant.
>>
>> However, these sorts of techniques could definitely be implemented on
>> top of Boost.Compute. I would be very interested to see a
>> proof-of-concept demonstrating these ideas, would you be interested in
>> working on this?
>
>
> Interested yes, time is currently a problem though; I'd need to familiarize
> myself much deeper with your implementation.
> At this stage my main concern is exception safety - how one could relieve
> users in a simple manner from the need to manually take care that
> objects do not get out-of-scope (due to an exception thrown) while an OpenCL
> asynchronous operation still needs them. Note that because your API can
> throw I consider exception effects to be of much greater concern than with
> the (implicitly non-throwing) Khronos API; and just enqueuing a couple of
> commands can make a proper, manual cleanup really easily non-trivial.

I have tried as much as possible to ensure that the library is
exception-safe (I surely hope you aren't assuming that I intentionally
made the library non-exception-safe). If you do find any bugs related
to exception handling please do submit them with a reproducible
test-case to the issue tracker [1] and I will get them fixed as soon
as possible. Also, in case you were unaware, you can always disable
all exceptions in Boost.Compute by defining BOOST_NO_EXCEPTIONS.

-kyle

[1] https://github.com/kylelutz/compute/issues

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk