Boost logo

Boost :

Subject: Re: [boost] [compute] Review period starts today December 15, 2014, ends on December 24, 2014
From: Kyle Lutz (kyle.r.lutz_at_[hidden])
Date: 2014-12-21 14:39:47

On Sun, Dec 21, 2014 at 3:44 AM, Thomas M <firespot71_at_[hidden]> wrote:
> On 20/12/2014 09:32, Kyle Lutz wrote:
>> Can you give examples of where the OpenCL wrapper types in
>> Boost.Compute are lacking?
> I missed any interface to cl::... (Khronos C++ wrapping API)

True. As of now you can convert between any of the types (by just
using the C++ wrapper types function call operator to access the
OpenCL C type and then passing that to the Boost.Compute types
constructor) like so:

cl::CommandQueue q1 = ...;
boost::compute::command_queue q2(q1());

I'm working on adding more direct support for the C++ wrapper types so
in the future you will also be able to do this:

cl::CommandQueue q1 = ...;
boost::compute::command_queue q2 = q1;

>> One particular issue that makes me hesitant is the lack of OpenCL 2.0
>> support in the "official" C++ bindings. The OpenCL 2.0 specification
>> was released over a year ago (November 2013). The first public OpenCL
>> 2.0 implementation was released by Intel three months ago (September
>> 2014, followed closely by an AMD implementation).
> It is correct that Khronos presently only offers a C++ API for OpenCL 1.2,
> but in practice the ways more serious, limiting constraint is that Nvidia
> only offers a 1.1 _implementation_ anyway. Any portable OpenCL code must
> restrict itself to 1.1.

Well portable code need not restrict itself, but merely provide
implementations of its functionality which will also work with only
OpenCL 1.1 (or even 1.0). For instance, calling some of the
higher-level algorithms in Boost.Compute will automatically dispatch
based on the version of OpenCL supported by the device and call more
efficient APIs if available, otherwise fall back to the older APIs.
Boost.Compute tries as much as possible to shield users from these
sorts of backwards-compatibility issues.

> We could discuss issues of the Khronos C++ bindings and your wrapper to no
> end, but my main point remains: The Khronos version is official and
> established; an alternative version intended for wider use must be clearly
> superior and give people strong reasons to migrate (I am not only referring
> to programming itself, but also people writing books on OpenCL, blogs etc.).
> In my eyes yours (that is core + utilities parts of your library) surely
> fails to meet that.
>> Could you let me know what parts of the core wrapper API fail to meet
>> your bar for quality?
> Just as one analysis example:
> For the plain C API clEnqueueReadBuffer (surely a commonly used API
> function) has this signature (we use it as reference):
> cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
> cl_mem buffer,
> cl_bool blocking_read,
> size_t offset,
> size_t size,
> void *ptr,
> cl_uint num_events_in_wait_list,
> const cl_event *event_wait_list,
> cl_event *event);
> The Khronos C++ bindings version is:
> cl_int cl::CommandQueue::enqueueReadBuffer(const Buffer& buffer,
> cl_bool blocking_read,
> ::size_t offset,
> ::size_t size,
> const void * ptr,
> const VECTOR_CLASS<Event> * events = NULL,
> Event * event = NULL);
> And your "main" version is:
> void enqueue_read_buffer(const buffer & buffer,
> size_t offset,
> size_t size,
> void * host_ptr,
> const wait_list & events = wait_list());
> The Khronos C++ version matches in arguments logic and order fully the C
> version, and also returns a cl_int as error code - they correspond to each
> other, easy to use one or the other. Your version a) misses the cl_bool for
> blocking/non-blocking, b) misses the last event argument, and c) throws (I
> suppose) instead of returning an error code. Let's go through these:
> a) blocking: If I remember my code inspection correctly your version is
> automatically always blocking, and a non-blocking version is (I guess?)
> given by enqueue_read_buffer_async, which has a different signature (return
> value) besides it's different name. So instead of being able to set a single
> cl_bool and pass it as standard argument I need to take care which function
> (name) to call, what it returns etc. Studying your library docs I find very
> little information, what makes them different etc.; specifically nowhere
> does it say that enqueue_read_buffer _is_ a blocking operation, it only says
> that it _enqueues_ a read command. Both functions then simply refer to
> clEnqueueReadBuffer() which does not help matters at all given the different
> signature.
> Now take a look at the Khronos C++ documentation and one encounters a ways,
> ways more detailed description. It's signature is already quite clear, and
> with the elaborate description it's really clear.

Yes, I've split the blocking and non-blocking memory copy operations
into separate functions. Personally, I've never been fond of APIs
which drastically change behavior based on a single boolean flag.
Also, this is more in line with the API provided by other libraries
like Boost.ASIO (e.g. boost::asio::read() vs.

I'll also definitely work on improving the documentation for these functions.

> b) why your version lacks the last event argument entirely, I have no idea.
> Anything that can be done with it in the C / C++ API (e.g. OpenCL
> runtime-level profiling; controlling an out-of-order command-queue / another
> commmand-queue) seems to be undoable.

My main motivation for returning "void" was to prevent users from
attempting to build asynchronous pipelines but accidentally using an
event object returned from a synchronous API function causing the
whole thing to become synchronous. But I see your point that there may
be legitimate uses for event objects associated with synchronous
operations. It should be relatively easy to update these APIs (return
void -> return event). I'll look into this.

> c) error handling: I'd much prefer some policy setting which specifies if an
> exception is thrown on error (the usual custom in C++) or an error code is
> returned by the function (the usual OpenCL behaviour).

This policy setting is implemented using the Boost Exception library
[1]. Users may define the BOOST_NO_EXCEPTIONS configuration macro
which will keep the library from throwing exceptions and instead call
a user-defined error handling function. This technique is used nearly
universally in Boost and I think is superior to approaches which would
change the function signature based on a policy configuration.

> reviewer note: My original review said that the documentation is well done,
> but this did not refer to the parts core + utilities of which I had always
> been critical.

I'll definitely continue to work on improving the documentation,
especially to address the points you've made above.

> Another, huge issue in practice, is reliability + stability. Not only does
> the OpenCL execution model by itself make committing errors relatively easy,
> but I guess many of us can tell stories of bugged OpenCL implementations
> having hit us deeply. When something goes wrong figuring out where/why
> exactly can be a really slow, frustrating experience. So with respect to
> this the bar for any additional layer, developed outside Khronos (who has
> the OpenCL implementers on board !) must also be set very very high.

I've worked hard to ensure that all error codes from OpenCL functions
are checked and any errors are properly propagated back to the user.
If you find any place where this error handling is missing, please let
me know.

> Let me reiterate that my main concern is not fixing the example above; my
> main point is that any alternative version must be rock-solid from ground up
> on the one hand, plus offer considerable benefits to warrant migration
> considerations.
> Incidentally, if there is no Khronos OpenCL 2.0 C++ wrapping out there yet,
> have you ever given it a thought of getting in touch with the Khronos group
> if they are interested in doing work together, merging efforts?
> Note for clarity: I am personally neither involved with the Khronos Groups,
> nor an OpenCL implementer, nor otherwise with distributing OpenCL [no book
> author etc.]; just a programmer using OpenCL.
>>> On the other hand to those rather new to OpenCL a simplified, less
>>> error-prone design would be beneficial ...
>> I have put much consideration into this and ultimately I don't feel it
>> is right for the data-copying to be made implicit and hidden away from
>> the user. ...>
>> In any case, there is a simplified, high-level API available in
>> Boost.Compute which allows kernels to directly operate on host memory.
>> See the mapped_view class [2].
> mapped_view uses a direct host-memory mapped buffer scheme, through the
> CL_MEM_USE_HOST_PTR flag (again I had to go into the implementation code to
> assert that initial suspicion - the docs don't say it, and you know other
> implementation schemes would have also been possible, e.g. it could have
> used a CL_MEM_ALLOC_HOST_PTR). CL_MEM_USE_HOST_PTR is only very
> exceptionally useful in practice.

Umm, CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR do very different
things. In Boost.Compute, the mapped_view class provides an
abstraction around the former, while the pinned_allocator class
provides an abstraction around the latter.

> However I fully agree with you that users should not be forced to use
> exclusively the one or the other. My proposal aimed at offering a variety of
> high-level abstractions from which users can choose:
> -) if an input or output to a Boost.compute algorithm call is a plain host
> side object (e.g. a std:: container), automatically copy the data to the
> OpenCL runtime (if input), run the kernel, and automatically copy the data
> from the OpenCL runtime to the host (if output).

This is actually already implemented for the copy() and sort()
algorithms. For example, this code will automatically copy the data to
the device, sort it, and copy it back to the host:

std::vector<int> vec = ...;
boost::compute::sort(vec.begin(), vec.end());

I plan on implementing this support for host iterators more widely in
the API, just haven't had the time.

> -) if an input or output to an algorithm call is a plain OpenCL object (e.g.
> cl_mem or some C++ class wrapping it) the user will need to take care of any
> copying to host memory.

I'm not sure I understand this completely (maybe could you provide
some example code?). Currently, like in the STL, the algorithms accept
iterators rather than containers/memory objects. This is the same
approach taken in Boost.Compute. However, it's easy to wrap an
arbitrary OpenCL buffer with an iterator by using the buffer_iterator
class [2]:

// opencl memory buffer
cl_mem mem = ...;

// fill mem with zeros
boost::compute::fill_n(make_buffer_iterator<int>(mem, 0), memory_size
/ sizeof(int), 0, queue);

> -) if an input or output to an algorithm call is something which links a
> plain host object (e.g. std:: container) to an OpenCL object (e.g. cl_mem) -
> I have sketched such a class in my original post - ensure that upon any
> access to either the data become automatically synchronized.
> One can imagine additional / mixed schemes as well, temporarily disable
> default behaviour etc.

This is synchronized container concept is an interesting idea. If you
have the time/desire to draw up a working proof-of-concept I'd be very
interested in getting it integrated into Boost.Compute.

>> These are definitely ideas I've thought about and these kinds of tools
>> could all be built upon the current API...
> ...
>> See my response above, I'm not a huge fan of these automatic data
>> transfer abstractions. However, I could be in favor of offering this
>> functionality in a higher-level API.
> My recommendation and reviewer comment is: Your library shall offer
> high(er)-level abstractions of things NOT already offered by Khronos, by
> building on top of the Khronos APIs (incl. their C++ wrapper). Your STL-ish
> algorithms and their supporting functionality are one aspect here; the stuff
> just discussed another example: those who want low-level control already
> have it and can use it (through the Khronos APIs), while those who desire an
> "auto-synchronization" between host and OpenCL runtime presently do NOT have
> anything. Here the real benefit would come in.
> 2 other examples that come to mind (real-world applications from my work):
> -) smoothly support utilizing multiple devices (e.g. auto-splitting kernel
> workload across devices and data synchronization among devices; details will
> depend quite whether devices do or do not belong to the same context).

This is something I have had on the road-map for a long while. I agree
that building the infrastructure for these device-distributed
algorithms would be very useful.

> -) specifying "operations-sequences", like: "Copy input data A to OpenCL,
> copy input data B, run kernel 1, 2, an intermediate result decides if next
> run kernel 3 or 4, and then copy data C to host". Such pre-specified
> sequences implicitly also help to improve performance (e.g. only the last
> command must be non-blocking).

This is also something I have played around with. Basically I'd like
to have any API which allows users to define "pipelines" or
"task-graphs" which hook up several different
kernels/algorithms/memory-copies and produce an efficient set of
operations to stream data through and extract the results.

Any ideas you have on a potential API you'd like to see for this would
be great. There is potentially some prior art in the C++ pipelines
proposal [3] which may be interesting.

>>> 3) for float/double input containers compute::accumulate falls back to a
>>> plain serial reduction.
>> Because floating-point addition is not associative. Doing this would
>> lead to accumulate() producing different results on the device versus
>> the host.
> This argument is not clear to me. Floating-point results must, generally
> speaking, always be expected to deviate slightly between a plain C++ and
> OpenCL execution (subject to the specific device hardware + driver version +
> compilation settings used; for pretty much any kernel). I'd say that when
> someone defers any floating-point calculation to OpenCL then this is
> implicitly acknowledged, at the benefit of a parallelized execution.

Strongly disagree, the floating-point operations on the device are
well defined and their output should be identical to the host results
(barring optimizations like "-cl-fast-relaxed-math"). More concretely,
this test should always pass for any set of input data:

float data[] = { 1.1f, 2.3f, 3.4f, 4.5f };
std::vector<float> host_vec(data, data + 4);
boost::compute::vector<float> device_vec(data, data + 4, queue);

    std::accumulate(host_vec.begin(), host_vec.end(), 0),
    boost::compute::accumulate(device_vec.begin(), device_vec.end(), 0, queue)

I don't feel that sacrificing precision is always implicitly
acknowledged and prefer to leave the choice of precision vs.
performance up to the user.

>> True, the run-time compilation model provided by OpenCL does have some
>> associated overhead. There a few techniques in Boost.Compute which
>> help mitigate this ... The other is support for offline-caching of program
>> binaries. This is enabled by defining
>> "BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache
>> binaries for programs so that they are only compiled once the very
>> first time they are run on the system.
> Has this been well-tested, including scenarios such as OpenCL driver version
> changes between runs and how it interacts with Nvidia's own auto-caching
> mechanism (which can drive one nuts when it fails to detect changes and
> happily uses an outdated binary).

I am very confident that the caching system works and it will properly
deal with changing driver versions. It was added nearly a year ago and
I haven't encountered any problems with it or received any bug

And it essentially duplicates the functionality of NVIDIA's offline
caching functionality but makes it available to all OpenCL platforms.
I haven't had any ill effects using both together on the same system.

>>> I would find it very useful if smart algorithms dispatch the algorithm to
>>> a
>>> plain C++ algorithm if it's really predictable that a GPU execution will
>>> just waste time.
>> I disagree, I think the call on whether to execute the algorithm on
>> the host or on an accelerator (with its associated overhead) should be
>> left up to the user (or potentially a higher-level API built on top of
>> Boost.Compute)...While I agree that this would a useful feature, I just
>> don't think Boost.Compute is the right place for that logic to live.
> It is meant as another example of a higher-level abstraction that can be
> done; in my opinion Boost.Compute would be the perfect place to put it in.
> There is no need to implement a number of the suggested higher-level
> abstractions right away (or for some: ever), it shall help finding the right
> niche for your library (now and in the long run). As such the
> recommendations / critics raised here are by no means intended to be
> destructive, I just bring in my view how I see your library best serving the
> community.

These are definitely some good ideas. Thanks for all the feedback!



Boost list run by bdawes at, gregod at, cpdaniel at, john at