Boost logo

Boost :

Subject: Re: [boost] [compute] Review period starts today December 15, 2014, ends on December 24, 2014
From: Thomas M (firespot71_at_[hidden])
Date: 2014-12-21 06:44:16


On 20/12/2014 09:32, Kyle Lutz wrote:

> Can you give examples of where the OpenCL wrapper types in
> Boost.Compute are lacking?

I missed any interface to cl::... (Khronos C++ wrapping API)

> One particular issue that makes me hesitant is the lack of OpenCL 2.0
> support in the "official" C++ bindings. The OpenCL 2.0 specification
> was released over a year ago (November 2013). The first public OpenCL
> 2.0 implementation was released by Intel three months ago (September
> 2014, followed closely by an AMD implementation).

It is correct that Khronos presently only offers a C++ API for OpenCL
1.2, but in practice the ways more serious, limiting constraint is that
Nvidia only offers a 1.1 _implementation_ anyway. Any portable OpenCL
code must restrict itself to 1.1.

We could discuss issues of the Khronos C++ bindings and your wrapper to
no end, but my main point remains: The Khronos version is official and
established; an alternative version intended for wider use must be
clearly superior and give people strong reasons to migrate (I am not
only referring to programming itself, but also people writing books on
OpenCL, blogs etc.). In my eyes yours (that is core + utilities parts of
your library) surely fails to meet that.

> Could you let me know what parts of the core wrapper API fail to meet
> your bar for quality?

Just as one analysis example:

For the plain C API clEnqueueReadBuffer (surely a commonly used API
function) has this signature (we use it as reference):

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_read,
                            size_t offset,
                            size_t size,
                            void *ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list,
                            cl_event *event);

The Khronos C++ bindings version is:

cl_int cl::CommandQueue::enqueueReadBuffer(const Buffer& buffer,
                                            cl_bool blocking_read,
                                            ::size_t offset,
                                            ::size_t size,
                                            const void * ptr,
                               const VECTOR_CLASS<Event> * events = NULL,
                               Event * event = NULL);

And your "main" version is:

void enqueue_read_buffer(const buffer & buffer,
                          size_t offset,
                          size_t size,
                          void * host_ptr,
                          const wait_list & events = wait_list());

The Khronos C++ version matches in arguments logic and order fully the C
version, and also returns a cl_int as error code - they correspond to
each other, easy to use one or the other. Your version a) misses the
cl_bool for blocking/non-blocking, b) misses the last event argument,
and c) throws (I suppose) instead of returning an error code. Let's go
through these:
a) blocking: If I remember my code inspection correctly your version is
automatically always blocking, and a non-blocking version is (I guess?)
given by enqueue_read_buffer_async, which has a different signature
(return value) besides it's different name. So instead of being able to
set a single cl_bool and pass it as standard argument I need to take
care which function (name) to call, what it returns etc. Studying your
library docs I find very little information, what makes them different
etc.; specifically nowhere does it say that enqueue_read_buffer _is_ a
blocking operation, it only says that it _enqueues_ a read command. Both
functions then simply refer to clEnqueueReadBuffer() which does not help
matters at all given the different signature.
Now take a look at the Khronos C++ documentation and one encounters a
ways, ways more detailed description. It's signature is already quite
clear, and with the elaborate description it's really clear.
b) why your version lacks the last event argument entirely, I have no
idea. Anything that can be done with it in the C / C++ API (e.g. OpenCL
runtime-level profiling; controlling an out-of-order command-queue /
another commmand-queue) seems to be undoable.
c) error handling: I'd much prefer some policy setting which specifies
if an exception is thrown on error (the usual custom in C++) or an error
code is returned by the function (the usual OpenCL behaviour).

reviewer note: My original review said that the documentation is well
done, but this did not refer to the parts core + utilities of which I
had always been critical.

Another, huge issue in practice, is reliability + stability. Not only
does the OpenCL execution model by itself make committing errors
relatively easy, but I guess many of us can tell stories of bugged
OpenCL implementations having hit us deeply. When something goes wrong
figuring out where/why exactly can be a really slow, frustrating
experience. So with respect to this the bar for any additional layer,
developed outside Khronos (who has the OpenCL implementers on board !)
must also be set very very high.

Let me reiterate that my main concern is not fixing the example above;
my main point is that any alternative version must be rock-solid from
ground up on the one hand, plus offer considerable benefits to warrant
migration considerations.
Incidentally, if there is no Khronos OpenCL 2.0 C++ wrapping out there
yet, have you ever given it a thought of getting in touch with the
Khronos group if they are interested in doing work together, merging
efforts?

Note for clarity: I am personally neither involved with the Khronos
Groups, nor an OpenCL implementer, nor otherwise with distributing
OpenCL [no book author etc.]; just a programmer using OpenCL.

>> On the other hand to those rather new to OpenCL a simplified, less
>> error-prone design would be beneficial ...
>
> I have put much consideration into this and ultimately I don't feel it
> is right for the data-copying to be made implicit and hidden away from
> the user. ...>
> In any case, there is a simplified, high-level API available in
> Boost.Compute which allows kernels to directly operate on host memory.
> See the mapped_view class [2].

mapped_view uses a direct host-memory mapped buffer scheme, through the
CL_MEM_USE_HOST_PTR flag (again I had to go into the implementation code
to assert that initial suspicion - the docs don't say it, and you know
other implementation schemes would have also been possible, e.g. it
could have used a CL_MEM_ALLOC_HOST_PTR). CL_MEM_USE_HOST_PTR is only
very exceptionally useful in practice.

However I fully agree with you that users should not be forced to use
exclusively the one or the other. My proposal aimed at offering a
variety of high-level abstractions from which users can choose:

-) if an input or output to a Boost.compute algorithm call is a plain
host side object (e.g. a std:: container), automatically copy the data
to the OpenCL runtime (if input), run the kernel, and automatically copy
the data from the OpenCL runtime to the host (if output).

-) if an input or output to an algorithm call is a plain OpenCL object
(e.g. cl_mem or some C++ class wrapping it) the user will need to take
care of any copying to host memory.

-) if an input or output to an algorithm call is something which links a
plain host object (e.g. std:: container) to an OpenCL object (e.g.
cl_mem) - I have sketched such a class in my original post - ensure that
upon any access to either the data become automatically synchronized.

One can imagine additional / mixed schemes as well, temporarily disable
default behaviour etc.

> These are definitely ideas I've thought about and these kinds of tools
> could all be built upon the current API...

...

> See my response above, I'm not a huge fan of these automatic data
> transfer abstractions. However, I could be in favor of offering this
> functionality in a higher-level API.

My recommendation and reviewer comment is: Your library shall offer
high(er)-level abstractions of things NOT already offered by Khronos, by
building on top of the Khronos APIs (incl. their C++ wrapper). Your
STL-ish algorithms and their supporting functionality are one aspect
here; the stuff just discussed another example: those who want low-level
control already have it and can use it (through the Khronos APIs), while
those who desire an "auto-synchronization" between host and OpenCL
runtime presently do NOT have anything. Here the real benefit would come in.

2 other examples that come to mind (real-world applications from my work):

-) smoothly support utilizing multiple devices (e.g. auto-splitting
kernel workload across devices and data synchronization among devices;
details will depend quite whether devices do or do not belong to the
same context).

-) specifying "operations-sequences", like: "Copy input data A to
OpenCL, copy input data B, run kernel 1, 2, an intermediate result
decides if next run kernel 3 or 4, and then copy data C to host". Such
pre-specified sequences implicitly also help to improve performance
(e.g. only the last command must be non-blocking).

>> 3) for float/double input containers compute::accumulate falls back to a
>> plain serial reduction.
>
> Because floating-point addition is not associative. Doing this would
> lead to accumulate() producing different results on the device versus
> the host.

This argument is not clear to me. Floating-point results must, generally
speaking, always be expected to deviate slightly between a plain C++ and
OpenCL execution (subject to the specific device hardware + driver
version + compilation settings used; for pretty much any kernel). I'd
say that when someone defers any floating-point calculation to OpenCL
then this is implicitly acknowledged, at the benefit of a parallelized
execution.

> True, the run-time compilation model provided by OpenCL does have some
> associated overhead. There a few techniques in Boost.Compute which
> help mitigate this ... The other is support for offline-caching of program
> binaries. This is enabled by defining
> "BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache
> binaries for programs so that they are only compiled once the very
> first time they are run on the system.

Has this been well-tested, including scenarios such as OpenCL driver
version changes between runs and how it interacts with Nvidia's own
auto-caching mechanism (which can drive one nuts when it fails to detect
changes and happily uses an outdated binary).

>> I would find it very useful if smart algorithms dispatch the algorithm to a
>> plain C++ algorithm if it's really predictable that a GPU execution will
>> just waste time.
>
> I disagree, I think the call on whether to execute the algorithm on
> the host or on an accelerator (with its associated overhead) should be
> left up to the user (or potentially a higher-level API built on top of
> Boost.Compute)...While I agree that this would a useful feature, I just
> don't think Boost.Compute is the right place for that logic to live.

It is meant as another example of a higher-level abstraction that can be
done; in my opinion Boost.Compute would be the perfect place to put it in.

There is no need to implement a number of the suggested higher-level
abstractions right away (or for some: ever), it shall help finding the
right niche for your library (now and in the long run). As such the
recommendations / critics raised here are by no means intended to be
destructive, I just bring in my view how I see your library best serving
the community.

cheers,
Thomas


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk