Boost logo

Boost :

Subject: Re: [boost] [compute] Review period starts today December 15, 2014, ends on December 24, 2014
From: Kyle Lutz (kyle.r.lutz_at_[hidden])
Date: 2014-12-20 03:32:39


On Fri, Dec 19, 2014 at 1:07 PM, Thomas M <firespot71_at_[hidden]> wrote:
> Dear all,
>
>> Review of the Compute library starts today
>
> Having spent the last 3 years on a larger-scale C++ project utilizing OpenCL
> as main computing engine, without doubt a C++ GPGPU library is worthwhile.
> My evaluation was based on studying the full docs, tutorial examples,
> creating own (simple) applications, and inspecting selected implementation
> details.

Thanks for the review! I've addressed your comments in-line below:

> I liked most the portable STL-like algorithms; combined with fairly
> straightforward on-the-fly specifications of kernel functions (including
> lambda expression support) GPGPU utilization becomes much more accessible
> for every-day C++ programs. The design of this library part is rather clean
> and aligns well with the C++ standard customs. I have checked a handful of
> function interfaces and they correspond to the C++ STL variants.
>
> However I have also encountered a number of issues, of which I consider most
> severe the overall library's design/aim:
>
> Khronos Group already provides (since years) a C++ bindings API itself
> (https://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.2.pdf).
> Frankly the Khronos API is not an example of a clean, modern C++, but it
> provides very fine-grained operations control (which can be crucial for
> effective GPGPU performance), is developed (althouth a bit lagged) with
> OpenCL itself, and covered in every elaborate OpenCL textbook. It is thus
> IMHO the current de-facto C++ OpenCL wrapper standard. The proposed boost
> library's parts core & utility (heavy in total!) seem to do just the same
> interface-wise, yet lack some important features e.g. detailed control flow
> (blocking / event setting), image classes, or deviate in subtle signature
> details making it really difficult to grasp which does exactly what /
> behaves differently. A boost library should not start from scratch but
> integrate with and extend the Khronos API, both at C and C++ bindings level
> (e.g. providing the STL-like algorithms to them in straightforward to use
> manners). Programmers can thus rely on established practices (personally I
> wouldn't switch away from the Khronos C++ API as main underlying workhorse!)
> yet benefit from the extended functionality provided by the boost library.

Can you give examples of where the OpenCL wrapper types in
Boost.Compute are lacking?

I am aware of the issue with image classes. While they do exist in
Boost.Compute, their APIs are just not documented yet (though there
are quite a few examples/tests which demonstrate their usage). I'll
work on this.

One particular issue that makes me hesitant is the lack of OpenCL 2.0
support in the "official" C++ bindings. The OpenCL 2.0 specification
was released over a year ago (November 2013). The first public OpenCL
2.0 implementation was released by Intel three months ago (September
2014, followed closely by an AMD implementation). Boost.Compute had
OpenCL 2.0 support implemented a week later. As of today (over a year
since the specification was released), there is still no support for
OpenCL 2.0 in the Khronos C++ OpenCL wrapper. I don't think it would
be prudent to restrict Boost.Compute to a subset of the OpenCL API
merely because of shortcomings in the "official" C++ wrapper.

Other issues include lack of support for move-semantics (as well as
various other C++11 features) and differing object sizes for wrapped
types versus the OpenCL C API types (e.g. "sizeof(cl::Device) !=
sizeof(cl_device_id)" which prevents the ability to pass arrays of
wrapped object types directly to C APIs).

Overall I think implementing our own wrapper directly based on the C
API reduces overall complexity (no workarounds for bugs/shortcomings
in the C++ wrapper layer), allows us to fix bugs and issue updates
quickly, and provides a more consistent and, IMHO, better API for
OpenCL.

Also, as they both just wrap the underlying OpenCL C API,
interoperating between Boost.Compute wrapper types and the Khronos C++
wrapper types is trivial.

If there are specific issues with the wrappers in Boost.Compute please
submit a bug report [1].

> On the other hand to those rather new to OpenCL a simplified, less
> error-prone design would be beneficial; equally this can raise the
> productivity of everyone. The current design/implementation follows the
> typical OpenCL execution model incl. some of its caveats (see tutorial
> "Transforming Data"): explicitly copying input data to the device, executing
> kernel(s), and then copying the output back to the host. Frequently however
> the whole emphasis is on the kernel invocation (run an algorithm on the
> data), rendering the copying an implementation detail that's "done because
> it must be done" but otherwise makes code longer and comprises an error
> source if forgotten.

I have put much consideration into this and ultimately I don't feel it
is right for the data-copying to be made implicit and hidden away from
the user. I prefer the API which makes the memory copying operations
and the host-device synchronization points explicit. These
"under-the-hood" copy operations can be major sources of
under-performing GPU-based applications and I don't think insulating
the programmer from them makes the situation better.

In any case, there is a simplified, high-level API available in
Boost.Compute which allows kernels to directly operate on host memory.
See the mapped_view class [2].

> I hence wonder if the following overall design would be more appropriate
> (this is by no means a request to doing it this way, I just try to bring in
> alternative perspectives):
>
> 1) build on top the Khronos C / C++ bindings API, i.e. use that as base
> instead of the own core + utilities parts
> 2) offering a high-level interface for algorithm execution that exposes
> users to as little OpenCL internals as possible while giving the algorithms
> lots of flexibility
> 3) offering a high-level interface that auto-connects STL-containers with
> OpenCL memory objects, implemented based on standard C++ / Khronos API
> classes.
> 4) offering a high-level interface that applies the algorithms directly to
> objects of the Khronos C / C++ API.
>
>
> The first point is rather obvious. To me the proposed library parts core +
> utility appear as just another C++ wrapper, and unless this is done in
> extremely (!!!) well manner (i.e. offering every functionality the Khronos
> API does, yet making it clean from scratch, aligning it with Standard C++,
> extending it by essential features etc., ensuring a rock-solid quality
> control for reliability etc.; I'd set the bar really high here) I see no
> reason to do it. If people are forced to use the proposed library core
> wrapper just to gain access to the other functionality (and there is good
> other functionality in there !) then I think there is a serious risk that a
> considerable number of people simply turn away altogether.

Well I would question the premise that the Khronos C++ wrapper is done
in an "extremely well manner" and has "rock-solid quality" ;-).

Could you let me know what parts of the core wrapper API fail to meet
your bar for quality?

> With respect to the second point I suppose something like that is doable:
>
> // BEGIN
>
> compute::gpgpu_engine gpgpuEngine; // default initialization links it to a
> default device etc.
>
> // create vector, fill with data - ordinary C++
> std::vector<float> vec(10000);
> std::generate(vec.begin(), vec.end(), rand);
>
> compute::transform(vec.begin(),
> vec.end(),
> vec.begin(),
> compute::sqrt<float>(),
> gpgpuEngine);
>
> std::cout << vec[0]; // results are already on the host
>
> // END
>
> So an instance of gpgpu_engine (or whatever name) gets setup once and can,
> if needed, become customized for its behaviour (devices used, execution
> policies etc.). This engine can then internally (hidden to the user, like a
> std::vector manages it's memory) manage buffers, and when transform now gets
> invoked it:
> -) copies the data to one of its buffers (create one if none available)
> -) run the kernel
> -) copies the data to the host side container (keep buffer for reuse later)
>
> This would bring several advantages:
> -) the code becomes very similar to ordinary C++ code -> short-handed, less
> error-prone
> -) buffers can be recycled among multiple algorithm calls (e.g. the engine
> can cache a small number of buffers immediately available for future calls)
> -) more efficient OpenCL runtime utilization (performance) because the whole
> operation sequence has been abstracted: e.g. the input copy operation can be
> enqueued in a non-blocking fashion, so the data transfer to the device and
> the boost.compute kernel preparation can occur concurrently; equally while
> the kernel runs the copy-back-to-host command can already become enqueued.
> -) gpgpu_engine can encapsulate a number of policies that control its
> behaviour (providing sensible default-configurations but allowing
> fine-grained control if desired), e.g.: error handling (e.g. throwing
> exception vs. setting some plain-odd OpenCL error codes); device to execute
> on (if multiple available); copying back to the host in non-blocking manner
> (like copy_async); to allow the selection of a 'smart' execution path (for
> example if the input data do not warrant the overhead of a GPU call [e.g. if
> they are too small or do not well fit GPU computation problems] defer the
> call a plain STL-algorithm call or use OpenCL's built-in native C++ calling
> threading functionality); etc. It would be beneficial if those options can
> become temporarily overwritten (something like the boost::ios_..._saver
> classes come to mind).
>
>
> With respect to the third point I am thinking of something along the lines
> of:
>
> template <class T, ... policies etc for both std::vector and cl::Buffer>
> class vector_buffer
> {
>
> private:
> std::vector<T,...> vec;
> cl::Buffer buf;
> };
>
> the class ensures that the std::vector and the buffer are automatically
> synchronized whenever changes must become transparent (i.e. access).
> Obviously this requires some thought if get functions grant access to the
> plain std::vector / cl_mem/cl::Buffer; however for the present
> implementation I also don't see what would stop me from hijacking a cl_mem
> from a compute::vector and modify the buffer arbitrarily outside the
> compute::vector class.

These are definitely ideas I've thought about and these kinds of tools
could all be built upon the current API. I've played around with
basically a "kernel functor" abstraction which provides a class with
the regular C++ "operator()(Args...)" function which stores internally
an OpenCL kernel along with a command queue and any buffers it needs.
This would then provide a simple, pure C++ interface for executing
functions on the GPU that looks just like a normal C++ function call.
I'd be very interested in pursing this more.

> For 4) I am thinking of overloads for the algorithm to +- directly accept
> Khronos C / C++ objects. Some really light-weight adapters adding e.g.
> required type + size data could do the trick.

This is fairly trivial. I have a patch in the works which would allow
the Khronos C++ types to be passed anywhere that Boost.Compute wrapper
types are used to make this all virtually transparent to the user.

> In general I'd find it useful that all of a host object, a device object and
> something linking a host with a device object can be an input / output of an
> algorithm, and the implementation takes care of automatic data transfer. So
> if the input refers to a host object the implementation automatically copies
> the data to the device, if the output is a host object it also automatically
> copies the result to it etc.

See my response above, I'm not a huge fan of these automatic data
transfer abstractions. However, I could be in favor of offering this
functionality in a higher-level API.

> A final but probably very important design consideration:
> I wonder if boost needs a OpenCL-computing library, or a general
> parallelization library. Presently the GPGPU world is already split too much
> between CUDA and OpenCL as main players (hardware vendors doing their parts
> ...), and technology is really rapidly moving (APUs etc.).
> As Hartmut has already pointed out one approach could be to use the current
> proposal as foundation for a parallelization implementation: cut it down to
> the essentials of that and hide as much OpenCL implementation details as
> possible.
> A completely different approach could be to try coming up with a unifying
> parallelization framework that supports multiple backends (OpenCL, CUDA,
> others). Obviously this would be a tremendous amount of work (and getting
> that API right is probably extremely difficult - the last thing we'd need is
> just another restricted API causing more splitting) but in the long run
> could be the more rewarding solution.

I think developing a unifying parallel framework which can
intelligently dispatch algorithms to multiple back-ends is outside the
scope of Boost.Compute (and seemingly much more in the remit of the
proposed C++ Parallelism TS). I think OpenCL is more than sufficient
for running code efficiently across a variety of parallel hardware
including GPUs, multi-core CPUs, FPGAs and various other accelerators
which is why I chose it to build Boost.Compute on top of (along with
it being cross-platform and an open-standard).

> Implementation details:
>
> I have checked the implementation only briefly, mostly only when questions
> arose for a few functions. Overall it looks ok and organized, yet I have
> encountered some issues.
>
> 1) type safety [major issue - must be fixed before acceptance]:
> Type safety is not a strength of OpenCL, and this is reflected at parts in
> the implementation when it fails to add a proper conversion/protection
> layer. Using compute::reduce it was embarrassingly easy to produce rubbish
> results through the following code (modifying the provided tutorial code) :
>
> // BEGIN
>
> compute::device device = compute::system::default_device();
> compute::context context(device);
> compute::command_queue queue(context, device);
>
> // generate random data on the host - type is float
> std::vector<float> host_vector(10000);
> std::generate(host_vector.begin(), host_vector.end(), rand);
>
> // create a vector on the device
> compute::vector<float> device_vector(host_vector.size(), context);
>
> // transfer data from the host to the device
> compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(),
> queue);
>
> double reduction_result = 0.0; // result is of type double
> compute::reduce(device_vector.begin(), device_vector.end(),
> &reduction_result, queue);
> std::cout << "result: " << reduction_result<< std::endl;
>
> // END
>
> The input data is of type float while the result shall be stored in a
> double. This fails miserably under the current implementation because after
> the reduction has completed the final value stored in device memory gets
> copied merely byte-wise to the target variable (using a plain type-ignorant
> clEnqueueReadBuffer), reading simply the 4 bytes from a float into an 8 byte
> double (4/8 on my PC machine). I suppose reversing types (double as input,
> float as output) will be even more spectacular because 4 superfluous bytes
> simply overwrite the stack.
>
> The same affects a plain compute::copy, for example if above the
> device_vector is of type double:
>
> // BEGIN
>
> // generate random data on the host - type is float
> std::vector<float> host_vector(10000);
> std::generate(host_vector.begin(), host_vector.end(), rand);
>
> // create a vector on the device - type is double
> compute::vector<double> device_vector(host_vector.size(), context);
>
> // transfer data from the host to the device
> compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(),
> queue);
>
> // END
>
> it equally makes pang because the data are just copied byte-wise.
>
> The library must provide a strict type-safety for all algorithms / data
> structures, where (in order of preference that comes to mind):
> a) convert properly to target type if possible (above surely applicable)
> b) issue compile-time error if conversions not possible
> c) last fallback: throw a proper exception at runtime

Thanks for providing these tests cases. I'll work on improving
type-checking and error-reporting. If possible, could you submit these
as bug reports to the issue tracker [1]?

> 2) when inspecting the code flow in above copy operation I missed a debug
> mode check that for a copy operation the output range can hold that many
> elements; something like a safe iterator returned by device_vector.begin()
> -> a good implementation should throw an organized error instead of just
> overwriting memory.

This is a very good idea. I'll work on adding more assertions to
verify these sorts of pre-conditions for the algorithms.

> 3) for float/double input containers compute::accumulate falls back to a
> plain serial reduction, making element-wise additions (which is really slow
> on a GPU). This is because can_accumulate_with_reduce returns false as it is
> not defined for integral types. Is there a technical reason why it cannot
> work for floating types? How many algorithms are affected by a possible
> fallback to a plain serial execution order?

Because floating-point addition is not associative. Doing this would
lead to accumulate() producing different results on the device versus
the host. If the user is willing to trade performance for accuracy,
they can call the reduce() algorithm directly. However, I don't think
Boost.Compute should make this call universally and thus it is left up
to the user to decide.

I think this is documented in the accumulate() function, I'll look
into making it more clear.

> 4) for types not supported under both OpenCL and C++ (e.g. long double,
> bool, half) more specific error messages would be useful.

Noted, I'll work on producing better error messages for unsupported
types/operations.

> Note: above are listed only issues which I have encountered during my few
> trials; there's no claim whatsoever for complete coverage
>
>
> Performance:
>
> I have not really tested performance so I cannot say much on it.
> At times I spotted what appears as unnecessary OpenCL runtime overhead (e.g.
> blocking commands, resetting kernel arguments upon each invocation) but I am
> not familiar enough with the implementation to judge if this really just
> redundant.

If you find any unnecessary blocking/synchronization please report it
to the bug tracker [1] and we'll get it fixed.

> Invoking the OpenCL compiler always takes considerable time for any OpenCL
> program. The library compiles kernels on demand when encountered to execute;
> while this is technically reasonable I am not sure in how far it is clear to
> everyone (foremost end-users of programs) that e.g. a simply accumulate of
> 100 ints may take several seconds to execute in total on first invocation
> simply because the kernel compilation takes that time. I guess this
> considerable penalty also somewhat discourages from using the library to
> create a number of kernels on the fly.

True, the run-time compilation model provided by OpenCL does have some
associated overhead. There a few techniques in Boost.Compute which
help mitigate this. The first is a run-time program cache which keeps
built versions of commonly used kernels ready so that the compilation
cost is only payed once per application run (see the program_cache
class [3]). The other is support for offline-caching of program
binaries. This is enabled by defining
"BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache
binaries for programs so that they are only compiled once the very
first time they are run on the system.

I'll work on documenting the compilation-overhead issues as well as
the caching functionality better.

> I would find it very useful if smart algorithms dispatch the algorithm to a
> plain C++ algorithm if it's really predictable that a GPU execution will
> just waste time (I have elaborated on this above). It's fairly trivial to
> have data/algorithm combinations that are better not executed on the GPU,
> being able to rely on some auto-mechanism would relief programmers.

I disagree, I think the call on whether to execute the algorithm on
the host or on an accelerator (with its associated overhead) should be
left up to the user (or potentially a higher-level API built on top of
Boost.Compute). While I agree that this would a useful feature, I just
don't think Boost.Compute is the right place for that logic to live.

> Answers to reviewer questions:
>
> 1. What is your evaluation of the design?
>
> See comments above
>
>
> 2. What is your evaluation of the implementation?
>
> See comments above
>
>
> 3. Documentation:
>
> Overall I find the documentation well written and structured. As minor
> issues at times it could be more explicit / elaborate (e.g. for
> compute::vector a short description is provided, but for several other
> containers there is none; what are the accepted types for predicates in the
> algorithms etc.).

Will do.

> The installation page misses that (on Windows) the OpenCL headers must be
> explicitly included (it leaves the impression that the library would do it
> on it's own).

Hmm, the Boost.Compute headers do "#include <CL/cl.h>" where it is
required. Does this not work correctly for you? If possible, could you
report a bug for this with a bit more information?

> The performance page should include more details with respect to overhead
> and specifically "unintuitive" ones such as kernel compilation time.
> Recommendations can be given when a problem is / is not suitable for GPU
> execution. It is unclear if the provided measurements refer to float or
> double; measurements for both should be provided.

Will do.

> 4. Overall usefulness of the library
>
> I find the portable STL-ish algorithms (+ their supplements) useful. With
> respect to the core + utilities parts I think unnecessary competition with
> existent wrappers is introduced.
>
>
> 5. Did you try to use the library?
>
> I used MSVC 12 and had no problems installing it and running a few little
> programs.
>
>
> 6. How much effort did you put into your evaluation?
>
> I read the documentation, ran tutorial code and created a few example
> programs on my own (I did not run any of the pre-packaged examples). When
> questions arose I took close looks at the implementation (thus I focused
> more on in-depth analyses of selected components instead of testing overall
> into breadth).
>
>
> 7. Are you knowledgeable about the problem domain?
>
> I consider myself knowledgeable of the problem domain.
>
>
> 8. The core question: Do you think the library should be accepted as a Boost
> library?
>
> Generally speaking yes but I recommend a major revision before acceptance is
> reconsidered. My greatest concern revolves around the overall design aim. I
> don't like the idea of competing with the Khronos API for general wrapping;
> I'd prefer a more light-weight library with the STL-ish algorithms (or other
> things) at the core, adding to what is already out there. The library needs
> to find its own niche, minimizing overlap and elaborating on it's
> novelty/strengths. The implementation must become more robust, presently it
> is ways too trivial to break things.

Thanks again for the review! There are lots of good ideas raised here.
Let me know if you have any other feedback or if I can explain
anything further/more clearly.

-kyle

[1] https://github.com/kylelutz/compute/issues
[2] http://kylelutz.github.io/compute/boost/compute/mapped_view.html
[3] http://kylelutz.github.io/compute/boost/compute/program_cache.html


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk