Subject: Re: [boost] [compute] review
From: Kyle Lutz (kyle.r.lutz_at_[hidden])
Date: 2014-12-28 14:41:36
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt <mgruenke_at_[hidden]> wrote:
>> 1. What is your evaluation of the design?
> I agree with other comments made about synchronization. The design should be more explicit about what's asynchronous, and ideally utilize only one mechanism for synchronization. One approach that would have made this clearer is if high-level operations were actually objects that got posted to the command_queue.
As I mentioned before, there is only one mechanism for asynchrony in
Boost.Compute: the command queue abstraction provided by OpenCL.
Operations are enqueued for execution on the compute device, and this
occurs asynchronously with respect to code executing on the host. The
exceptions to this rule are functions which interact directly with
host memory; these are blocking by default and offer explicit
"_async()" versions (in order to prevent potential race conditions on
host memory).
I am also hesitant to make the command_queue interface more high-level.
But this sort of functionality could certainly be built on top of
current interfaces in Boost.Compute. See some of the previous reviews
for discussion about this.
> Regarding synchronization, I'm also a bit concerned about the performance impact of synchronizing on all copies to host memory. Overuse of synchronization can easily result in performance deterioration. On this point, I think it might be worth limiting host memory usable with algorithms, to containers that perform implicit synchronization to actual use (or destruction) of results. Give users the choice between that or performing explicit copies to raw types.
To be clear, not all copies to host memory are synchronized.
Boost.Compute allows both synchronous and asynchronous memory
transfers between the host and device.
An implicitly synchronized container class could be built
on top of the Boost.Compute APIs. I haven't implemented one yet as I
personally feel that blocking operations should be done explicitly by
the user and not hidden away in an implicit API. However, I'd be happy
to review and integrate functionality like this into Boost.Compute if
someone were interested in developing it.
> Also, I agree with Thomas M that it'd be useful for operations to return events. That said, if you told me it doubled the overhead involved in dispatching operations, I'd suggest to make it optional through overloads or via a mechanism in the command_queue itself.
All asynchronous operations in the command queue class do return
events. One of his comments was to also return events from the
synchronous methods for consistency and I am working on adding this.
>> 2. What is your evaluation of the implementation?
> It seems a bit heavy to be header-only.
> I'd agree with Andrey Semashev, regarding thread safety, except I think it shouldn't be an option at all. That said, not all operations on all objects need be thread safe. I think the Boost.ASIO documentation provides a good example of how to express different thread safety limitations. Now, I haven't examined the use of thread-local storage, but do you feel there's a strong case to be made for the thread safety it's providing (especially, given that you seem to feel it's non-essential)?
I absolutely do not feel it is non-essential. The reason for it
currently being a compile-time configurable option is that enabling it
on non-C++11 compilers would make the library non-header-only (it
would require the users to link to Boost.Thread).
This has come up a few times in the review and I have been considering
making the library non-header-only (or at least optionally
non-header-only) which would obviate some of these issues. However,
this is a much larger and potentially non-backwards-compatible change
and I'd need to run it by the community and provide a migration plan
for current users.
> I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.
I'm not sure if there are any Boost conventions for/against this
(someone please speak up if there are). I chose the trailing
underscore for these types as I needed a consistent spelling for
representing the fundamental OpenCL types (e.g. "float" or "uint" or
"int4") and I couldn't just use the names without the trailing
underscore as they'd conflict with the C++ reserved keywords (e.g.
"float", "int"). Using a leading underscore (e.g. "_float4") looked
worse to me, so I used a trailing underscore. But I'd definitely be
open to hearing other ideas.
> I didn't notice any unit tests specifically intended to verify exception-safety. I'd want to know that had been thoroughly vetted, before I'd consider using Boost.Compute in production code.
There are already some tests which exercise the error/exception paths
in Boost.Compute (testing building invalid programs in
"test_program.cpp" is the first that comes to mind). I'll work on
adding more. Please let me know if there are any specific areas you'd
like to see more heavily tested.
> Finally, based on the docs, it seems there's a bug in boost::compute::command_queue::enqueue_copy_image_to_buffer() and boost::compute::command_queue::enqueue_copy_buffer_to_image(). The region and origin parameters should be arrays, in the 2D and 3D cases. If this discrepancy is intentional, it should be noted. I haven't exhaustively reviewed the reference docs, so there might be other issues of this sort.
Can you be more specific as to what you consider to be a "bug" here?
These APIs in the command_queue class are very "thin" wrappers on top
of the clEnqueue* functions from the OpenCL C API. If you could
provide a small test-case which demonstrates the issue and submit it
to the bug-tracker, that would be great.
>> 3. What is your evaluation of the documentation?
> It needs more and better-documented tutorials (and I'm including the "Advanced Topics", here).
Fully agree, I'll work on this.
> Some of the reference pages could use more details, as well, especially concerning synchronization and exception usage.
Fully agree. I've been working on improving this. And please let me
know if there are any specific classes/functions for which you'd like
to see more detailed documentation.
> Regarding exceptions, there should be discussion of specifically what is invalidated by which kinds of errors. If any errors are non-recoverable, this should also be called out and perhaps derived from a different or additional baseclass (not a bad use case for multiple inheritance, actually).
As far as I know, there are no non-recoverable errors or errors which
invalidate host objects thrown by Boost.Compute. This is due to the
fact that Boost.Compute mainly deals with executing operations on a
separate device (e.g. GPU) and exceptions there don't affect the
host's ability to continue execution.
> I agree with Mathias Gaunard that it would be helpful to discuss computational complexity, device memory requirements, and the various methods used to implement some of the algorithms. You could continue to omit these details on the simple algorithms, though.
I will definitely continue to improve the documentation and
specifically add more of these sorts of details for the algorithms.
Please let me know if there are additional areas where you would also
like to see more specific details.
>> 4. What is your evaluation of the potential usefulness of the library?
> This is my biggest misgiving, by far. In the very near future, I expect developers will opt for either SYCL (https://www.khronos.org/opencl/sycl) or Bolt (https://github.com/HSA-Libraries/Bolt). SYCL provides a modern, standard C++11 wrapper around OpenCL, with better concurrency control and support for integrating kernels inline. Bolt provides many of the same higher-level abstractions found in Boost.Compute, but with forthcoming support for HSA.
> To have the kind of lasting relevance and broad applicability to which all Boost libraries should aspire, I think Boost.Compute should be architected to support multiple backends. Though OpenCL support is currently ascendant, it's far from universal and is already flagging on some platforms (Nvidia, not the least). And HSA provides a foundation on which alternatives are actively being built. Most importantly, there exist multitudes of multi-core and multiprocessor systems which lack OpenCL support. It would be eminently useful to support these with such backends as thread pool, OpenMP, etc. And backends could be added to support new technologies, as they mature.
>> 5. Did you try to use the library? With what compiler? Did you have any problems?
> I downloaded and inspected the sources and some examples. I didn't have any questions or concerns that would be addressed by experimentation.
>> 6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
> Review of the documentation, some sources, some examples, and a refresher on SYCL & Bolt. I also read all of the reviews and replies sent thus far.
>> 7. Are you knowledgeable about the problem domain?
> I've been developing production code for parallel and heterogeneous computing platforms for the majority of the last 16 years. I've followed OpenCL since version 1.0 was finalized. I've also submitted bug reports and minor patches for Khronos' official C++ OpenCL interface.
>> 8. Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
> Not in its current state. The primary issue is that it's too OpenCL-centric. It should support more backends, even if these are just compile-time options. This is crucial for both current and lasting relevance.
Boost.Compute is explicitly *not* designed to be an abstraction layer
over every parallel-programming interface currently available. It is
designed to target modern programmable GPUs, multi-core CPUs and some
of the more exotic accelerators (Xeon Phi, Parallella Epiphany, FPGAs,
etc.). In my opinion, OpenCL is currently the best and most
widely-supported interface for this purpose.
Other libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC,
etc.) are all dependent on either a special compiler or special
compiler extensions. On the other hand, OpenCL is a library-level
solution which allows Boost.Compute to be portable to any platform
with a standard C++ compiler.
And as Boost.Compute and SYCL are both based on OpenCL, interoperating
between them should be fairly trivial. If a non-commercial SYCL
implementation is ever released, I will work on adding support for
its single-source compilation model to Boost.Compute.
I feel developing an all-encompassing parallel programming library is
outside the scope of Boost.Compute (and would require considerably
more resources to design, implement, and maintain). That said, I feel
that Boost.Compute is well-suited to provide a back-end to a more
generic/high-level parallel programming library (either something like
a potential Boost.Parallel library or the upcoming standard C++
parallel algorithms).
More concretely, and as Sebastian mentioned, OpenCL itself already
provides a parallel-programming abstraction with many different
back-ends (both vendor-supplied implementations and open-source
implementations). I personally don't see the benefit to adding yet
another abstraction layer between the user and the compute device. I
feel it would greatly increase complexity of the library and prevent
us from spending more time on improving performance.
> Beyond that, better concurrency control is crucial for optimum performance. If done well, it could even help further reduce usage errors.
Boost.Compute provides low-level abstractions for concurrency between
the host and device. More high-level APIs and frameworks could
definitely be built on top of them. However, designing an optimal,
high-level API for concurrency is notoriously difficult and, in my
opinion, still an open problem.
Thanks for the review. Please let me know if I can explain anything further.