Boost logo

Boost :

Subject: Re: [boost] [compute] Review
From: Kyle Lutz (kyle.r.lutz_at_[hidden])
Date: 2014-12-29 02:01:52


On Sun, Dec 28, 2014 at 7:26 PM, John Bytheway
<jbytheway+boost_at_[hidden]> wrote:
> Here's my review.
>
> 1. What is your evaluation of the design?
>
> The design of building on OpenCL seems sound, for the reasons Kyle has
> explained. The fact that it has real users is very encouraging.
>
> I don't really like the name (I was unable to guess what the library was
> about from the name), but it's not worth changing at this point.
>
> I am worried about the compile-time configuration with macros. I would
> hope that BOOST_COMPUTE_HAVE_THREAD_LOCAL could be read from
> Boost.Config. For the two features which require linking against other
> Boost libraries, my instinct would be to have them on by default, rather
> than off by default (that should be easier if and when Boost.Compute is
> part of the Boost distribution).

I agree and I would very much prefer to have these thread-safe
features on by default and will work towards improving this. Also, I
looked for a configuration macro to enable support for C++11
"thread_local" storage in Boost.Config, but could not find one.
If/when one is added I'll update Boost.Compute to use it. And like you
said, many of these changes become easier if Boost.Compute becomes a
part of the standard Boost distribution.

> It would be very nice to have range versions of all the algorithms. I
> find myself using the boost::range algorithms much more than the std::
> ones these days.

I fully agree and also I'm very much looking forward to having range
algorithms standardized in C++17.

> Similarly, it would be good to provide the safer 4-iterator versions of
> equal, is_permutation, mismatch (and any I've forgotten) as were added
> to C++14.

Fully agree. This shouldn't be too much work.

> Some minor things:
>
> - Consider renaming boost::compute::lambda to be consistent with one of
> the other placeholder-containing namespaces from std or Boost. (You can
> keep the old name as an alias for backwards-compatibility).

Yeah, I'll work on unifying the placeholders.

> - Why the default of .25 as the bernoulli_distribution parameter?

Looks like an oversight, I'll get it updated to use 0.5 just as
std::bernoulli_distribution does.

> - Did you consider using Boost.Predef for your version macro?

Currently Boost.Compute remains compatible with Boost version 1.48
which didn't include Boost.Predef (added in 1.55 I believe). If
Boost.Compute is accepted into the standard Boost distribution,
updating it to use this should be no problem. (as well as many other
Boost features added since 1.48).

> 2. What is your evaluation of the implementation?
>
> Only looked at a couple of things while trying to understand the docs.
>
> 3. What is your evaluation of the documentation?
>
> Overall, not too bad.
>
> More tutorial would be good. In particular, something using some of the
> standard algorithms.

I'll definitely work on this more. Any ideas of specific
tutorials/examples you'd like to see would be very helpful.

> Some concerns:
>
> - All the header links off
> <http://kylelutz.github.io/compute/boost_compute/reference.html> are broken.
>
> - All or most of the things in "See also" sections look like they ought
> to be links but aren't.
>
> - For all your function parameter documentation, the list of parameters
> seems to be sorted alphabetically by parameter name rather than in the
> order they are passed to the function. This is really weird...

These are all, to the best of my knowledge, bugs in the Boost
documentation toolchain. I reported these issues a while back on the
boost-doc mailing list here [1] but never found a resolution. I would
really like to get these to work, but am not knowledgable enough about
the Boost documentation pipeline (specifically the doxygen->quickbook
infrastructure) to fix it. If anyone has any ideas or could help with
this I would be very appreciative.

> - As others have mentioned, some kind of complexity estimate or promise
> for the algorithms is very important. In particular it would be good to
> understand how much parallelism they take advantage of and how they
> ought to scale with device compute capability.

Will do.

> - The warning about accumulate being slow for floats should be much more
> prominent.

Will do.

> - <http://kylelutz.github.io/compute/boost/compute/copy.html> mentions
> copying from e.g. a std::list to a compute::vector. It would be nice if
> it gave some kind of indication as to how efficient we can expect such
> an operation to be. In particular, I guess it ought to do some kind of
> buffering on the host?

Currently it will just buffer the entire range into a std::vector<T>
and then copy that to the device. If this is a bottleneck it could
also be improved to copy the data in smaller batches. Let me know if
you encounter issues with this.

> - The way I normally copy into vectors is call v.reserve(...) and then
> copy to std::back_inserter(v). I tried that with a compute::vector and
> unsurprisingly it was very inefficient. I see you already warned about
> this in the push_back docs, but perhaps it should also be highlighted in
> the documentation of compute::copy.

Yeah, this idiom is not well suited to copying into device memory. You
should prefer to do "bulky" operations using a single call to
boost::compute::copy() (which internally will call the OpenCL function
for copying large chunks of memory).

> - Is there no version of inclusive/exclusive_scan or iota using an
> arbitrary associative operator?

These are not currently supported. I will work on adding an
implementation for these and documenting them.

> - <http://kylelutz.github.io/compute/boost/compute/includes.html> it's
> unclear whether the subrange has to be contiguous.

It should have the same semantics of std::includes, which IIRC, does
not require the subrange to be contiguous.

> - min_element and max_element should cross-reference minmax_element.

Will do.

> - What source of randomness does random_shuffle use?

Currently it simply calls std::random_shuffle to generate the shuffled
index sequence then uses boost::compute::scatter() to actually
re-arrange the values on the device. However I am planning on
deprecating it and providing an implementation of the C++11 shuffle()
algorithm instead which will use the random number generator provided
by the user.

> - Why does reduce return its result through an OutputIterator?

In order to allow the algorithm to store the result in device memory
and avoid a host-device synchronization point (which would be
necessary if it simply returned the reduced result).

> - <http://kylelutz.github.io/compute/boost/compute/remove.html> Does
> this really remove things or use std::remove semantics? You should
> clarify and document the return value. (Same with remove_if, unique)

Works just the same as std::remove. I'll add more documentation for these.

> - I can see how sort_by_key might be useful, and I'm slightly surprised
> that there's no stable_sort_by_key (which seems like it would be more
> useful than stable_sort). Any particular reason?

Just not enough time to implement it yet. There is a parallel
merge-sort in the works which would be very suitable for implementing
a stable_sort_by_key(). Hopefully this should be ready soon.

> - The boost/compute/types headers docs seem to be broken or missing. In
> particular I couldn't find reference documentation for float4_ et al.

Yeah, I wasn't sure the best way to document all of these (IIRC, 10
different built in scalar types times four different vector version of
each). I'll add some information with an overview of the provided
fundamental types.

> - I was interested to see allocators, but the docs seem uninformative:
> <http://kylelutz.github.io/compute/boost/compute/pinned_allocator.html>

I'll definitely update this.

> - Given the confusion with std::future, you might want to clarify
> whether compute::future's destructor blocks.

It does not, I'll update the documentation for this.

> - Also, <http://kylelutz.github.io/compute/BOOST_COMPUTE_CLOSURE.html>
> doesn't explain what format the lists (arguments and capture) should be
> in. Preprocessor functions often take lists in other formats, such as a
> Boost.PP Sequence
> <http://www.boost.org/doc/libs/1_57_0/libs/preprocessor/doc/data/sequences.html>,
> and your example only has one-element lists, so it's not enough to
> clarify this point. (To be clear: I'm not suggesting you should use
> Boost.PP sequences, merely that you should make it clear in the docs
> whether you are).

Will do.

> - <http://kylelutz.github.io/compute/BOOST_COMPUTE_CL_CALLBACK.html> is
> empty, as is
> <http://kylelutz.github.io/compute/BOOST_COMPUTE_MAX_ARITY.html>

Will fix.

> - <http://kylelutz.github.io/compute/boost/compute/valarray.html> -
> presumably compute::valarray has overloaded arithmetic operators? Are
> they mentioned anywhere?

Actually they aren't currently implemented. I'll open an issue for
this and also document them once they are added.

> - <http://kylelutz.github.io/compute/boost/compute/vector.html> the
> unusual behaviour of vector's size_t constructor (not initializing the
> elements) should be warned more prominently.

Will do.

> - <http://kylelutz.github.io/compute/boost/compute/device.html> the
> documentation of the enum type in the synopsis seems to be messed up.
> Similarly memory_object.

Thanks, I'll look into this more.

> - <http://kylelutz.github.io/compute/boost/compute/function.html> I'm
> puzzled by this interface. What's the name for? Would one ever use
> this constructor rather than e.g. make_function_from_source?

There are a few legitimate use-cases. One is for specifying the names
of pre-existing or built-in functions (e.g.
function<float(float)>("sqrt") would call the built-in sqrt()
function). Another is for providing a type-erased holder for
lambda-functions (e.g. "function<int(int)> plus_two = _1 + 2;").
However, most uses for function<> actually come from the
BOOST_COMPUTE_FUNCTION() macro (or uses internal to Boost.Compute
itself).

> - <http://kylelutz.github.io/compute/boost/compute/field.html> My
> reading of this is that when you get the type wrong you are guaranteed
> some kind of meaningful error report, and not just undefined behaviour;
> is that right? Perhaps you should clarify.
>
> - I can't guess what atomic_min and atomic_max do.

They just forward to the corresponding OpenCL functions [2]. I'll
update the documentation to be more clear about these.

> - The right arrow ('next') link from
> <http://kylelutz.github.io/compute/boost/compute/atomic_xor.html> is
> broken (looks like the docs for the placeholders don't work).

Hmm, I'll look into this more.

> - The iterators should state what iterator category they have.

Currently Boost.Compute doesn't really have different iterator
categories, all iterators are random-access. I'll add this information
to the documentation.

> - I can't guess what svm_ptr is for. Later I discovered svm_alloc and
> svm_free. Is there an RAII way to manage these things?

These are for managing shared virtual memory objects which were added
in OpenCL 2.0. Currently there is no RAII wrappers around them and
they behave like malloc()/free(). I have plans to add an svm_allocator
which will provide an allocator for SVM objects which can be used with
the Boost.Compute containers (and also possibly standard STL
containers).

> - What does default_device do if there are no devices? Do the various
> environment variables mentioned have any kind of precedence? (Probably
> linking to the corresponding OpenCL docs is sufficient for such
> questions). Do they have to be exact matches? Are they case-sensitive?

There is a precedence but it currently is not explained in the
documentation. For most cases only one of the default device variables
is meant to be used. They are currently will match substrings and are
case-sensitive. I'll work on improving the documentation for these and
give some example usages.

> - What does find_device do if the requested device does not exist?

Currently it will return a null (default-constructed) device object. I
have considered updating this (and the default_device() function) to
throw an exception when no valid device is found so that returned
device objects are always valid.

> - Can you give examples for type_definition? My guess was that it did
> the thing which type_name later turned out to do...

Yeah, type_definition<T> will return a string with the full definition
for the type while type_name<T> returns just the name.

So for something like:

struct Foo {
    int id;
    float mass;
};

BOOST_COMPUTE_ADAPT_STRUCT(Foo, Foo, (id, mass));

The following strings will be returned:

type_name<Foo>() == "Foo"
type_definition<Foo>() == "typedef struct { int id; float mass; } Foo;"

I'll update the documentation for type_definition<T> with a better example.

> - BOOST_COMPUTE_ADAPT_STRUCT should state that it must be used at global
> scope (i.e. not within a function or other namespace).

Will do.

> - Congratulations for getting compile-time detection of internal padding
> into BOOST_COMPUTE_ADAPT_STRUCT :). You might want to mention in the
> docs that this will be detected, so people don't worry too much.

Will do. I also hope in the future to transparently support padded
structs as well and to be able to correctly copy them to/from the
device without issue.

> - It looks like the performance tests don't include the time for copying
> data to (and, where applicable, from) the device. You should mention
> this on the Performance page.

I'll add this, though it is very device/system specific. Also, there
is a performance benchmark available in Boost.Compute
("perf_copy_to_device") which could be used to measure this on your
own system.

> - You should probably add C++AMP to your things compared with in the FAQ.

Will do.

> 4. What is your evaluation of the potential usefulness of the
> library?
>
> High. I've wanted something like this in the past and not found this.
>
> 5. Did you try to use the library? With what compiler? Did you have
> any problems?
>
> I used it with g++ 4.8.2 in C++1y mode. Some quick tests seemed to work
> OK, and indeed I was able to sort some large vectors of scalars faster
> using boost::compute::sort than std::sort, which is nice.
>
> I was pleased to see that 64-bit integers were handled fine.
>
> 6. How much effort did you put into your evaluation? A glance? A
> quick reading? In-depth study?
>
> Couple of hours reading and trying simple examples.
>
> 7. Are you knowledgeable about the problem domain?
>
> I have done a couple of small projects with CUDA, and am generally
> knowledgeable, but with no previous OpenCL experience.
>
> 8. Do you think the library should be accepted as a Boost library? Be
> sure to say this explicitly so that your other comments don't obscure
> your overall opinion.
>
> Yes. OpenCL seems to have fairly broad acceptance, but is rather
> difficult to use safely and with C++-isms. This library fills a useful
> niche. In an ideal world it will eventually be rendered obsolete by
> things like parallelism entering the standard library, but I doubt that
> will happen soon if at all. This provides a good solution now, and we
> shouldn't dismiss it simply because something sparkly is approaching,
> which may or may not turn out to be awesome and widely adopted.
>
> This library has clearly been a lot of work. Congratulations on making
> it this far, I hope the review is not too stressful :).

Thanks for the review! Your feedback has been very helpful!

-kyle

[1] http://lists.boost.org/boost-docs/2013/07/5077.php
[2] https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/atomic_min.html


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk