
Subject: Re: [boost] Interest in a GPU computing library
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2012-09-18 14:00:58


On 09/18/2012 06:28 PM, Kyle Lutz wrote:

> *** Why not target CUDA and/or support multiple back-ends? ***
>
> CUDA and OpenCL are two very different technologies. OpenCL works by
> compiling C99 code at run-time to generate kernel objects which can
> then be executed on the GPU. CUDA, on the other hand, works by
> compiling its kernels using a special compiler (nvcc) which then
> produces binaries which can be executed on the GPU.

The company I work at has technology to generate both CUDA (at
compile-time) and OpenCL (at runtime) kernels from expression templates.

At the moment we support element-wise operations, global and partial
reductions across all dimensions, and partial scans across all
dimensions. Combinations of element-wise functions can be merged into a
single reduction or scan kernel.
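
To make the idea of merging an element-wise function into a reduction concrete, here is a minimal CPU-only sketch. It is not the technology described above; the names `sum_unfused`, `sum_fused` and `demo_fusion` are hypothetical, and a plain loop stands in for a GPU kernel. The point is only that the fused version never materializes the intermediate array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: one pass applies the element-wise function into a temporary
// buffer, a second pass reduces that buffer.
inline float sum_unfused(const std::vector<float>& in) {
    std::vector<float> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        tmp[i] = in[i] * std::sqrt(in[i]) + 4.0f;
    float acc = 0.0f;
    for (float v : tmp) acc += v;
    return acc;
}

// Fused: the element-wise function is applied inside the reduction loop,
// so there is a single pass and no temporary buffer.
template <class F>
float sum_fused(const std::vector<float>& in, F f) {
    float acc = 0.0f;
    for (float x : in) acc += f(x);
    return acc;
}

// Usage: both variants compute the same sum.
inline void demo_fusion() {
    const std::vector<float> v = {1.0f, 4.0f, 9.0f};
    auto f = [](float x) { return x * std::sqrt(x) + 4.0f; };
    assert(sum_unfused(v) == sum_fused(v, f));
}
```

On a device the same transformation would save one kernel launch and one round trip through global memory per fused function.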

Everything is automatically streamed and retrieved as needed and data is
cached on the device when possible, with a runtime deciding the right
amount of memory and computing resources to allocate for each
computation depending on the device capabilities.

Therefore, I do not think supporting both CUDA and OpenCL is an
impossible problem.
People want CUDA for a simple reason: CUDA is still faster than
equivalent OpenCL on NVIDIA hardware.

I think however that automatic kernel generation is a whole problem of
its own, and should be clearly separated from the distribution and
memory handling logic.

> This is the reason why I wrote the Boost.Compute lambda library.
> Basically it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and
> transforms them into C99 source code fragments (e.g. "input[i] *
> sqrt(input[i]) + 4") which are then passed to the Boost.Compute
> STL-style algorithms for execution. While not perfect, it allows the
> user to write code closer to C++ that still can be executed through
> OpenCL.

From your description, it looks like you've reinvented the wheel there,
causing needless limitations and interoperability problems for users.

It could have just been done by serializing arbitrary Proto transforms
to C99, with extension points for custom tags.
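
As a rough illustration of what serializing an expression tree to C99 involves, here is a small hand-rolled sketch. It deliberately uses neither Boost.Proto nor Boost.Compute, so every name in it (`sketch::expr`, `to_c99`, `make_kernel`, the kernel name `apply`) is hypothetical. With Proto, the node types and operator overloads would come from the library and `to_c99` would be written as a transform; adding an overload of `to_c99` for a new node type plays the role of an extension point here.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Expression tree nodes; the C++ type of an expression encodes its shape.
struct arg1_node {};                                   // the current element
struct literal_node { int value; };
template <class L, class R> struct binary_node { char op; L lhs; R rhs; };
template <class A> struct call_node { const char* name; A arg; };

// Wrapper so the operators below only apply to our own expressions.
template <class Node> struct expr { Node node; };
const expr<arg1_node> _1{};

template <class L, class R>
expr<binary_node<L, R>> operator*(expr<L> l, expr<R> r) { return {{'*', l.node, r.node}}; }
template <class L, class R>
expr<binary_node<L, R>> operator+(expr<L> l, expr<R> r) { return {{'+', l.node, r.node}}; }
template <class L>
expr<binary_node<L, literal_node>> operator+(expr<L> l, int v) {
    return {{'+', l.node, literal_node{v}}};
}
template <class A>
expr<call_node<A>> sqrt(expr<A> a) { return {{"sqrt", a.node}}; }

// Serialization: one overload per node type, recursing into children.
inline std::string to_c99(arg1_node) { return "input[i]"; }
inline std::string to_c99(literal_node n) { return std::to_string(n.value); }
template <class L, class R>
std::string to_c99(const binary_node<L, R>& n) {
    return "(" + to_c99(n.lhs) + " " + n.op + " " + to_c99(n.rhs) + ")";
}
template <class A>
std::string to_c99(const call_node<A>& n) {
    return std::string(n.name) + "(" + to_c99(n.arg) + ")";
}

// Wrap the fragment in OpenCL C kernel source, to be compiled at run time.
template <class Node>
std::string make_kernel(expr<Node> e) {
    return "__kernel void apply(__global const float* input,\n"
           "                    __global float* output)\n"
           "{\n"
           "    const size_t i = get_global_id(0);\n"
           "    output[i] = " + to_c99(e.node) + ";\n"
           "}\n";
}

// Usage: the expression from the quoted message, fully parenthesized.
inline void demo_serialize() {
    auto e = _1 * sqrt(_1) + 4;
    assert(to_c99(e.node) == "((input[i] * sqrt(input[i])) + 4)");
}

} // namespace sketch
```

The sketch parenthesizes every binary node rather than tracking operator precedence, which keeps the generated fragment correct at the cost of some noise.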

With CUDA, you'd actually have hit the problem that the Proto functions
are not marked __device__, but with OpenCL it doesn't matter.
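
For readers unfamiliar with the `__device__` issue: CUDA only lets device code call functions that carry the `__device__` (or `__host__ __device__`) qualifier, so library functions without it cannot be used inside a kernel. A common workaround in code you control is a portability macro, sketched below. The macro name `SKETCH_HOST_DEVICE` and the functor are hypothetical; the sketch compiles as plain C++ and picks up the qualifiers only when built with nvcc.

```cpp
#include <cassert>
#include <cmath>

// Expands to the CUDA qualifiers under nvcc, and to nothing otherwise.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// A functor usable from host code and, under nvcc, from device code too.
struct times_sqrt_plus_4 {
    SKETCH_HOST_DEVICE float operator()(float x) const {
        return x * std::sqrt(x) + 4.0f;
    }
};

// Usage on the host.
inline void demo_host_device() {
    assert(times_sqrt_plus_4{}(4.0f) == 12.0f);
}
```

This only helps for code you can annotate yourself, which is exactly why unannotated third-party functions are a problem for the CUDA path and not for the OpenCL one, where the kernel is generated as source text.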

