Hello,

This year's GSoC efforts spurred some project proposals concerned with accelerating Boost.uBLAS using a variety of techniques (vectorization, parallelization, GPU acceleration).

As I have worked on similar efforts in the past (e.g., http://openvsip.org/), I'd like to share some of my experience, so any future effort can learn from that.

1) choice of optimisation

The first thing to note is that there are quite a few approaches to accelerating BLAS routines, and which one to choose on a particular hardware platform depends on several criteria. Some of that information may be available at compile time (operand types, the specific operation, etc.), some only at runtime (the problem size, the exact array dimensions, memory alignment, number of cores available, etc.). Worse, the platform on which the code is compiled may not even be the platform on which it is to run (unless you want to end up in a situation like ATLAS, which doesn't support cross-compilation precisely because it fine-tunes generated code by measuring performance on the hardware available during the build).

This suggests an approach where multiple "backends" (SIMD, OpenMP, OpenCL, CUDA, etc.) coexist and may be enabled individually, depending on the deployment context. A user may then either select one of the available backends explicitly (using an appropriate API that needs to be added), or rely on a mechanism (also still to be added) that selects the "best" backend. This selection could itself be done in different ways: either in-process ("just in time"), or out-of-process, in a profiling run.
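
To make the backend idea a bit more concrete, here is a minimal sketch of what such a selection mechanism might look like. All names (BackendRegistry, AxpyKernel, pick_fastest, axpy_scalar) are made up for illustration and are not part of Boost.uBLAS; the "best" backend is picked by a single crude in-process timing, where a real implementation would need proper warm-up and repeated measurements.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; none of this is existing Boost.uBLAS API.
using Vec = std::vector<double>;
// A backend provides a kernel computing y <- a*x + y.
using AxpyKernel = std::function<void(double, const Vec&, Vec&)>;

class BackendRegistry {
public:
    // Register a backend under a name (e.g. "scalar", "simd", "opencl").
    void add(const std::string& name, AxpyKernel kernel) {
        kernels_[name] = std::move(kernel);
    }

    // Explicit selection: the user asks for a backend by name.
    const AxpyKernel& get(const std::string& name) const {
        auto it = kernels_.find(name);
        if (it == kernels_.end())
            throw std::out_of_range("unknown backend: " + name);
        return it->second;
    }

    // In-process selection: run every backend once on a problem of
    // size n and return the name of the fastest one.
    std::string pick_fastest(std::size_t n) const {
        if (kernels_.empty())
            throw std::logic_error("no backends registered");
        const Vec x(n, 1.0);
        std::string best;
        auto best_time = std::chrono::steady_clock::duration::max();
        for (const auto& entry : kernels_) {
            Vec y(n, 0.0);
            const auto t0 = std::chrono::steady_clock::now();
            entry.second(2.0, x, y);
            const auto dt = std::chrono::steady_clock::now() - t0;
            if (dt < best_time) {
                best_time = dt;
                best = entry.first;
            }
        }
        return best;
    }

private:
    std::map<std::string, AxpyKernel> kernels_;
};

// Reference backend: a plain scalar loop.
inline void axpy_scalar(double a, const Vec& x, Vec& y) {
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
}
```

A caller would register whatever backends are enabled in the current build (reg.add("scalar", axpy_scalar), etc.) and then either call reg.get("scalar") directly or use reg.get(reg.pick_fastest(n)). The out-of-process variant would store the result of such a measurement in a profile and read it back at startup.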

Note that, for GPU-based backends, it is crucial to eliminate unnecessary data movement, as it has a huge impact on performance. Therefore, rather than naively moving data from the host to the GPU, running the operation, and moving the result back for every single operation, it's much better to move data "lazily", i.e. to keep data on the GPU in case the next operation is also performed there. All this suggests that a good data model is crucial for such acceleration work.
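
As an illustration of what "lazy" movement means for the data model, here is a small sketch. It uses no real GPU API: the "device" is simulated by a second std::vector so that the bookkeeping can be run anywhere, and the names (LazyBuffer, device_scale) are hypothetical. The point is only the validity flags, which make sure a transfer happens when a copy is actually stale, not on every operation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: the "device" side is simulated with a second
// std::vector, and transfers are just counted copies.
class LazyBuffer {
public:
    explicit LazyBuffer(std::vector<double> host) : host_(std::move(host)) {}

    // Read-write access for a device-side operation.
    // Uploads only if the device copy is stale.
    std::vector<double>& device() {
        if (!device_valid_) {
            device_ = host_;      // stands in for a host-to-device transfer
            device_valid_ = true;
            ++uploads_;
        }
        host_valid_ = false;      // the device copy is about to be modified
        return device_;
    }

    // Read-only access from the host.
    // Downloads only if the host copy is stale.
    const std::vector<double>& host() {
        if (!host_valid_) {
            host_ = device_;      // stands in for a device-to-host transfer
            host_valid_ = true;
            ++downloads_;
        }
        return host_;
    }

    int uploads() const { return uploads_; }
    int downloads() const { return downloads_; }

private:
    std::vector<double> host_;
    std::vector<double> device_;
    bool host_valid_ = true;
    bool device_valid_ = false;
    int uploads_ = 0;
    int downloads_ = 0;
};

// A "device-side" operation: scales every element in place.
inline void device_scale(LazyBuffer& b, double a) {
    for (double& v : b.device()) v *= a;
}
```

With this, two consecutive calls to device_scale() on the same buffer cost one upload and no download; the data only travels back once the host actually asks for it via host().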

2) do-it-yourself versus using existing backends

At least for certain platforms, optimised "kernels" already exist, and it might be best to call those rather than reimplement them. For example, freely distributable BLAS libraries exist for both CUDA and OpenCL. Thus, it might be more efficient to add adapters that allow Boost.uBLAS to call those, rather than to implement its own kernels.
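
A rough sketch of such an adapter follows. To keep it self-contained, the external kernel is replaced by a stand-in function (vendor_gemm, a hypothetical name with a C-style, GEMM-like signature) and the matrix type is a minimal struct, not a uBLAS type. A real adapter would forward to an actual vendor routine instead, and would have to deal with storage order, strides and element types.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for an external, C-style GEMM kernel (hypothetical name).
// Computes C = alpha*A*B + beta*C for row-major matrices, where
// lda, ldb and ldc are the row strides of A, B and C.
inline void vendor_gemm(std::size_t m, std::size_t n, std::size_t k,
                        double alpha,
                        const double* a, std::size_t lda,
                        const double* b, std::size_t ldb,
                        double beta,
                        double* c, std::size_t ldc) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
        }
    }
}

// Minimal dense row-major matrix, standing in for a library matrix type.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// The adapter: checks shapes, unpacks dimensions, pointers and strides,
// and hands the actual work to the external kernel.
inline Matrix prod(const Matrix& a, const Matrix& b) {
    if (a.cols != b.rows)
        throw std::invalid_argument("prod: dimension mismatch");
    Matrix c(a.rows, b.cols);
    vendor_gemm(a.rows, b.cols, a.cols, 1.0,
                a.data.data(), a.cols,
                b.data.data(), b.cols,
                0.0,
                c.data.data(), c.cols);
    return c;
}
```

The user-facing call stays a plain prod(a, b); only the adapter knows which kernel does the work, which is what would allow swapping kernels per backend.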


All that being said, I don't think it's a good idea to let GSoC students make their own choices, hoping that those will be in line with what the Boost.uBLAS developers have planned for the future. On the other hand, such an architectural vision may not even exist yet, so it's hard to come up with a clear path forward without doing some actual prototyping. But with all the above open questions, there seems to be a real danger of any project being over-ambitious and, in the end, not producing any tangible results that could be re-integrated into Boost.uBLAS. I'd thus like to suggest that we scale down the expectations a bit, perhaps picking one or two self-contained ideas from the above that can be implemented, and even validated, relatively easily.

Thoughts?

Stefan
-- 

      ...ich hab' noch einen Koffer in Berlin...