Subject: [ublas] adding multicore and GPU support: thoughts and strategies
From: Stefan Seefeld (stefan_at_[hidden])
Date: 2018-03-15 20:37:11
Hello,
This year's GSoC efforts spurred some project proposals concerned with
accelerating Boost.uBLAS using a variety of techniques (vectorization,
parallelization, GPU acceleration).
As I have worked on similar efforts in the past (e.g.,
http://openvsip.org/), I'd like to share some of my experience, so that
future efforts can learn from it.
1) choice of optimisation
The first thing to note is that there are quite a few approaches to
accelerating BLAS routines, and which one to choose on a particular
hardware platform depends on several criteria, including information
that may be available at compile time (operand types, the specific
operation, etc.) or only at runtime (the problem size, the exact array
dimensions, memory alignment, the number of cores available, etc.). Worse,
the platform on which the code is compiled may not even be the platform
on which it is to run (unless you want to end up in a situation like
ATLAS, which doesn't support cross-compilation precisely because it
fine-tunes generated code by measuring performance on hardware available
during the build).
This suggests a different approach, where multiple "backends" (SIMD,
OpenMP, OpenCL, CUDA, etc.) coexist and may, depending on the
deployment context, be enabled individually. A user may then either
select one of the available backends explicitly (through an appropriate
API that needs to be added), or a mechanism needs to be added that
selects the "best" backend automatically. This selection itself could
be done in different ways, either in-process ("just in time") or
out-of-process, in a profile run.
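To give an idea of what an explicit selection API might look like, here
is a minimal sketch using tag dispatch. None of these names exist in
Boost.uBLAS today; the backend tags and the three-argument prod()
overloads are invented purely for illustration:

#include <boost/numeric/ublas/matrix.hpp>
#include <cstddef>

namespace ublas = boost::numeric::ublas;
using matrix = ublas::matrix<double>;

struct serial_backend {};  // existing single-threaded code path
struct openmp_backend {};  // hypothetical multicore code path

// Fallback: defer to the existing uBLAS expression machinery.
matrix prod(matrix const& a, matrix const& b, serial_backend)
{
    return ublas::prod(a, b);
}

// Multicore path: naive triple loop, parallelised across rows.
matrix prod(matrix const& a, matrix const& b, openmp_backend)
{
    matrix c = ublas::zero_matrix<double>(a.size1(), b.size2());
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size1()); ++i)
        for (std::size_t k = 0; k < a.size2(); ++k)
            for (std::size_t j = 0; j < b.size2(); ++j)
                c(i, j) += a(i, k) * b(k, j);
    return c;
}

int main()
{
    matrix a = ublas::scalar_matrix<double>(512, 512, 1.0);
    matrix b = ublas::scalar_matrix<double>(512, 512, 2.0);
    matrix c1 = prod(a, b, serial_backend{});  // explicit user choice
    matrix c2 = prod(a, b, openmp_backend{});  // explicit user choice
}

An automatic selector could use the same dispatch internally, picking a
tag at runtime based on problem size, core count, and device
availability.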
Note that, in the case of GPU-based backends, it is crucial to
eliminate unnecessary data movement, as it has a huge impact on
performance. Rather than naively copying data from the host to the GPU,
running the operation, and copying the result back for every single
operation, it is much better to move data "lazily", i.e. to keep data
on the GPU in case the next operation is also performed there. All this
suggests that a good data model is crucial for such acceleration work.
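As a rough illustration of such a data model, consider a container that
remembers where its authoritative copy currently lives and only
transfers data when an operation actually needs it on the other side.
Everything below is hypothetical; the transfer functions are left as
empty stubs where the real cudaMemcpy / clEnqueueReadBuffer calls would
go:

#include <cstddef>
#include <vector>

enum class location { host, device };

class dual_array
{
public:
    explicit dual_array(std::size_t n) : host_(n), where_(location::host) {}

    // Host-side access: copy back only if the most recent result still
    // lives on the GPU.
    double* host_data()
    {
        if (where_ == location::device)
        {
            copy_device_to_host();
            where_ = location::host;
        }
        return host_.data();
    }

    // Device-side access: copy up only if needed. Afterwards the
    // authoritative copy stays on the GPU, so a chain of GPU operations
    // incurs no further transfers.
    void* device_data()
    {
        if (where_ == location::host)
        {
            copy_host_to_device();
            where_ = location::device;
        }
        return device_;
    }

private:
    // Stubs: a real backend would call cudaMemcpy,
    // clEnqueueReadBuffer, etc. here.
    void copy_host_to_device() {}
    void copy_device_to_host() {}

    std::vector<double> host_;
    void* device_ = nullptr;  // backend-specific device buffer
    location where_;
};

int main()
{
    dual_array a(1024);
    a.device_data();  // first GPU use: one host-to-device transfer
    a.device_data();  // already resident: no transfer
    a.host_data();    // result needed on the host: one transfer back
}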
2) do-it-yourself versus using existing backends
At least for certain platforms there already exist optimised "kernels",
and it might be best to call those rather than reimplement them. For
example, freely distributable BLAS libraries exist for both CUDA and
OpenCL (cuBLAS and clBLAS, for instance). Thus, it might be more
efficient to add adapters that allow Boost.uBLAS to call those, rather
than implementing its own kernels.
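As a sketch of what such an adapter could look like, here is a matrix
product forwarded to cuBLAS. Error handling is omitted, the wrapper
name cublas_prod is invented, and a real adapter would reuse the handle
and device buffers rather than allocating per call; compile with nvcc
and link against -lcublas:

#include <boost/numeric/ublas/matrix.hpp>
#include <cublas_v2.h>
#include <cuda_runtime.h>

namespace ublas = boost::numeric::ublas;
using matrix = ublas::matrix<double>;  // row-major by default

matrix cublas_prod(matrix const& a, matrix const& b)
{
    int const m = static_cast<int>(a.size1());
    int const k = static_cast<int>(a.size2());
    int const n = static_cast<int>(b.size2());
    matrix c(m, n);

    double *da, *db, *dc;
    cudaMalloc((void**)&da, sizeof(double) * m * k);
    cudaMalloc((void**)&db, sizeof(double) * k * n);
    cudaMalloc((void**)&dc, sizeof(double) * m * n);
    cudaMemcpy(da, &a.data()[0], sizeof(double) * m * k,
               cudaMemcpyHostToDevice);
    cudaMemcpy(db, &b.data()[0], sizeof(double) * k * n,
               cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    double const alpha = 1.0, beta = 0.0;
    // cuBLAS expects column-major storage; computing B*A with swapped
    // operands yields the row-major product A*B.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k,
                &alpha, db, n, da, k, &beta, dc, n);

    cudaMemcpy(&c.data()[0], dc, sizeof(double) * m * n,
               cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return c;
}

Note that this naive version transfers both operands on every call;
combined with the lazy data model sketched above, those transfers would
disappear for chained operations.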
All that being said, I don't think it's a good idea to let GSoC
students make their own choices, hoping that those will be in line with
what the Boost.uBLAS developers have planned for the future. On the
other hand, such an architectural vision may not even exist yet, so
it's hard to come up with a clear path forward without doing some
actual prototyping. But with all the above open questions, there seems
to be a real danger that any project will be over-ambitious, yet in the
end produce no tangible results that could be reintegrated into
Boost.uBLAS. I'd thus like to suggest that we scale down the
expectations a bit, perhaps picking one or two self-contained ideas
from the above that can be implemented and validated relatively easily.
Thoughts?
Stefan
-- ...ich hab' noch einen Koffer in Berlin...