Subject: Re: [boost] interest in an asynchronous linear algebra library?
From: Hartmut Kaiser (hartmut.kaiser_at_[hidden])
Date: 2015-12-03 16:39:06
> There is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal
> in the sense that i took my existing LinAlg, ripped it apart and rewrote
> it partially to fit the new needs of the library. Only cpu is
> implemented, yet. I am open for suggestions, helpful advice and of
> course people who are interested to work on it. I am not too happy with
> the scheduling interface right now and its implementation looks a bit
> slower than necessary, but i think this will evolve over time.
The scheduling should not be part of your library, it's a pure
> two basic examples showing what the library already can do are given in
> examples/ . For ublas users this should not look too foreign.
I don't think that you should make it explicit to the user whether a certain
operation has to be async or not. Just make every operation async by default
and let the library decide when to synchronize (i.e. whenever the actual
matrix data is accessed). Any expressions (like r += 2*x+1 + prod(x,y) -
btw: why prod(x,y) and not x * y)?) should not execute any code directly,
just construct an expression dependency tree which starts scheduling the
asynchronous sub-tasks. You need to wait for the result only when somebody
accesses an element of r. The easiest would be to let your matrix type wrap
a future which references the actual matrix. Additional transformations of
the expression tree could be used to merge the operations, if needed.
> For all who are interested, here is the basic design and the locations
> in include/aBLAS/:
> 1. computational kernels are implemented in kernels/ and represent
> typical bindings to the BLAS1-3 functionality as well as a default
> implementation (currently only dot,gemv,gemm and assignment are tested
> and working, no explicit bindings included, yet). kernels are enqueued
> in the scheduler via the expression template mechanisms and kernels are
> not allowed to enqueue kernels recursively.
Why do you implement your own BLAS functions?
> 2. a simple PoC scheduler is implemented in scheduling/scheduling.hpp.
> It implements a dependency graph between work packages and work is
> enqueued into a boost::thread::basic_thread_pool when all its
> dependencies are resolved. A kernel is enqueued together with a set of
> dependency_node objects which encapsulate dependencies of variables the
> kernel uses (i.e. every variable keeps track about what its latest
> dependencies are and whether these dependencies read from it or write to
> it). The current interface should be abstracted enough to allow
> implementation using different technologies(e.g. it should be possible
> to implement the scheduler in terms of HPX.).
> One of the tasks of the scheduler is to allow creation of closures where
> variables are guaranteed to exist until all kernels using them are
> finished as well as moving a variable into a closure. This is used to
> prevent an issue similar to the blocking destructor of std::future<T>.
> Instead of blocking, the variable is moved into the scheduler which then
> ensures lifetime, until all kernels are finished. This of course
> requires the kernels to be called in a way that they can cope with the
> moving of types.
As said, the scheduler should be a pure implementation detail and not
exposed to the user. I'd even suggest to rely on executors for the
scheduling needs, leaving the scheduler out of the picture completely.
Also, if you need dependencies, those shouldn't be handled by the executor,
but be better imposed by the code using it (via futures or similar).
> what is currently missing is "user created dependencies" to be used in
> conjunction with the gpu (as gpus are fully asynchronous, we have to
> register a callback that notifies the scheduler when the gpu is done
> with its computations just as the worker threads do).
> 3. basic matrix/vector classes are implemented in matrix.hpp and
> vector.hpp. The implementation is a bit convoluted for the "move into
> closure" to work. Basically they introduce another indirection. When a
> kernel is created, it references a special closure type of the variable
> (vector<T>::closure_type), which references that indirection.
> 4. the remaining files in include/aBLAS*.hpp implement the expression
> templates, which are similar to uBLAS. There are two types distinguished
> using the CRTP classes matrix_expression<Type,Device> and
> vector_expression<Type,Device>. Type is the exact type of the
> expression, Device marks the Device this is working on (cpu_tag or
> gpu_tag). The second template parameter ensures that one can not mix gpu
> and cpu expresions unless the operation explicitly allows this. While
> this looks clumsy, it removes code duplication between the different
> device types as most of the code can be used for both implementations.
> assignment.hpp implements the basic assignment operators (Except op=).
> A += B either calls kernel::assign<>(A,B) if B can be evaluated
> efficiently elementwise (e.g. A+= 2*A1+A2-A3) or calls
> B.plus_assign_to(A) which then assigns the terms one-by-one using their
> specialized kernels (e.g. A+=prod(B,C)+D is evaluated as
> kernels::gemm(B,C,A); kernels::assign<...>(A,D);)
> matrix/vector_proxy.hpp implement subranges, rows-of-matrix-operations,
> etc and matrix/vector_expression the algebraic operations.
Can I use other BLAS implementations, like for instance Intel MKL?
General note: I have the impression, that - just like any decently sized
system eventually embeds a half-baked and bug-ridden implementation of LISP
- any asynchronous library nowadays contains its own half-assed
implementation of a parallel runtime system.
So let me ask again: why not use a full blown, mature, and thoroughly tested
runtime system for your needs instead of worrying about how to schedule 5
dependent tasks your library creates?
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk