 # Boost :

Subject: [boost] interest in an asynchronous linear algebra library?
From: Oswin Krause (Oswin.Krause_at_[hidden])
Date: 2015-11-19 09:55:52

Hi list,

Disclaimer: What I am proposing here does not exist (yet). But it will
in the near future and if there is interest i would like to give
something back to the boost community.

I would like to propose a new library to boost: asynchronous blas.

What is it and what can I do with it?
It is a linear algebra library where all computations are carried out in
an asynchronous manner on GPU and CPU. Computing a matrix-matrix product
does not block the current thread but instead launches a task which at
some point returns with a result, allowing the main thread to continue
until the result is needed. GPU and CPU tasks can be started at the same
time depending on each other (e.g. a CPU computation waits until the GPU
returned some needed result and vice-versa)

Why is this important? Which problems does it solve?
Many linear algebra libraries are target towards either
a) solving a few large linear algebra problems
b) solving many small similar problems

This library is targeted towards the middle ground: a medium amount of
decently sized problems. Here the problem is that it is often hard to
use all the computing resources available as a single problem will not
exhaust the device resources and there is no easy way to launch several
similar tasks in parallel.
This problem occurs often in GPU computing where out-of-order execution
of kernels is used to fill up the device. Similarly, it is often hard to
make CPU and GPU work together as GPU tasks are asynchronous while CPU
tasks are often synchronous and it is not straight forward to couple
both without blocking either CPU or GPU.

How does it work?
Expression templates similar to boost::ublas are used to create a
dependency graph: which object is used in which computations and which
tasks have to wait to ensure a correct outcome?

a simple case would be two independent computations like

matrix A=prod(B,trans(B));
matrix C=prod(trans(B),B);

a normal library like ublas would wait until A is computed before C is
evaluated. Instead, it can be more efficient to just add the two tasks
to a list and let worker threads execute them as soon as resources are
available. First the computation of A is put into the graph and A is
marked as "being written to" while B is marked as "being read". When C
is added to the list, it does not depend on the first computation as
several tasks can read from B at the same time. So another worker thread
can grab the task and execute it.

B += prod(D,prod(C,trans(D));

the library computes it as

matrix temp = prod(C,trans(D);
B += prod(D,temp).

temp can be computed as soon as C is computed. But B can only be
computed after A and temp are computed as we can only write to B after
all previous tasks using B are finished.

Are you qualified?
I am main contributor to https://github.com/Shark-ML/Shark/
and as part of it I forked
ublas
https://github.com/Shark-ML/Shark/tree/master/include/shark/LinAlg/BLAS
which now uses several linear algebra libraries as possible backends
(ATLAS, openBLAS...).

What is the timeline?
I would start developing a minimal example ASAP based on the rewrite of
ublas above. I would use a minimum set of features (only dense square
matrices) and basic GPU support.

Opinions?

Best,
Oswin