Boost logo

Boost :

Subject: Re: [boost] interest in an asynchronous linear algebra library?
From: Oswin Krause (Oswin.Krause_at_[hidden])
Date: 2015-12-07 03:47:36

Hi David,

> And your work would be recognized as being part of ublas directly and
> be
> used right away by thousands of users.
> Let me know if you're interested contributing back to ublas instead.

Yes, I am interested, but currently I do not see a roadmap as ublas is
SO big, that I fear that I can not fix as much code as my proposed
changes would break (even without the asynchronous part, gpus...). One
of the starting points of my rewrite was that I deleted around 40k lines
of ublas code before i started and then designed the interface around a
defined subset of features (namely one dense format and one sparse
format, later adding the triangular packed format. I still haven't
gotten sparse right, though.). From then on i decided to break the
functionality apart in a few places (assignment, kernels, and iterators)
and use the expression templates as a simple connection between the

Another point: I would not propose a major change to an existing boost
library without triggering a formal boost review process - this change
would break any code that relies on any ublas internals.


> Cheers,
> David
> On Thu, Nov 19, 2015 at 2:55 PM, Oswin Krause <
> Oswin.Krause_at_[hidden]> wrote:
>> Hi list,
>> Disclaimer: What I am proposing here does not exist (yet). But it will
>> in
>> the near future and if there is interest i would like to give
>> something
>> back to the boost community.
>> I would like to propose a new library to boost: asynchronous blas.
>> What is it and what can I do with it?
>> It is a linear algebra library where all computations are carried out
>> in
>> an asynchronous manner on GPU and CPU. Computing a matrix-matrix
>> product
>> does not block the current thread but instead launches a task which at
>> some
>> point returns with a result, allowing the main thread to continue
>> until the
>> result is needed. GPU and CPU tasks can be started at the same time
>> depending on each other (e.g. a CPU computation waits until the GPU
>> returned some needed result and vice-versa)
>> Why is this important? Which problems does it solve?
>> Many linear algebra libraries are target towards either
>> a) solving a few large linear algebra problems
>> b) solving many small similar problems
>> This library is targeted towards the middle ground: a medium amount of
>> decently sized problems. Here the problem is that it is often hard to
>> use
>> all the computing resources available as a single problem will not
>> exhaust
>> the device resources and there is no easy way to launch several
>> similar
>> tasks in parallel.
>> This problem occurs often in GPU computing where out-of-order
>> execution of
>> kernels is used to fill up the device. Similarly, it is often hard to
>> make
>> CPU and GPU work together as GPU tasks are asynchronous while CPU
>> tasks are
>> often synchronous and it is not straight forward to couple both
>> without
>> blocking either CPU or GPU.
>> How does it work?
>> Expression templates similar to boost::ublas are used to create a
>> dependency graph: which object is used in which computations and which
>> tasks have to wait to ensure a correct outcome?
>> a simple case would be two independent computations like
>> matrix A=prod(B,trans(B));
>> matrix C=prod(trans(B),B);
>> a normal library like ublas would wait until A is computed before C is
>> evaluated. Instead, it can be more efficient to just add the two tasks
>> to a
>> list and let worker threads execute them as soon as resources are
>> available. First the computation of A is put into the graph and A is
>> marked
>> as "being written to" while B is marked as "being read". When C is
>> added to
>> the list, it does not depend on the first computation as several tasks
>> can
>> read from B at the same time. So another worker thread can grab the
>> task
>> and execute it.
>> If we now add the task
>> B += prod(D,prod(C,trans(D));
>> the library computes it as
>> matrix temp = prod(C,trans(D);
>> B += prod(D,temp).
>> temp can be computed as soon as C is computed. But B can only be
>> computed
>> after A and temp are computed as we can only write to B after all
>> previous
>> tasks using B are finished.
>> Are you qualified?
>> I am main contributor to
>> and as part of it I forked
>> ublas
>> which now uses several linear algebra libraries as possible backends
>> (ATLAS, openBLAS...).
>> What is the timeline?
>> I would start developing a minimal example ASAP based on the rewrite
>> of
>> ublas above. I would use a minimum set of features (only dense square
>> matrices) and basic GPU support.
>> Opinions?
>> Best,
>> Oswin
>> _______________________________________________
>> Unsubscribe & other changes:
> _______________________________________________
> Unsubscribe & other changes:

Boost list run by bdawes at, gregod at, cpdaniel at, john at