Subject: Re: [ublas] GSOC 2013
From: oswin krause (oswin.krause_at_[hidden])
Date: 2013-03-26 15:56:14


Hi,

there is one more thing I want to comment on, and it is on the more
serious side:

On 23.03.2013 16:15, Nasos Iliopoulos wrote:
> David,
> Since mdsd:array is a generic multi-dimensional container it is not
> bound to algebraic operations. I expect that with proper aligned
> memory allocation and SSE algorithms (it is easy to add a custom
> storage container that supports that) it will be as fast as MKL,
> GotoBLAS, Eigen or armadillo. I believe that within that context, a
> GSOC project will need to include both the matrix container and the
> SSE algorithms tasks, or even AVX.
> (http://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
>

and also the original post from David himself:

On 23.03.2013 13:47, David Bellot wrote:
> OK, the idea behind this is to have a clean framework to enable
> optimization based on SSE, Neon, multi-core, ... you name it.

Just to make this clear: in the current state of the library, SSE, AVX,
multi-core computation etc. won't cut it as soon as the arguments
involved are bigger than ~32KB. In that case uBLAS performance is
memory bound, so all we would achieve is waiting more efficiently for
the next block of memory. And even if it were not, the way uBLAS is
designed makes it impossible to use vectorization aside from C-style
functions like axpy_prod, which can in 99% of all relevant cases be
mapped onto BLAS2/BLAS3 calls of the optimized BLAS libraries (which
give you AVX/SSE and OpenMP for free). If you expect SSE to help you
when computing

A += prod(B, C);

then you will be desperately disappointed by the current design.
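To make the contrast concrete, here is a minimal sketch (the matrix
size is chosen arbitrarily) of the two paths; axpy_prod lives in
boost/numeric/ublas/operation.hpp:

    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/operation.hpp>

    using namespace boost::numeric::ublas;

    int main()
    {
        const std::size_t n = 1024;   // large enough to be memory bound
        matrix<double> A(n, n), B(n, n), C(n, n);

        // Expression-template path: the product is evaluated element
        // by element through scalar loops; no blocking, no vectorization.
        A += prod(B, C);

        // C-style path: with init == false, axpy_prod accumulates
        // A += B * C in one pass; this is the kind of entry point that
        // could be forwarded to an optimized BLAS3 gemm.
        axpy_prod(B, C, A, false);
    }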

Now maybe some of you are thinking: "But all fast linear algebra
libraries are using SSE, so you must be wrong." Simple answer: these
libraries are fast because they are not memory bound; they optimize
for that first. You can experience this yourself by comparing the
performance of copying a big matrix to that of transposing it. Then
try the transposition block-wise: allocate a small buffer, say 16x16
elements, read 16x16 blocks from the matrix, write them transposed
into the buffer, and then copy the buffer to the correct spot in the
target matrix. This gives a factor-7 speed-up on my machine. No SSE,
no AVX.
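A minimal sketch of that experiment (assuming row-major n x n matrices
stored in plain arrays):

    #include <algorithm>
    #include <cstddef>

    // Block size: a 16x16 tile of doubles is 2KB and stays in L1.
    const std::size_t BS = 16;

    // dst = transpose(src); both are n x n, row-major.
    void transpose_blocked(const double* src, double* dst, std::size_t n)
    {
        double buf[BS][BS];
        for (std::size_t ib = 0; ib < n; ib += BS)
            for (std::size_t jb = 0; jb < n; jb += BS) {
                const std::size_t bi = std::min(BS, n - ib);
                const std::size_t bj = std::min(BS, n - jb);
                // Read one block, transposed, into the small buffer.
                for (std::size_t i = 0; i < bi; ++i)
                    for (std::size_t j = 0; j < bj; ++j)
                        buf[j][i] = src[(ib + i) * n + (jb + j)];
                // Copy the buffer contiguously into the target matrix.
                for (std::size_t j = 0; j < bj; ++j)
                    for (std::size_t i = 0; i < bi; ++i)
                        dst[(jb + j) * n + (ib + i)] = buf[j][i];
            }
    }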

Don't trust me, trust the author of the GotoBLAS library:

Goto, Kazushige, and Robert A. van de Geijn. "Anatomy of
high-performance matrix multiplication." ACM Transactions on
Mathematical Software (TOMS) 34.3 (2008): 12.

None of us has enough time to implement fast linear algebra
algorithms ourselves. Instead we should fall back to the numeric
bindings as often as possible and use the power of expression
templates to generate an optimal sequence of BLAS2/BLAS3 calls, as in
the sketch below.
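A hypothetical sketch of that idea (the Mat and ProdExpr types and the
operator overloads are invented here; only cblas_dgemm is a real CBLAS
call): prod() merely records its operands, and operator+= lowers the
whole expression to a single BLAS3 call.

    #include <cstddef>
    #include <vector>
    #include <cblas.h>   // e.g. OpenBLAS, ATLAS or MKL

    // Invented minimal matrix type, row-major storage.
    struct Mat {
        std::size_t rows, cols;
        std::vector<double> data;
    };

    struct ProdExpr { const Mat& lhs; const Mat& rhs; };

    // Nothing is computed here; the product is captured lazily.
    ProdExpr prod(const Mat& b, const Mat& c) { return ProdExpr{b, c}; }

    // "A += prod(B, C)" becomes one dgemm: A = 1.0*B*C + 1.0*A.
    Mat& operator+=(Mat& a, const ProdExpr& e)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    e.lhs.rows, e.rhs.cols, e.lhs.cols,
                    1.0, e.lhs.data.data(), e.lhs.cols,
                    e.rhs.data.data(), e.rhs.cols,
                    1.0, a.data.data(), a.cols);
        return a;
    }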

I would also like to take part in that if it happens.

Greetings,
Oswin