
Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-24 06:00:53

> Just a footnote to that point: once you have a fast GEMM, the rest of BLAS level 3 is not such a long road. In particular,
> all the performance of SYMM, HEMM, TRMM, SYRK, … just depends on the performance of the ugemm micro kernel.
> For example, in SYMM C = A*B you consider MCxMC blocks of the symmetric matrix A and MCxNC blocks of B.
> Multiplication is done blockwise by first packing blocks into buffers blockA and blockB. If the elements of A are stored
> in the upper triangular part, you have three cases:
> (1) The block is completely in the upper part, and you pack it using the GEMM-pack.
> (2) The block is completely in the lower part, and you pack its transpose using the GEMM-pack.
> (3) The block is on the diagonal, so you need an extra SYMM-pack routine that also packs the "virtual" elements (about 20 lines of code).
> But after packing you can use the GEMM macro kernel (and thereby the GEMM micro kernel).
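
Case (3) above can be sketched in a few lines. This is a hypothetical illustration (the name symm_pack_diag and the plain col-major layout are my assumptions, not the poster's actual code); a real pack routine would additionally reorder the block into MR-wide panels for the micro kernel:

```cpp
#include <cassert>
#include <cstddef>

// Pack an mc x mc diagonal block of a symmetric matrix A whose elements
// are stored only in the upper triangular part (col-major, leading
// dimension lda).  The "virtual" elements below the diagonal are filled
// in by mirroring the stored ones across the diagonal.
template <typename T>
void symm_pack_diag(std::size_t mc, const T* A, std::size_t lda, T* buffer)
{
    for (std::size_t j = 0; j < mc; ++j) {
        for (std::size_t i = 0; i < mc; ++i) {
            // stored element if i <= j (upper part), mirrored otherwise
            buffer[i + j*mc] = (i <= j) ? A[i + j*lda] : A[j + i*lda];
        }
    }
}
```

After this, the buffer looks exactly like a GEMM-packed block, so the macro kernel needs no special case.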

Some more notes on advantages of a C++ BLAS implementation:

(a) mixed precision arithmetic

For C = beta*C + alpha*A*B the matrices can have different element types as long as the element-wise
operation C(i,j) = beta*C(i,j) + alpha*A(i,l)*B(l,j) is supported.

Because the elements get copied into buffers anyway, you can upcast them to a common type. The macro kernel
then operates on homogeneous types. So if the common type of alpha, A, and B is double, the multiplication runs
with the performance of dgemm. By the way, that already works in the example I posted.
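
A minimal sketch of that upcast-while-packing idea (function name pack_upcast is mine, for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Copy elements of type TA into a buffer of type TC, upcasting once
// during the pack; the macro kernel then only ever sees homogeneous TC.
template <typename TA, typename TC>
void pack_upcast(std::size_t n, const TA* A, TC* buffer)
{
    for (std::size_t i = 0; i < n; ++i)
        buffer[i] = static_cast<TC>(A[i]);   // the only place a cast happens
}

// The common element type is computed at compile time,
// e.g. float * double -> double, so the kernel runs at dgemm speed:
using T = std::common_type<float, double>::type;
static_assert(std::is_same<T, double>::value, "common type is double");
```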

(b) Getting rid of col-major vs row-major

Packing is only an O(n^2) operation. After packing, the elements are in a cache-friendly format in memory. My
code supports

        (i) col-major with incRow=1, incCol=numRows
        (ii) row-major with incRow=numCols, incCol=1

But it is not required that either incRow or incCol is 1. That is nice if your matrix actually references only every second
row and every third column. In MATLAB notation that would be A(1:2:m, 1:3:n). For the performance that does not
matter …
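
A sketch of how packing absorbs arbitrary strides (pack_block is an illustrative name; the incRow/incCol convention is the one described above, where element (i,j) lives at A[i*incRow + j*incCol]):

```cpp
#include <cassert>
#include <cstddef>

// Copy an m x n block from a strided matrix view into a contiguous
// col-major buffer.  col-major input is incRow=1, incCol=numRows;
// row-major is incRow=numCols, incCol=1; and A(1:2:m, 1:3:n) on a
// row-major matrix is simply incRow=2*numCols, incCol=3 -- the macro
// kernel never sees the difference.
template <typename T>
void pack_block(std::ptrdiff_t m, std::ptrdiff_t n,
                const T* A, std::ptrdiff_t incRow, std::ptrdiff_t incCol,
                T* buffer)
{
    for (std::ptrdiff_t j = 0; j < n; ++j)
        for (std::ptrdiff_t i = 0; i < m; ++i)
            buffer[i + j*m] = A[i*incRow + j*incCol];
}
```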

(c) Matrix-expressions

Even if a matrix is actually an expression that can be evaluated component-wise, like A = A1 + A2, the product
C = (A1+A2)*(B1+B2) can be computed with the same performance as with evaluated matrices C = A*B, as
long as each element of the expression can be evaluated in O(1).
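
A minimal expression-template sketch of this point (the types MatrixSum and DenseMatrix are hypothetical, not from the posted code): the sum node exposes the same O(1) element access as a concrete matrix, so the pack routine consumes (A1+A2) directly, with no temporary.

```cpp
#include <cassert>
#include <cstddef>

// Concrete col-major matrix with O(1) element access.
struct DenseMatrix {
    const double* data;
    std::size_t   numRows;
    double operator()(std::size_t i, std::size_t j) const {
        return data[i + j*numRows];
    }
};

// Lazy sum: each element is evaluated on demand, also in O(1).
template <typename M1, typename M2>
struct MatrixSum {
    const M1& a;
    const M2& b;
    double operator()(std::size_t i, std::size_t j) const {
        return a(i, j) + b(i, j);
    }
};

// The pack routine is written once against any type with (i,j) access,
// so an expression packs at the same speed as an evaluated matrix.
template <typename MA>
void pack(const MA& A, std::size_t m, std::size_t n, double* buffer)
{
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < m; ++i)
            buffer[i + j*m] = A(i, j);
}
```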

Features like (a) and (b) are supported in BLIS. But as they are using C99, the implementation, which uses macros
instead of simple template functions, is not so easy to understand (IMHO). By default they have disabled these
features due to long compile times. In C++ this comes for free. Or, more precisely, it comes for free for us, as
the ugly part is done by the compiler …

Anyway, I think what C++ needs is its own "BLAS C++ Standard Interface" for low-level functionality. This
could be backed by simple reference implementations or by sophisticated high-performance implementations. Either way,
we would have a common ground on which to build high-level C++ libraries.
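
To make the idea concrete, here is one possible shape such a low-level entry point could take (this signature is my own sketch, not a proposed standard): element types as template parameters, strides as runtime parameters, implementable either as the naive reference loop below or by a packed high-performance implementation with the identical signature.

```cpp
#include <cassert>
#include <cstddef>

// Reference implementation of C = beta*C + alpha*A*B with general
// strides and mixed element types.  Element (i,j) of A lives at
// A[i*incRowA + j*incColA], and likewise for B and C.
template <typename Alpha, typename TA, typename TB,
          typename Beta, typename TC>
void gemm(std::ptrdiff_t m, std::ptrdiff_t n, std::ptrdiff_t k,
          Alpha alpha,
          const TA* A, std::ptrdiff_t incRowA, std::ptrdiff_t incColA,
          const TB* B, std::ptrdiff_t incRowB, std::ptrdiff_t incColB,
          Beta beta,
          TC* C, std::ptrdiff_t incRowC, std::ptrdiff_t incColC)
{
    for (std::ptrdiff_t j = 0; j < n; ++j)
        for (std::ptrdiff_t i = 0; i < m; ++i) {
            TC sum = TC(0);
            for (std::ptrdiff_t l = 0; l < k; ++l)
                sum += A[i*incRowA + l*incColA] * B[l*incRowB + j*incColB];
            C[i*incRowC + j*incColC] =
                beta * C[i*incRowC + j*incColC] + alpha * sum;
        }
}
```

A high-level library could then dispatch to whichever implementation of this signature is available.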

>> I'm still willing to donate my partial uBLAS rewrite; unfortunately I am a bit short on time to polish it (I just finished my PhD and have a huge load of work on my desk). But if someone opened a git branch for that, I could try to make the code ready (porting my implementation back to Boost namespaces etc.).
>> On 2016-01-23 18:53, palik imre wrote:
>>> Hi All,
>>> what's next? I mean what is the development process for ublas?
>>> Now we have a C-like implementation that outperforms both the
>>> mainline and the branch version (axpy_prod). What will we do with
>>> that?
>>> As far as I see we have the following options:
>>> 1) Create a C++ template magic implementation out of it. But for
>>> this, at the least we would need compile-time access to the target
>>> instruction set. Any idea how to do that?
>>> 2) Create a compiled library implementation out of it, and choose the
>>> implementation run-time based on the CPU capabilities.
>>> 3) Include some good defaults/defines, and hope the user will use
>>> them.
>>> 4) Don't include it, and do something completely different.
>>> What do you think?
>>> Cheers,
>>> Imre
>>> _______________________________________________
>>> ublas mailing list
>>> ublas_at_[hidden]
>>> Sent to: Oswin.Krause_at_[hidden]