Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-24 06:00:53
>
>
> Just a footnote to that point. Once you have a fast GEMM the rest of BLAS level 3 is not such a long road. In particular
> all the performance of SYMM, HEMM, TRMM, SYRK, … just depends on the performance of the ugemm micro kernel.
>
> For example in SYMM C = A*B you consider MCxMC blocks of the symmetric matrix A and MCxNC blocks of B.
> Multiplication is done blockwise by packing blocks first into buffers blockA and blockB. If the elements of A are stored
> in the upper triangular part, you have three cases:
>
> (1) The block is completely in the upper part and you pack it using the GEMMpack
> (2) The block is completely in the lower part and you pack its transposed using the GEMMpack
> (3) The block is on the diagonal. So you need an extra SYMMpack routine that also packs the “virtual” elements (so 20 lines of code)
>
> But after packing you can use the GEMM macro kernel (and thereby the GEMM micro kernel).
Some more notes on advantages of a C++ BLAS implementation:
(a) mixed precision arithmetic
For C = beta*C + alpha*A*B the matrices can have different element types as long as the element-wise
operation C(i,j) = beta*C(i,j) + alpha*A(i,l)*B(l,j) is supported.
Because the elements get copied into buffers anyway, you can upcast them to a common type. The macro kernel
then operates on homogeneous types. So if the common type of alpha, A and B is double, it runs with the performance
of dgemm. By the way, this already works in the example I posted.
(b) Getting rid of col-major vs. row-major
Packing is only an O(n^2) operation. After packing, the elements are in a cache-friendly format in memory. My
code supports
(i) col-major with incRow=1, incCol=numRows
(ii) row-major with incRow=numCols, incCol=1
But it is not required that either incRow or incCol is 1. That is nice if your matrix actually references every second
row and every third column. In Matlab notation that would be A(1:2:m, 1:3:n). For the performance that does not
matter …
(c) Matrix expressions
Even if a matrix is actually an expression that can be evaluated component-wise, like A = A1 + A2, the product
C = (A1+A2)*(B1+B2) can be computed with the same performance as with evaluated matrices C = A*B, as
long as each element of the expression can be evaluated in O(1).
Features like (a) and (b) are supported in BLIS. But as they are using C99, the implementation, which uses macros
instead of simple template functions, is not so easy to understand (IMHO). By default they have disabled these
features due to long compile times. In C++ this comes for free. Or more precisely, it comes for free for us, as
the ugly part is done by the compiler …
Anyway, I think what C++ would need is its own "BLAS C++ Standard Interface" for low-level functionality. This
can be supported by simple reference implementations or by sophisticated high-performance implementations. But
we would have common ground on which one can build high-level C++ libraries.
>
>
>>
>> I'm still willing to donate my partial uBLAS rewrite; unfortunately I am a bit short on time to polish it (just finished my PhD and have a huge load of work on my desk). But if someone opened a git branch for that, I could try to make the code ready (porting my implementation back to Boost namespaces etc.).
>>
>>
>> On 2016-01-23 18:53, palik imre wrote:
>>> Hi All,
>>> what's next? I mean what is the development process for ublas?
>>> Now we have a C-like implementation that outperforms both the
>>> mainline, and the branch version (axpy_prod). What will we do with
>>> that?
>>> As far as I see we have the following options:
>>> 1) Create a C++ template magic implementation out of it. But for
>>> this, at the least we would need compile-time access to the target
>>> instruction set. Any idea how to do that?
>>> 2) Create a compiled library implementation out of it, and choose the
>>> implementation runtime based on the CPU capabilities.
>>> 3) Include some good defaults/defines, and hope the user will use
>>> them.
>>> 4) Don't include it, and do something completely different.
>>> What do you think?
>>> Cheers,
>>> Imre
>>> _______________________________________________
>>> ublas mailing list
>>> ublas_at_[hidden]
>>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>>> Sent to: Oswin.Krause_at_[hidden]