
Ublas : 
Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 20160126 19:39:03
On 23 Jan 2016, at 18:53, palik imre <imre_palik_at_[hidden]> wrote:
> Hi All,
>
> what's next? I mean what is the development process for ublas?
>
> Now we have a Clike implementation that outperforms both the mainline, and the branch version (axpy_prod). What will we do with that?
>
> As far as I see we have the following options:
>
> 1) Create a C++ template magic implementation out of it. But for this, at the least we would need compiletime access to the target instruction set. Any idea how to do that?
>
> 2) Create a compiled library implementation out of it, and choose the implementation runtime based on the CPU capabilities.
>
> 3) Include some good defaults/defines, and hope the user will use them.
>
> 4) Don't include it, and do something completely different.
>
>
> What do you think?
At the moment I am not sure, but pretty sure, that you don’t have to rewrite uBLAS to support good performance. uBLAS is a pretty fine
library and already has what you need. So just extend it. At least for 383 lines of code :)
As I was pretty busy the last days, so I could not continue until today. I made some minimal modifications to adopt the GEMM implementation
to take advantage of uBLAS:
 The GEMM frame algorithm and the pack functions now work with any uBLAS matrix that supports element access through the notation A(i,j)
 Only for matrix C in the operation C < beta*C + alpha*A*B it requires that the matrix is stored rowmajor or colmajor. The other matrices can
be expressions. Here I need some help to get rid of this ugly lines of code:
TC *C_ = &C(0,0);
const size_type incRowC = &C(1,0)  &C(0,0);
const size_type incColC = &C(0,1)  &C(0,0);
 Things like the following should work without performance penalty (and in particularly without creating temporaries):
blas::axpy_prod(3*A1+A2, B1  B2, C, matprodUpdate)
 And matrices A and B can be symmetric or whatever. Of course if A or B is triangular a specialized implementation can take advantage
 The storage format does not matter. You can mix rowmajor and colmajor, packed, …
Here is the page for this:
http://www.mathematik.uniulm.de/~lehn/test_ublas/session4/page01.html
Please note, creating the benchmarks for the symmetric matrixmatrix product really takes long because the current uBLAS implementation seems
to be much slower than for general matrices. So I reduced the problem size for now. It would be helpful if some people could reproduce the benchmarks
and also check different cases:
 different expressions
 different element types, e.g. A with floats, B with double etc. At the moment there is only a micro kernel for double. The performance depends on the
common type of A and B. So with complex<..> the reference implementation. But the performance should at least be constant in this case.
It would in particular be nice to have benchmarks from people with an Intel Sandybridge or Haswell as the micro kernels are optimized for these
architectures. If interested I can extend the benchmarks to compare with Intel MKL. For Linux there is a free noncommercial version available.
Cheers,
Michael