Subject: Re: [ublas] Matrix multiplication performance

Hi Nasos,
first of all I don't want to take credit that is not mine, so let me point out that this is not my algorithm. It is based on
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
https://github.com/flame/blis
For a few cores (4-8) it can easily be made multithreaded. For many-core processors like the Intel Xeon Phi this is a bit more involved, but still not too hard. The demo I posted does not use micro-kernels that exploit SSE, AVX or FMA instructions. With such kernels the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
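Roughly, such a micro-kernel computes one MR x NR block of C from packed panels of A and B. Here is a plain C++ sketch of that idea (the name ugemm_ref, the template parameters and the exact argument list are just for illustration, not the actual ugemm interface):

    // Illustrative scalar reference micro-kernel in the BLIS style.
    // It updates an MR x NR block of C: C := beta*C + alpha*A*B, where
    // A is a packed MR x kc panel and B is a packed kc x NR panel.
    // A real micro-kernel replaces these loops with SSE/AVX/FMA code.
    template <typename T, int MR, int NR>
    void ugemm_ref(int kc, T alpha, const T* A, const T* B,
                   T beta, T* C, int rsC, int csC)
    {
        T AB[MR * NR] = {};                    // local accumulator

        for (int l = 0; l < kc; ++l)           // rank-1 updates over k
            for (int j = 0; j < NR; ++j)
                for (int i = 0; i < MR; ++i)
                    AB[i + j * MR] += A[i + l * MR] * B[j + l * NR];

        for (int j = 0; j < NR; ++j)           // scale and write back C
            for (int i = 0; i < MR; ++i)
                C[i * rsC + j * csC] = beta * C[i * rsC + j * csC]
                                     + alpha * AB[i + j * MR];
    }

The row and column strides rsC and csC of C are what make the same kernel usable for row-major and column-major storage.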
Maybe you could help me integrate your code into the benchmark example I posted above.
About Blaze: do they have their own implementation of a matrix-matrix product? It seems to require a tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product. IMHO they have only tuned the "easy" stuff like BLAS Level 1 and Level 2. In that case it makes more sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong, let me know.
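If you want that baseline number, timing the underlying dgemm directly is straightforward; a minimal sketch using the standard CBLAS interface (the problem size is a placeholder, and you link against whatever BLAS Blaze was configured with):

    // Minimal baseline: time the underlying dgemm via the CBLAS interface.
    #include <cblas.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n = 2000;                       // placeholder problem size
        std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

        auto t0 = std::chrono::steady_clock::now();
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
        auto t1 = std::chrono::steady_clock::now();

        double s = std::chrono::duration<double>(t1 - t0).count();
        std::printf("dgemm: %.2f GFLOPS\n", 2.0 * n * n * n / (s * 1e9));
    }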
About the block sizes: in my experience you get better performance if you choose them dynamically at runtime depending on the problem size. Depending on the architecture you can just specify ranges like 256-384 for the blocking factor MC. In my code MC then also has to satisfy the restriction that it is divisible by the factor MR. I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible, and having blocks of similar sizes easily pays off.
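As a rough illustration of what I mean (the helper name and the exact numbers are made up for this example, not taken from my code):

    // Sketch: choose the blocking factor MC at runtime so that the m
    // dimension splits into blocks of nearly equal size, MC stays close
    // to an architecture-dependent upper bound (e.g. 384), and the
    // result is divisible by MR.
    inline int choose_mc(int m, int mr, int mc_max = 384)
    {
        int blocks = (m + mc_max - 1) / mc_max;   // blocks of at most mc_max
        int mc     = (m + blocks - 1) / blocks;   // nearly equal block size
        return ((mc + mr - 1) / mr) * mr;         // round up to a multiple of MR
    }

For m = 400 and MR = 4 this gives MC = 200, i.e. two blocks of 200 rows each instead of one block of 384 plus a leftover of 16.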
Cheers,
Michael