Michael,
please see below.

On 01/21/2016 05:23 PM, Michael Lehn wrote:
Hi Nasos,
first of all, I don't want to take credit that isn't mine, and want to
point out that this is not my algorithm. It is based on the BLIS
approach to the matrix-matrix product.
For a few cores (4-8) it can easily be made multithreaded. For
many-core processors like the Intel Xeon Phi this is a bit more
sophisticated, but still not too hard.
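
(For illustration, a minimal sketch of the multithreaded idea under the
usual blocking scheme: the loop over column blocks of C is shared among
OpenMP threads, so each thread writes a disjoint set of columns of C and
needs no synchronization. The block size and the naive inner loops are
placeholders, not the demo code; compile with -fopenmp.)

    #include <algorithm>
    #include <cstddef>

    // C = C + A*B, column-major.  The j-loop over column blocks of B and
    // C is distributed across threads; each thread owns a disjoint block
    // of columns of C, so no locking is needed.
    void gemm_mt(std::ptrdiff_t m, std::ptrdiff_t n, std::ptrdiff_t k,
                 const double *A, const double *B, double *C)
    {
        const std::ptrdiff_t NC = 256;   // column block size (illustrative)
        #pragma omp parallel for schedule(dynamic)
        for (std::ptrdiff_t j0 = 0; j0 < n; j0 += NC) {
            std::ptrdiff_t j1 = std::min(j0 + NC, n);
            for (std::ptrdiff_t j = j0; j < j1; ++j)
                for (std::ptrdiff_t l = 0; l < k; ++l)
                    for (std::ptrdiff_t i = 0; i < m; ++i)
                        C[i + j*m] += A[i + l*m] * B[l + j*k];
        }
    }
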
Setting up Xeon Phis is indeed an issue, especially because they are
"locked" to icpc. OpenMP is working properly, though.
The demo I posted does not use micro-kernels that exploit SSE, AVX or
FMA instructions. With those, the matrix product is on par with Intel
MKL, just like BLIS. For my platforms I wrote my own micro-kernels,
but the interface of the function ugemm is compatible with BLIS.
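
(For readers following along: a BLIS-style micro-kernel computes
C <- beta*C + alpha*A*B on a tiny MR x NR block of C, with A and B
already packed into contiguous panels. A scalar reference version might
look like the sketch below; the names and the MR/NR values are
illustrative, and the real micro-kernels replace the inner loops with
SSE/AVX/FMA intrinsics or assembly.)

    enum { MR = 4, NR = 4 };   // register block sizes, platform dependent

    // Reference micro-kernel: C <- beta*C + alpha*A*B, where A is a
    // packed MR x kc panel (stored column by column) and B a packed
    // kc x NR panel (stored row by row).  C is a general MR x NR block
    // with row stride incRowC and column stride incColC.
    void ugemm(long kc, double alpha,
               const double *A, const double *B,
               double beta,
               double *C, long incRowC, long incColC)
    {
        double AB[MR*NR] = {};              // accumulator, kept in registers
        for (long l = 0; l < kc; ++l) {     // kc rank-1 updates
            for (long j = 0; j < NR; ++j)
                for (long i = 0; i < MR; ++i)
                    AB[i + j*MR] += A[i] * B[j];
            A += MR;
            B += NR;
        }
        for (long j = 0; j < NR; ++j)
            for (long i = 0; i < MR; ++i) {
                double c = (beta == 0.0) ? 0.0   // beta==0: ignore old C
                         : beta * C[i*incRowC + j*incColC];
                C[i*incRowC + j*incColC] = c + alpha * AB[i + j*MR];
            }
    }
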
If you compile with -O3 I think you are getting near-optimal SSE
vectorization. GCC is truly impressive, and Intel's compiler even more so.
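
(Something like the following, assuming GCC; the file name is just a
placeholder. -march=native lets the compiler use whatever SSE/AVX/FMA
the machine offers, and -fopenmp enables the OpenMP pragmas.)

    g++ -O3 -march=native -fopenmp -o bench bench.cpp
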
Maybe you could help me to integrate your code into the
benchmark example I posted above.
I will try to find some time to spend on the code.
About Blaze: Do they have their own implementation of a
matrix-matrix product? It seems to require a
tuned BLAS implementation (“Otherwise you get only poor
performance”) for the matrix-matrix product.
I will check the benchmarks I ran. I think I was using MKL with
Blaze, but Blaze is taking it a step further (I am not sure how) and
getting better performance than the underlying GEMM. Their
benchmarks indicate that they are faster than MKL
(https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication).