On 22 Jan 2016, at 00:28, nasos <nasos_i@hotmail.com> wrote:

Michael,
please see below

On 01/21/2016 05:23 PM, Michael Lehn wrote:
Hi Nasos,

first of all, I don’t want to take credit that isn’t due and want to point out that this is not my algorithm.  It is based on

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf

https://github.com/flame/blis

For a few cores (4-8) it can easily be made multithreaded.  For many-core processors like the Intel Xeon Phi this is a bit
more involved, but still not too hard.
Setting up Xeon Phis is indeed an issue, especially because they are "locked" to icpc. OpenMP is working properly, though.
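
To give an idea of what "easily be made multithreaded" means: a minimal sketch, assuming the BLIS-style structure from the paper (the names mgemm, ugemm, MR, NR and the packing layout are assumptions here, not the code from the demo). The two loops around the micro-kernel inside the macro-kernel write disjoint tiles of C, so for 4-8 cores an OpenMP pragma on them is usually all it takes:

constexpr int MR = 4;   // rows of a micro-tile of C
constexpr int NR = 4;   // columns of a micro-tile of C

// micro-kernel: one MR x NR tile, C = beta*C + alpha*A*B
// (a portable reference version is sketched further down)
void ugemm(int kc, double alpha, const double *A, const double *B,
           double beta, double *C, int incRowC, int incColC);

// macro-kernel: C(mc x nc) = beta*C + alpha*A_packed*B_packed
// mc and nc are assumed to be multiples of MR and NR; the fringe
// cases are omitted in this sketch.
void mgemm(int mc, int nc, int kc, double alpha,
           const double *A_packed,   // mc x kc, packed in MR-wide row panels
           const double *B_packed,   // kc x nc, packed in NR-wide column panels
           double beta,
           double *C, int incRowC, int incColC)
{
    const int mp = mc / MR;   // number of micro-panels of A
    const int np = nc / NR;   // number of micro-panels of B

    // Every (i,j) iteration writes a disjoint MR x NR tile of C,
    // so the loops can be split across the cores without locking.
    #pragma omp parallel for collapse(2) schedule(static)
    for (int j = 0; j < np; ++j) {
        for (int i = 0; i < mp; ++i) {
            ugemm(kc, alpha,
                  &A_packed[i*MR*kc],                  // micro-panel of A
                  &B_packed[j*NR*kc],                  // micro-panel of B
                  beta,
                  &C[i*MR*incRowC + j*NR*incColC],     // MR x NR tile of C
                  incRowC, incColC);
        }
    }
}

Compiling with -fopenmp (and -O3) is enough to try this; the Xeon Phi needs more than one parallel loop and careful placement of the packing buffers, which is the more involved part.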

The demo I posted does not use micro-kernels that exploit SSE, AVX or
FMA instructions.  With such kernels the matrix product is on par with Intel MKL, just like BLIS.  For my platforms I wrote
my own micro-kernels, but the interface of the ugemm function is compatible with BLIS.

If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive, and the Intel compiler even more so.
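
For reference, a portable fallback with a BLIS/ulmBLAS-style ugemm interface looks roughly like this (a sketch; MR, NR and the exact parameter order are assumptions, not the demo's actual kernel). It is this plain loop nest that -O3 auto-vectorizes quite well, and that the hand-written SSE/AVX/FMA micro-kernels replace:

constexpr int MR = 4;   // rows of the micro-tile
constexpr int NR = 4;   // columns of the micro-tile

// Reference micro-kernel: C = beta*C + alpha*A*B for one MR x NR tile of C.
// A points to an MR x kc micro-panel stored column by column,
// B points to a kc x NR micro-panel stored row by row.
void ugemm(int kc, double alpha, const double *A, const double *B,
           double beta, double *C, int incRowC, int incColC)
{
    double AB[MR*NR] = {};   // local accumulator; small enough for registers/L1

    // kc rank-1 updates: AB += a_l * b_l^T
    for (int l = 0; l < kc; ++l) {
        for (int j = 0; j < NR; ++j) {
            for (int i = 0; i < MR; ++i) {
                AB[i + j*MR] += A[i] * B[j];
            }
        }
        A += MR;
        B += NR;
    }

    // scale the accumulator into C
    // (a production kernel treats beta == 0 separately to avoid reading C)
    for (int j = 0; j < NR; ++j) {
        for (int i = 0; i < MR; ++i) {
            C[i*incRowC + j*incColC] =
                beta*C[i*incRowC + j*incColC] + alpha*AB[i + j*MR];
        }
    }
}

The tuned kernels mainly buy better register blocking and FMA scheduling on top of what the auto-vectorizer already does with a loop nest like this.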
Maybe you could help me integrate your code into the benchmark example I posted above.

I will try to find some time to spend on the code.
About Blaze:  Do they have their own implementation of a matrix-matrix product?  It seems to require a
tuned BLAS implementation (“Otherwise you get only poor performance”) for the matrix-matrix product.
I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze takes it a step further (I am not sure how) and gets better performance than the underlying GEMM. Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication).

I started today with similar experiments on BLAZE and had a closer look at their internal implementation.  By default
they are calling an external BLAS backend.  On my machine I used the Intel MKL.   But you are right, they also have
an internal implementation that can be used if no external BLAS is available.  I will publish the results on this page:

http://www.mathematik.uni-ulm.de/~lehn/test_blaze/index.html
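
For completeness: the user-side code is identical in both cases; only the build decides which backend runs. A minimal sketch of such a timing run (the matrix size, the BLAZE_BLAS_MODE configuration switch and the MKL link line are assumptions, not the actual benchmark setup):

#include <blaze/Math.h>
#include <chrono>
#include <iostream>

int main()
{
    const std::size_t n = 2000;

    blaze::DynamicMatrix<double, blaze::rowMajor> A(n, n, 1.0);
    blaze::DynamicMatrix<double, blaze::rowMajor> B(n, n, 1.0);
    blaze::DynamicMatrix<double, blaze::rowMajor> C(n, n, 0.0);

    const auto t0 = std::chrono::high_resolution_clock::now();

    // If Blaze is configured with its BLAS mode enabled (BLAZE_BLAS_MODE)
    // and linked against MKL, this product is forwarded to dgemm;
    // otherwise Blaze's internal kernels are used.
    C = A * B;

    const auto t1 = std::chrono::high_resolution_clock::now();
    const double secs = std::chrono::duration<double>(t1 - t0).count();

    std::cout << 2.0*n*n*n / secs / 1e9 << " GFLOPS\n";
    return 0;
}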

At the moment the benchmarks for the internal BLAZE implementation of the matrix-matrix product look
poor.  I asked Klaus Iglberger (the author of BLAZE) to check the compiler flags I used.  So don’t take the
current results as-is.

Cheers,

Michael