Subject: Re: [ublas] Matrix multiplication performance

Hi Nasos,
first of all I don't want to take credit that is not mine, so let me point out that this is not my algorithm. It is based on
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
https://github.com/flame/blis
For a few cores (4-8) it can easily be made multithreaded. For many-core processors like the Intel Xeon Phi this is a bit more involved, but still not too hard. The demo I posted does not use micro-kernels that exploit SSE, AVX or FMA instructions. With such kernels the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
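Roughly, such a micro-kernel computes one MR x NR block of C from packed panels of A and B. Here is a plain C++ sketch of that idea (the name ugemm_ref, the template parameters and the exact argument list are just for illustration, not the actual ugemm interface):

    // Illustrative scalar reference micro-kernel in the BLIS style.
    // It updates an MR x NR block of C: C := beta*C + alpha*A*B, where
    // A is a packed MR x kc panel and B is a packed kc x NR panel.
    // A real micro-kernel replaces these loops with SSE/AVX/FMA code.
    template <typename T, int MR, int NR>
    void ugemm_ref(int kc, T alpha, const T* A, const T* B,
                   T beta, T* C, int rsC, int csC)
    {
        T AB[MR * NR] = {};                    // local accumulator

        for (int l = 0; l < kc; ++l)           // rank-1 updates over k
            for (int j = 0; j < NR; ++j)
                for (int i = 0; i < MR; ++i)
                    AB[i + j * MR] += A[i + l * MR] * B[j + l * NR];

        for (int j = 0; j < NR; ++j)           // scale and write back C
            for (int i = 0; i < MR; ++i)
                C[i * rsC + j * csC] = beta * C[i * rsC + j * csC]
                                     + alpha * AB[i + j * MR];
    }

The row and column strides rsC and csC of C are what make the same kernel usable for row-major and column-major storage.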
Maybe you could help me integrate your code into the benchmark example I posted above.
About Blaze: do they have their own implementation of a matrix-matrix product? It seems to require a tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product. IMHO they have only tuned the "easy" stuff like BLAS Level 1 and Level 2. In that case it makes more sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong, let me know.
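If you want that baseline number, timing the underlying dgemm directly is straightforward; a minimal sketch using the standard CBLAS interface (the problem size is a placeholder, and you link against whatever BLAS Blaze was configured with):

    // Minimal baseline: time the underlying dgemm via the CBLAS interface.
    #include <cblas.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n = 2000;                       // placeholder problem size
        std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

        auto t0 = std::chrono::steady_clock::now();
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
        auto t1 = std::chrono::steady_clock::now();

        double s = std::chrono::duration<double>(t1 - t0).count();
        std::printf("dgemm: %.2f GFLOPS\n", 2.0 * n * n * n / (s * 1e9));
    }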
About the block sizes: in my experience you get better performance if you choose them dynamically at runtime depending on the problem size. Depending on the architecture you can just specify ranges like 256-384 for the blocking factor MC. In my code MC then also has to satisfy the restriction that it is divisible by the factor MR. I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible, and having blocks of similar sizes easily pays off.
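As a rough illustration of what I mean (the helper name and the exact numbers are made up for this example, not taken from my code):

    // Sketch: choose the blocking factor MC at runtime so that the m
    // dimension splits into blocks of nearly equal size, MC stays close
    // to an architecture-dependent upper bound (e.g. 384), and the
    // result is divisible by MR.
    inline int choose_mc(int m, int mr, int mc_max = 384)
    {
        int blocks = (m + mc_max - 1) / mc_max;   // blocks of at most mc_max
        int mc     = (m + blocks - 1) / blocks;   // nearly equal block size
        return ((mc + mr - 1) / mr) * mr;         // round up to a multiple of MR
    }

For m = 400 and MR = 4 this gives MC = 200, i.e. two blocks of 200 rows each instead of one block of 384 plus a leftover of 16.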
Cheers,
Michael