On 22 Jan 2016, at 00:30, Michael Lehn <michael.lehn@uni-ulm.de> wrote:

On 22 Jan 2016, at 00:28, nasos <nasos_i@hotmail.com> wrote:

Michael,
please see below

On 01/21/2016 05:23 PM, Michael Lehn wrote:
Hi Nasos,

first of all, I don’t want to take undue credit and want to point out that this is not my algorithm. It is based on

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf

https://github.com/flame/blis

For a few cores (4-8) it can easily be made multithreaded. For many-core processors like the Intel Xeon Phi this is a bit more
sophisticated but still not too hard.
Setting up Phis is indeed an issue, especially because they are "locked" with icpc. OpenMP is working properly though.

The demo I posted does not use micro kernels that exploit SSE, AVX or
FMA instructions. With such kernels the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote
my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
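
To make the role of the micro kernel concrete, here is a minimal sketch of a plain C++ reference kernel of the kind that an SSE/AVX/FMA version would replace. It is not the code from the demo or from BLIS: the name ugemm_ref, the 4x4 register block (MR, NR) and the exact parameter list are assumptions for illustration. It only assumes the usual packed layout, where each k-step of A holds MR contiguous values and each k-step of B holds NR contiguous values.

```cpp
#include <cstddef>

// Register-block sizes. 4x4 is an illustrative assumption; real kernels tune these.
constexpr std::size_t MR = 4;
constexpr std::size_t NR = 4;

// Hypothetical reference micro-kernel computing C <- beta*C + alpha*A*B.
//   A: packed MR x kc panel, element (i,l) at A[l*MR + i]
//   B: packed kc x NR panel, element (l,j) at B[l*NR + j]
//   C: MR x NR block with row stride incRowC and column stride incColC
void ugemm_ref(std::size_t kc, double alpha,
               const double *A, const double *B,
               double beta,
               double *C, std::size_t incRowC, std::size_t incColC)
{
    double AB[MR * NR] = {};  // local accumulator, column-major

    // kc rank-1 updates: AB += A(:,l) * B(l,:)
    for (std::size_t l = 0; l < kc; ++l) {
        for (std::size_t j = 0; j < NR; ++j) {
            for (std::size_t i = 0; i < MR; ++i) {
                AB[i + j * MR] += A[l * MR + i] * B[l * NR + j];
            }
        }
    }

    // Scale and write back. beta == 0 overwrites C without reading it.
    for (std::size_t j = 0; j < NR; ++j) {
        for (std::size_t i = 0; i < MR; ++i) {
            double &c = C[i * incRowC + j * incColC];
            const double ab = alpha * AB[i + j * MR];
            c = (beta == 0.0) ? ab : beta * c + ab;
        }
    }
}
```

An optimized kernel keeps the same interface and only changes how the rank-1 updates in the first loop nest are carried out, which is why it can be swapped in without touching the blocking and packing code around it.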

If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive, and the Intel compiler even more so.

No, believe me. No chance to beat asm :-)

For architectures with AVX or FMA it's even more impressive. The asm micro kernels do more than just exploit registers. If
you really want to achieve peak performance you actually have to do the math at the asm level. I can extend the page with SSE, AVX
and FMA micro kernels tomorrow. The performance boost is significant: something like a factor of 3 for SSE and a factor of 5 for FMA.

I attached a benchmark from my lecture (http://www.mathematik.uni-ulm.de/numerik/hpc/ws15/uebungen/index.html). It compares
the GEMM performance on an Intel i5:

- "Blocked Session 8" is basically what I posted here

- "Blocked Session 8 + AVX" replaces "ugemm" with an asm implementation

But anyway, I think we need a common base for doing benchmarks, so you and others can convince yourselves on your own hardware.
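
As a starting point for such a common base, here is a rough sketch of a timing harness. It is only an assumption of what this could look like, not an existing benchmark: the names gemm_mflops and time_gemm are made up, and the callable is expected to take (m, n, k, A, B, C) with plain double pointers. The only fixed convention is the usual operation count of 2*m*n*k for a matrix product.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// MFLOPS for C <- A*B with A (m x k) and B (k x n): 2*m*n*k operations.
double gemm_mflops(std::size_t m, std::size_t n, std::size_t k, double seconds)
{
    return 2.0 * m * n * k / seconds / 1e6;
}

// Hypothetical harness: runs gemm(m, n, k, A, B, C) several times and
// returns the best wall-clock time in seconds.
template <typename Gemm>
double time_gemm(Gemm gemm, std::size_t m, std::size_t n, std::size_t k,
                 int repeats = 3)
{
    std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);
    double best = std::numeric_limits<double>::max();

    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        gemm(m, n, k, A.data(), B.data(), C.data());
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (s < best) {
            best = s;
        }
    }
    return best;
}
```

Passing each implementation (plain blocked, blocked with asm kernel, MKL, and so on) as the callable and reporting gemm_mflops over a range of sizes would give numbers that are comparable across machines. Taking the best of several runs is a design choice to reduce noise; averaging would work as well.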