Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-21 18:54:18


On 22 Jan 2016, at 00:30, Michael Lehn <michael.lehn_at_[hidden]> wrote:

>
> On 22 Jan 2016, at 00:28, nasos <nasos_i_at_[hidden]> wrote:
>
>> Michael,
>> please see below
>>
>> On 01/21/2016 05:23 PM, Michael Lehn wrote:
>>> Hi Nasos,
>>>
>>> first of all, I don’t want to take undue credit, so let me point out that this is not my algorithm. It is based on
>>>
>>> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>>
>>> https://github.com/flame/blis
>>>
>>> For a few cores (4-8) it can easily be made multithreaded. For many-core processors like the Intel Xeon Phi this is a bit more
>>> involved but still not too hard.
>> Setting up Phis is indeed an issue, especially because they are "locked" with icpc. OpenMP works properly though.
>>
>>> The demo I posted does not use micro-kernels that exploit SSE, AVX or
>>> FMA instructions. With such kernels the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote
>>> my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
>>>
>> If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive and the Intel compiler even more so.
>
> No, believe me. No chance to beat asm :-)

Have a look here

        http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html

For architectures with AVX or FMA it’s even more impressive. The asm micro-kernels do more than just exploit registers. If
you really want to achieve peak performance you actually have to do the math at the asm level. I can extend the page with SSE, AVX
and FMA micro-kernels tomorrow. The performance boost is significant: roughly a factor of 3 for SSE and a factor of 5 for FMA.
I attached a benchmark from my lecture (http://www.mathematik.uni-ulm.de/numerik/hpc/ws15/uebungen/index.html). It compares
GEMM performance on an Intel i5:

- "Blocked Session 8" is basically what I posted here
- "Blocked Session 8 + AVX" replaces "ugemm" with an asm implementation
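
For readers who have not looked at the demo, here is a minimal sketch of what a portable (non-asm) micro-kernel of this kind computes. It is an illustration only, not the code from the demo or from BLIS: the name ugemm_ref, the block sizes MR and NR, and the exact parameter list are assumptions. It updates one MR x NR block of C from packed panels of A and B; an SSE/AVX/FMA kernel would replace the inner loops with hand-scheduled vector instructions.

```cpp
#include <cstddef>

// Hypothetical register-block sizes; real kernels tune these per architecture.
constexpr std::size_t MR = 4;
constexpr std::size_t NR = 4;

// Reference micro-kernel (assumed interface): C <- beta*C + alpha*A*B
// for a single MR x NR block of C.
//   A: packed panel, kc consecutive columns of length MR
//   B: packed panel, kc consecutive rows of length NR
//   C: addressed as C[i*incRowC + j*incColC]
void ugemm_ref(std::size_t kc, double alpha,
               const double *A, const double *B,
               double beta,
               double *C, std::size_t incRowC, std::size_t incColC)
{
    // Local accumulator for A*B; small enough that a compiler
    // can keep it in registers.
    double AB[MR * NR] = {0.0};

    // kc rank-1 updates: AB += A(:,l) * B(l,:)
    for (std::size_t l = 0; l < kc; ++l) {
        for (std::size_t j = 0; j < NR; ++j) {
            for (std::size_t i = 0; i < MR; ++i) {
                AB[i + j * MR] += A[i] * B[j];
            }
        }
        A += MR;
        B += NR;
    }

    // Scale and write back. beta == 0 overwrites C without reading it,
    // so uninitialized values in C are never propagated.
    for (std::size_t j = 0; j < NR; ++j) {
        for (std::size_t i = 0; i < MR; ++i) {
            double &c = C[i * incRowC + j * incColC];
            c = (beta == 0.0 ? 0.0 : beta * c) + alpha * AB[i + j * MR];
        }
    }
}
```

The surrounding blocked algorithm only ever calls this one routine, which is why swapping it for an asm version changes performance without touching the rest of the code.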

But anyway, I think we need a common base for doing benchmarks, so that you and others can convince yourselves on your own hardware.
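
As a starting point for such a common base, a small timing harness might look like the sketch below. Everything in it is an assumption made for illustration: the names bench_gemm, BenchResult and gemm_naive, the column-major square operands, and the choice to report the best of a few runs. The naive triple loop is only a stand-in for whichever GEMM routine is actually being measured.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Stand-in for the routine under test: C <- A*B for column-major
// n x n matrices, written as a plain triple loop.
void gemm_naive(std::size_t n, const double *A, const double *B, double *C)
{
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            C[i + j * n] = 0.0;
        }
        for (std::size_t l = 0; l < n; ++l) {
            for (std::size_t i = 0; i < n; ++i) {
                C[i + j * n] += A[i + l * n] * B[l + j * n];
            }
        }
    }
}

struct BenchResult {
    double mflops;    // 2*n^3 flops divided by the best wall time
    double checksum;  // C(0,0), read back so the work cannot be discarded
};

// Runs `gemm` on n x n operands `repeats` times and reports the best run.
template <typename Gemm>
BenchResult bench_gemm(std::size_t n, Gemm gemm, int repeats = 3)
{
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
    double best = -1.0;  // best wall time in seconds, -1 means "none yet"

    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        gemm(n, A.data(), B.data(), C.data());
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) {
            best = s;
        }
    }

    double flops = 2.0 * n * n * n;
    double mflops = (best > 0.0) ? flops / best / 1e6 : 0.0;
    return BenchResult{mflops, C[0]};
}
```

Calling bench_gemm(n, gemm_naive) for a range of n, and then again with the blocked variant, gives numbers that are comparable across machines as long as the compiler flags are reported alongside them.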