Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-21 18:30:42
On 22 Jan 2016, at 00:28, nasos <nasos_i_at_[hidden]> wrote:
> Michael,
> please see below
>
> On 01/21/2016 05:23 PM, Michael Lehn wrote:
>> Hi Nasos,
>>
>> first of all, I don't want to take undue credit and want to point out that this is not my algorithm. It is based on
>>
>> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>
>> https://github.com/flame/blis
>>
>> For a few cores (4-8) it can easily be made multithreaded. For many-core processors like the Intel Xeon Phi this is a bit more
>> sophisticated but still not too hard.
> Setting up Phis is indeed an issue, especially because they are "locked" with icpc. OpenMP is working properly though.
>
>> The demo I posted does not use micro-kernels that exploit SSE, AVX or
>> FMA instructions. With such micro-kernels the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote
>> my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
>>
> If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive and Intel's compiler is even more so.
No, believe me. No chance to beat asm :-)
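
To make the discussion concrete, here is a minimal sketch of what a plain C++ (non-asm) micro-kernel in the BLIS style might look like. The name `ugemm_ref`, the block sizes `MR` and `NR`, and the exact signature are assumptions for illustration only; this is not the code from the posted demo or from BLIS. It assumes the panels of A and B have already been packed so that each of the `kc` steps holds `MR` (respectively `NR`) contiguous values.

```cpp
#include <cstddef>

// Hypothetical register-block sizes; real values are tuned per architecture.
constexpr std::size_t MR = 4;
constexpr std::size_t NR = 4;

// Reference micro-kernel sketch: C <- beta*C + alpha*A*B
//   A: packed MR x kc panel (MR contiguous values per step)
//   B: packed kc x NR panel (NR contiguous values per step)
//   C: MR x NR block addressed via row stride incRowC and column stride incColC
void ugemm_ref(std::size_t kc, double alpha,
               const double* A, const double* B,
               double beta,
               double* C, std::size_t incRowC, std::size_t incColC)
{
    // Local accumulator for the MR x NR product, stored column by column.
    double AB[MR * NR] = {0.0};

    // Sequence of kc rank-1 updates: AB += a_l * b_l^T
    for (std::size_t l = 0; l < kc; ++l) {
        for (std::size_t j = 0; j < NR; ++j) {
            for (std::size_t i = 0; i < MR; ++i) {
                AB[i + j * MR] += A[i] * B[j];
            }
        }
        A += MR;
        B += NR;
    }

    // Scale C by beta (overwriting it when beta == 0) and add alpha * AB.
    for (std::size_t j = 0; j < NR; ++j) {
        for (std::size_t i = 0; i < MR; ++i) {
            double& c = C[i * incRowC + j * incColC];
            c = (beta == 0.0 ? 0.0 : beta * c) + alpha * AB[i + j * MR];
        }
    }
}
```

The inner loops have fixed trip counts, which is the kind of code a compiler can try to auto-vectorize at -O3. A hand-written SSE/AVX/FMA kernel would replace only this function and keep the same calling convention.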