From: nasos <nasos_i@hotmail.com>
To: ublas@lists.boost.org
Subject: Re: [ublas] Matrix multiplication performance
Message-ID: <BLU436-SMTP2214ADFAE38F3F9E1812E2999C30@phx.gbl>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"
Michael,
please see below
On 01/21/2016 05:23 PM, Michael Lehn wrote:
> Hi Nasos,
>
> first of all I don't want to take undeserved credit and want to point out
> that this is not my algorithm. It is based on
>
> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>
> https://github.com/flame/blis
>
> For a few cores (4-8) it can easily be made multithreaded. For
> many-core processors like the Intel Xeon Phi this is a bit more
> sophisticated, but still not too hard.
Setting up Phis is indeed an issue, especially because they are "locked"
to icpc. OpenMP is working properly, though.
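To illustrate the multithreading point: a blocked matrix product can be
parallelized over its outermost block loop with a single OpenMP pragma. This is
only a minimal sketch of the general technique, not the actual ugemm/BLIS code
discussed in this thread; the block size and loop structure here are
illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch: row-major n x n product C += A * B, blocked for cache reuse.
// The outer block loop parallelizes cleanly because each ib-block of C
// is written by exactly one thread. The pragma is ignored (the code runs
// serially and still gives correct results) if compiled without -fopenmp.
void gemm_blocked(int n, const double* A, const double* B, double* C)
{
    const int BS = 64;  // illustrative block size, not a tuned value
    #pragma omp parallel for
    for (int ib = 0; ib < n; ib += BS)
        for (int kb = 0; kb < n; kb += BS)
            for (int jb = 0; jb < n; jb += BS)
                for (int i = ib; i < std::min(ib + BS, n); ++i)
                    for (int k = kb; k < std::min(kb + BS, n); ++k) {
                        const double a = A[i * n + k];
                        for (int j = jb; j < std::min(jb + BS, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Compile with e.g. `g++ -O3 -fopenmp` to get the threaded version; scaling to
4-8 cores is usually straightforward this way, as Michael notes.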
> The demo I posted does not use micro-kernels that exploit SSE, AVX or
> FMA instructions. With those the matrix product is on par with Intel
> MKL, just like BLIS. For my platforms I wrote
> my own micro-kernels, but the interface of the ugemm function is
> compatible with BLIS.
>
If you compile with -O3 I think you are getting near-optimal SSE
vectorization. gcc is truly impressive here, and icpc even more so.
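The kind of loop the compiler auto-vectorizes well at -O3 is the innermost
rank-1 update of a gemm micro-kernel: fixed trip counts and contiguous stores.
This is a generic sketch of that pattern, assuming hypothetical panel sizes
MR/NR; it is not the actual ugemm kernel from the thread.

```cpp
#include <cassert>

// Illustrative micro-kernel panel sizes (assumptions, not ugemm's values).
constexpr int MR = 4;
constexpr int NR = 8;

// Rank-1 update C += alpha * a * b^T over an MR x NR panel (row-major C).
// The inner j-loop has unit-stride loads of b and unit-stride stores of C,
// so gcc/icpc at -O3 can map it to SSE/AVX packed multiply-adds.
void micro_update(double alpha, const double* a, const double* b, double* C)
{
    for (int i = 0; i < MR; ++i) {
        const double ai = alpha * a[i];  // invariant across the inner loop
        for (int j = 0; j < NR; ++j)
            C[i * NR + j] += ai * b[j];
    }
}
```

Checking the generated assembly (`g++ -O3 -S`, or `-fopt-info-vec`) shows
whether the loop was actually vectorized on a given platform.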