Subject: Re: [ublas] Matrix multiplication performance
From: Nasos Iliopoulos (nasos_i_at_[hidden])
Date: 2016-01-22 08:05:49
Palik,
what was the full compile line you used? g++ -Ofast -mavx ...?
-Nasos
On 01/22/2016 05:03 AM, palik imre wrote:
> Sorry for posting more or less the same thing twice. I got confused
> by the JavaScript interface.
>
> It seems I also forgot to enable AVX for my last measurements. With
> that plus my blocking and alignment changes, performance according to
> my tests is something like 250% higher than running Michael's original
> code (with AVX).
>
> Cheers,
>
> Imre
>
>
> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]>
> wrote:
>
>
> Hi Michael,
>
> your block sizes are far from optimal. MR & NR should be multiples of
> the L1 cache line size (i.e., 16 for double on Intel). Also, the
> blocks should be allocated aligned to L1 cache lines (e.g., via
> posix_memalign(); see the sketch below).
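>
> Something like this is what I mean by the aligned allocation (untested
> sketch; the helper name is made up, and 64 is the Intel cache line
> size in bytes):
>
>     #include <stdlib.h>   // posix_memalign, free
>
>     // Allocate n doubles aligned to a 64-byte L1 cache line.
>     static double *alloc_aligned(size_t n)
>     {
>         void *p = nullptr;
>         if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
>             return nullptr;
>         return static_cast<double *>(p);
>     }
>
> The packing buffers for the A and B blocks then get allocated with
> this instead of new[], and released with free().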
>
> This alone brought something like a 50% speedup in my square-matrix test.
>
> I will have a look at the other parameters, and run the whole thing
> through perf, during the weekend.
>
> Cheers,
>
> Imre
>
>
>
> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]"
> <ublas-request_at_[hidden]> wrote:
>
>
> Subject: Re: [ublas] Matrix multiplication performance
>
> Hi Nasos,
>
> first of all, I don't want to take credit that isn't due, and want to
> point out that this is not my algorithm. It is based on
>
> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>
> https://github.com/flame/blis
>
> For a few cores (4-8) it can easily be made multithreaded. For
> many-core processors like the Intel Xeon Phi this is a bit more
> sophisticated, but still not too hard. The demo I posted does not use
> micro-kernels that exploit SSE, AVX or FMA instructions. With such
> kernels the matrix product is on par with Intel MKL, just like BLIS.
> For my platforms I wrote my own micro-kernels, but the interface of
> the ugemm function is compatible with BLIS.
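>
> To give an idea, here is a toy 4x4 double micro-kernel using FMA
> intrinsics (just a sketch, not my actual ugemm; it assumes the panels
> are packed as in BLIS, with the A panel 32-byte aligned, and compiles
> with g++ -O3 -mavx2 -mfma):
>
>     #include <immintrin.h>
>
>     // C (column-major, leading dimension ldC) is updated from a
>     // packed 4 x k panel of A and a packed k x 4 panel of B:
>     // C = beta*C + alpha*A*B
>     void ugemm_4x4(long k, double alpha, const double *A,
>                    const double *B, double beta, double *C, long ldC)
>     {
>         __m256d c[4] = { _mm256_setzero_pd(), _mm256_setzero_pd(),
>                          _mm256_setzero_pd(), _mm256_setzero_pd() };
>         for (long l = 0; l < k; ++l) {
>             __m256d a = _mm256_load_pd(A + 4*l); // one column of the A panel
>             for (int j = 0; j < 4; ++j)          // fused rank-1 update
>                 c[j] = _mm256_fmadd_pd(a, _mm256_set1_pd(B[4*l + j]), c[j]);
>         }
>         for (int j = 0; j < 4; ++j) {
>             __m256d cj = _mm256_loadu_pd(C + j*ldC);
>             cj = _mm256_add_pd(_mm256_mul_pd(_mm256_set1_pd(beta), cj),
>                                _mm256_mul_pd(_mm256_set1_pd(alpha), c[j]));
>             _mm256_storeu_pd(C + j*ldC, cj);
>         }
>     }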
>
> Maybe you could help me integrate your code into the benchmark
> example I posted above.
>
> About Blaze: Do they have their own implementation of a matrix-matrix
> product? It seems to require a tuned BLAS implementation ("Otherwise
> you get only poor performance") for the matrix-matrix product. IMHO
> they have only tuned the "easy" stuff like BLAS Level 1 and Level 2.
> In that case it makes more sense to compare the performance with the
> actual underlying GEMM implementation. But if I am wrong, let me know.
>
> About the block sizes: In my experience you get better performance if
> you choose them dynamically at runtime, depending on the problem size.
> Depending on the architecture you can just specify ranges, like
> 256-384 for the blocking factor MC. In my code it then also needs to
> satisfy the restriction that it is divisible by the factor MR. I know
> that doing things at compile time is a C++ fetish. But the runtime
> overhead is negligible, and having blocks of similar sizes easily
> pays off.
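>
> As a sketch (the function name is made up, and the 256-384 range is
> just the example from above), one can pick, among the multiples of MR
> in the range, the MC that leaves the least slack in the last block:
>
>     #include <climits>
>
>     // Choose the blocking factor MC in [MC_min, MC_max], divisible by
>     // MR, so that the last row block of an m-row matrix is as full as
>     // possible (i.e., all blocks end up with similar sizes).
>     long choose_MC(long m, long MR, long MC_min = 256, long MC_max = 384)
>     {
>         long first = ((MC_min + MR - 1) / MR) * MR; // smallest multiple of MR >= MC_min
>         long best = first, best_slack = LONG_MAX;
>         for (long MC = first; MC <= MC_max; MC += MR) {
>             long blocks = (m + MC - 1) / MC;  // number of row blocks
>             long slack  = blocks * MC - m;    // padding in the last block
>             if (slack < best_slack) { best_slack = slack; best = MC; }
>         }
>         return best;
>     }
>
> The loop runs only a handful of times per GEMM call, so the overhead
> really is negligible.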
>
> Cheers,
>
> Michael
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: nasos_i_at_[hidden]