Boost logo

Ublas :

Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-22 09:09:04


Hi Imre,

thanks for running the benchmarks. Of course you are right that using aligned memory for the buffers improves
performance. I also did not really put any effort in optimising the parameters MC, NC, KC, MR and NR. I will
compare different variants and report them on the website

        http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html

I modified my benchmark program such that it also computes the FLOPS as

        FLOPS = 2*m*n*k/time_elpased

See

        http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc

Could you re-run your benchmarks and post the different MFLOPS you get? That is important for actually tuning thing.
On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS instead of 25.6 GFLOPS). So
a speedup of 2.5 would be impressive but still far from peak performance.

Cheers,

Michael

On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:

> Sorry for posting twice more or less the same thing. I got confused with javascript interfaces.
>
> It seems I also forgot to enable avx for my last measurements. With that + my blocking and alignment changes, performance according to my tests is something like 250% higher than running Michael's original code (with avx).
>
> Cheers,
>
> Imre
>
>
> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
>
>
> Hi Michael,
>
> your blocksizes are far from optimal. MR & NR should be multiples of the L1 cache line size (i.e. 16 for double on Intel). Also, the blocks should be allocated aligned to L1 cache lines (e.g., via posix_memalign()).
>
> This alone brought something like 50% speedup for my square matrix test.
>
> I will have a look at the other parameters + the whole thing via perf during the weekend.
>
> Cheers,
>
> Imre
>
>
>
> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]" <ublas-request_at_[hidden]> wrote:
>
>
> Subject: Re: [ublas] Matrix multiplication performance
> Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B_at_[hidden]>
> Content-Type: text/plain; charset="windows-1252"
>
> Hi Nasos,
>
> first of all I don?t want to take wrong credits and want to point out that this is not my algorithm. It is based on
>
> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>
> https://github.com/flame/blis
>
> For a few cores (4-8) it can easily made multithreaded. For many-cores like Intel Xeon Phi this is a bit more
> sophisticated but still not too hard. The demo I posted does not use micro kernels that exploit SSE, AVX or
> FMA instructions. With that the matrix product is on par with Intel MKL. Just like BLIS. For my platforms I wrote
> my own micro-kernels but the interface of function ugemm is compatible to BLIS.
>
> Maybe you could help me to integrate your code in the benchmark example I posted above.
>
> About Blaze: Do they have their own implementation of a matrix-matrix product? It seems to require a
> tuned BLAS implementation (?Otherwise you get only poor performance?) for the matrix-matrix product.
> IMHO they only have tuned the ?easy? stuff like BLAS Level1 and Level2. In that case it makes more
> sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong,
> let me know.
>
> About the block size: In my experience you get better performance if you chose them dynamically at runtime
> depending on the problem size. Depending on the architecture you can just specify ranges like 256 - 384 for
> blocking factor MC. In my code it then also needs to satisfy the restriction that it can be divided by factor MR.
> I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible and having
> blocks of similar sizes easily pays of.
>
> Cheers,
>
> Michael
> *************************************
>
>
>
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: michael.lehn_at_[hidden]