Boost logo

Ublas :

Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-28 12:41:03


Also the parallelisation with openmp is done pretty cheap and simple at the moment. So you also
might want to check how it scales by

 export OMP_NUM_THREADS=2; ./a.out
 export OMP_NUM_THREADS=4; ./a.out
 export OMP_NUM_THREADS=6; ./a.out
...

On 28 Jan 2016, at 18:34, Michael Lehn <michael.lehn_at_[hidden]> wrote:

> Could you also compile with -DHAVE_FMA? However, I noticed that I messed up the file “fma.hpp” when
> I created the page. So you should replace it with
>
> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/fma.hpp
>
> At the moment I regenerate the pages on the website
>
>
> On 28 Jan 2016, at 18:08, palik imre <imre_palik_at_[hidden]> wrote:
>
>> Wow:
>>
>> $ cd session4
>> $ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
>> matprod.cc: In function ���double estimateGemmResidual(const MA&, const MB&, const MC0&, const MC1&)���:
>> matprod.cc:94:40: warning: typedef ���TC0��� locally defined but not used [-Wunused-local-typedefs]
>> typedef typename MC0::value_type TC0;
>> ^
>> $ ./a.out
>> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
>> 100 100 100 0.000465896 4292.8 0.00405376 493.369 0
>> 200 200 200 0.0046712 3425.24 0.00244891 6533.51 0
>> 300 300 300 0.0161551 3342.59 0.00314843 17151.4 1.94151e-16
>> 400 400 400 0.032688 3915.81 0.00240584 53204 8.40477e-17
>> 500 500 500 0.0544033 4595.31 0.00291218 85846.4 3.52062e-17
>> 600 600 600 0.0838317 5153.18 0.00453539 95250.9 1.63452e-17
>> 700 700 700 0.130899 5240.69 0.00585201 117225 8.44769e-18
>> 800 800 800 0.201121 5091.46 0.010872 94186.5 4.72429e-18
>> 900 900 900 0.286407 5090.65 0.0151735 96088.7 2.80443e-18
>> 1000 1000 1000 0.432211 4627.37 0.0707567 28265.9 1.7626e-18
>> 1100 1100 1100 0.511146 5207.9 0.0186911 142420 1.14521e-18
>> 1200 1200 1200 0.666975 5181.6 0.025109 137640 7.79963e-19
>> 1300 1300 1300 0.863769 5087.01 0.0283398 155047 5.45468e-19
>> 1400 1400 1400 1.09638 5005.57 0.143209 38321.6 3.90302e-19
>> 1500 1500 1500 1.40352 4809.33 0.120096 56204.9 2.8667e-19
>> $ cd ../session2
>> $ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
>> [ec2-user_at_ip-10-0-46-255 session2]$ ./a.out
>> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
>> 100 100 100 0.000471888 4238.29 0.000231317 8646.14 0
>> 200 200 200 0.00431625 3706.92 0.00121122 13209.9 0
>> 300 300 300 0.0153292 3522.69 0.00336464 16049.3 1.07937e-06
>> 400 400 400 0.0317138 4036.1 0.00712568 17963.2 4.06488e-06
>> 500 500 500 0.052809 4734.04 0.0121626 20554.9 9.09947e-06
>> 600 600 600 0.0828121 5216.63 0.020657 20913 1.65243e-05
>> 700 700 700 0.131053 5234.51 0.0318276 21553.6 2.71365e-05
>> 800 800 800 0.196825 5202.58 0.0482679 21214.9 4.12109e-05
>> 900 900 900 0.281006 5188.51 0.0671323 21718.3 5.93971e-05
>> 1000 1000 1000 0.386332 5176.89 0.0906054 22073.7 8.23438e-05
>> 1100 1100 1100 0.51667 5152.22 0.124346 21408.1 0.000109566
>> 1200 1200 1200 0.668425 5170.37 0.159701 21640.5 0.000142817
>> 1300 1300 1300 0.860445 5106.66 0.203472 21595.1 0.000182219
>> 1400 1400 1400 1.08691 5049.18 0.249427 22002.4 0.000226999
>> 1500 1500 1500 1.38244 4882.67 0.307519 21949.9 0.000280338
>>
>> This is on Haswell
>>
>>
>> On Wednesday, 27 January 2016, 1:39, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>>
>>
>>
>> On 23 Jan 2016, at 18:53, palik imre <imre_palik_at_[hidden]> wrote:
>>
>>> Hi All,
>>>
>>> what's next? I mean what is the development process for ublas?
>>>
>>> Now we have a C-like implementation that outperforms both the mainline, and the branch version (axpy_prod). What will we do with that?
>>>
>>> As far as I see we have the following options:
>>>
>>> 1) Create a C++ template magic implementation out of it. But for this, at the least we would need compile-time access to the target instruction set. Any idea how to do that?
>>>
>>> 2) Create a compiled library implementation out of it, and choose the implementation run-time based on the CPU capabilities.
>>>
>>> 3) Include some good defaults/defines, and hope the user will use them.
>>>
>>> 4) Don't include it, and do something completely different.
>>>
>>>
>>> What do you think?
>>
>>
>> At the moment I am not sure, but pretty sure, that you don’t have to rewrite uBLAS to support good performance. uBLAS is a pretty fine
>> library and already has what you need. So just extend it. At least for 383 lines of code :-)
>>
>> As I was pretty busy the last days, so I could not continue until today. I made some minimal modifications to adopt the GEMM implementation
>> to take advantage of uBLAS:
>>
>> - The GEMM frame algorithm and the pack functions now work with any uBLAS matrix that supports element access through the notation A(i,j)
>> - Only for matrix C in the operation C <- beta*C + alpha*A*B it requires that the matrix is stored row-major or col-major. The other matrices can
>> be expressions. Here I need some help to get rid of this ugly lines of code:
>>
>> TC *C_ = &C(0,0);
>> const size_type incRowC = &C(1,0) - &C(0,0);
>> const size_type incColC = &C(0,1) - &C(0,0);
>>
>> - Things like the following should work without performance penalty (and in particularly without creating temporaries):
>>
>> blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate)
>>
>> - And matrices A and B can be symmetric or whatever. Of course if A or B is triangular a specialized implementation can take advantage
>> - The storage format does not matter. You can mix row-major and col-major, packed, …
>>
>> Here is the page for this:
>>
>> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session4/page01.html
>>
>> Please note, creating the benchmarks for the symmetric matrix-matrix product really takes long because the current uBLAS implementation seems
>> to be much slower than for general matrices. So I reduced the problem size for now. It would be helpful if some people could reproduce the benchmarks
>> and also check different cases:
>>
>> - different expressions
>> - different element types, e.g. A with floats, B with double etc. At the moment there is only a micro kernel for double. The performance depends on the
>> common type of A and B. So with complex<..> the reference implementation. But the performance should at least be constant in this case.
>>
>> It would in particular be nice to have benchmarks from people with an Intel Sandybridge or Haswell as the micro kernels are optimized for these
>> architectures. If interested I can extend the benchmarks to compare with Intel MKL. For Linux there is a free non-commercial version available.
>>
>>
>> Cheers,
>>
>> Michael
>>
>>
>>
>>
>>
>>
>>
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: michael.lehn_at_[hidden]