Ublas :

Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-28 12:08:03

Next message: palik imre: "Re: [ublas] Matrix multiplication performance"
Previous message: Michael Lehn: "Re: [ublas] Matrix multiplication performance"
In reply to: Michael Lehn: "Re: [ublas] Matrix multiplication performance"
Next in thread: Michael Lehn: "Re: [ublas] Matrix multiplication performance"
Reply: Michael Lehn: "Re: [ublas] Matrix multiplication performance"

Wow:
$ cd session4
$ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
matprod.cc: In function ‘double estimateGemmResidual(const MA&, const MB&, const MC0&, const MC1&)’:
matprod.cc:94:40: warning: typedef ‘TC0’ locally defined but not used [-Wunused-local-typedefs]
    typedef typename MC0::value_type  TC0;
                                       ^
$ ./a.out
#   m    n    k   uBLAS: t1      MFLOPS    Blocked: t2      MFLOPS      Diff nrm1
  100  100  100 0.000465896      4292.8     0.00405376     493.369              0
  200  200  200   0.0046712     3425.24     0.00244891     6533.51              0
  300  300  300   0.0161551     3342.59     0.00314843     17151.4    1.94151e-16
  400  400  400    0.032688     3915.81     0.00240584       53204    8.40477e-17
  500  500  500   0.0544033     4595.31     0.00291218     85846.4    3.52062e-17
  600  600  600   0.0838317     5153.18     0.00453539     95250.9    1.63452e-17
  700  700  700    0.130899     5240.69     0.00585201      117225    8.44769e-18
  800  800  800    0.201121     5091.46       0.010872     94186.5    4.72429e-18
  900  900  900    0.286407     5090.65      0.0151735     96088.7    2.80443e-18
 1000 1000 1000    0.432211     4627.37      0.0707567     28265.9     1.7626e-18
 1100 1100 1100    0.511146      5207.9      0.0186911      142420    1.14521e-18
 1200 1200 1200    0.666975      5181.6       0.025109      137640    7.79963e-19
 1300 1300 1300    0.863769     5087.01      0.0283398      155047    5.45468e-19
 1400 1400 1400     1.09638     5005.57       0.143209     38321.6    3.90302e-19
 1500 1500 1500     1.40352     4809.33       0.120096     56204.9     2.8667e-19
$ cd ../session2
$ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
[ec2-user_at_ip-10-0-46-255 session2]$ ./a.out
#   m    n    k   uBLAS: t1      MFLOPS    Blocked: t2      MFLOPS      Diff nrm1
  100  100  100 0.000471888     4238.29    0.000231317     8646.14              0
  200  200  200  0.00431625     3706.92     0.00121122     13209.9              0
  300  300  300   0.0153292     3522.69     0.00336464     16049.3    1.07937e-06
  400  400  400   0.0317138      4036.1     0.00712568     17963.2    4.06488e-06
  500  500  500    0.052809     4734.04      0.0121626     20554.9    9.09947e-06
  600  600  600   0.0828121     5216.63       0.020657       20913    1.65243e-05
  700  700  700    0.131053     5234.51      0.0318276     21553.6    2.71365e-05
  800  800  800    0.196825     5202.58      0.0482679     21214.9    4.12109e-05
  900  900  900    0.281006     5188.51      0.0671323     21718.3    5.93971e-05
 1000 1000 1000    0.386332     5176.89      0.0906054     22073.7    8.23438e-05
 1100 1100 1100     0.51667     5152.22       0.124346     21408.1    0.000109566
 1200 1200 1200    0.668425     5170.37       0.159701     21640.5    0.000142817
 1300 1300 1300    0.860445     5106.66       0.203472     21595.1    0.000182219
 1400 1400 1400     1.08691     5049.18       0.249427     22002.4    0.000226999
 1500 1500 1500     1.38244     4882.67       0.307519     21949.9    0.000280338

This is on a Haswell machine.

On Wednesday, 27 January 2016, 1:39, Michael Lehn <michael.lehn_at_[hidden]> wrote:

On 23 Jan 2016, at 18:53, palik imre <imre_palik_at_[hidden]> wrote:

Hi All,
what's next? I mean, what is the development process for uBLAS?
Now we have a C-like implementation that outperforms both the mainline and the branch version (axpy_prod). What will we do with that?
As far as I can see, we have the following options:
1) Create a C++ template magic implementation out of it. But for this, at the very least we would need compile-time access to the target instruction set. Any idea how to do that?
2) Create a compiled library implementation out of it, and choose the implementation at run time based on the CPU capabilities.
3) Include some good defaults/defines, and hope the user will use them.
4) Don't include it, and do something completely different.

What do you think?

At the moment I am not certain, but pretty sure, that you don't have to rewrite uBLAS to get good performance. uBLAS is a pretty fine library and already has what you need. So just extend it. At least for 383 lines of code :-)
As I was pretty busy the last few days, I could not continue until today. I made some minimal modifications to adapt the GEMM implementation so that it takes advantage of uBLAS:
- The GEMM frame algorithm and the pack functions now work with any uBLAS matrix that supports element access through the notation A(i,j).
- Only matrix C in the operation C <- beta*C + alpha*A*B is required to be stored row-major or col-major. The other matrices can be expressions. Here I need some help to get rid of these ugly lines of code:
    TC *C_ = &C(0,0);
    const size_type incRowC = &C(1,0) - &C(0,0);
    const size_type incColC = &C(0,1) - &C(0,0);
- Things like the following should work without performance penalty (and in particular without creating temporaries):
    blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate)
- Matrices A and B can be symmetric or whatever. Of course, if A or B is triangular, a specialized implementation can take advantage of that.
- The storage format does not matter. You can mix row-major and col-major, packed, …
Here is the page for this:
http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session4/page01.html
Please note that creating the benchmarks for the symmetric matrix-matrix product takes really long, because the current uBLAS implementation seems to be much slower for symmetric than for general matrices. So I reduced the problem size for now. It would be helpful if some people could reproduce the benchmarks and also check different cases:
- different expressions
- different element types, e.g. A with floats, B with doubles, etc. At the moment there is only a micro kernel for double. The performance depends on the common type of A and B. So with complex<..> the reference implementation is used. But the performance should at least be constant in this case.
It would be particularly nice to have benchmarks from people with an Intel Sandy Bridge or Haswell, as the micro kernels are optimized for these architectures. If there is interest, I can extend the benchmarks to compare with Intel MKL. For Linux there is a free non-commercial version available.

Cheers,
Michael
