|
Ublas : |
Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-28 12:08:03
Wow:
$ cd session4
$ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
matprod.cc: In function ���double estimateGemmResidual(const MA&, const MB&, const MC0&, const MC1&)���:
matprod.cc:94:40: warning: typedef ���TC0��� locally defined but not used [-Wunused-local-typedefs]
    typedef typename MC0::value_type  TC0;
                                       ^
$ ./a.out
#  m    n    k uBLAS:  t1      MFLOPS  Blocked:  t2     MFLOPS       Diff nrm1
 100  100  100 0.000465896      4292.8     0.00405376     493.369              0
 200  200  200   0.0046712     3425.24     0.00244891     6533.51              0
 300  300  300   0.0161551     3342.59     0.00314843     17151.4    1.94151e-16
 400  400  400    0.032688     3915.81     0.00240584       53204    8.40477e-17
 500  500  500   0.0544033     4595.31     0.00291218     85846.4    3.52062e-17
 600  600  600   0.0838317     5153.18     0.00453539     95250.9    1.63452e-17
 700  700  700    0.130899     5240.69     0.00585201      117225    8.44769e-18
 800  800  800    0.201121     5091.46       0.010872     94186.5    4.72429e-18
 900  900  900    0.286407     5090.65      0.0151735     96088.7    2.80443e-18
 1000 1000 1000    0.432211     4627.37      0.0707567     28265.9     1.7626e-18
 1100 1100 1100    0.511146      5207.9      0.0186911      142420    1.14521e-18
 1200 1200 1200    0.666975      5181.6       0.025109      137640    7.79963e-19
 1300 1300 1300    0.863769     5087.01      0.0283398      155047    5.45468e-19
 1400 1400 1400     1.09638     5005.57       0.143209     38321.6    3.90302e-19
 1500 1500 1500     1.40352     4809.33       0.120096     56204.9     2.8667e-19
$ cd ../session2
$ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
[ec2-user_at_ip-10-0-46-255 session2]$ ./a.out
#  m    n    k uBLAS:  t1      MFLOPS  Blocked:  t2     MFLOPS       Diff nrm1
 100  100  100 0.000471888     4238.29    0.000231317     8646.14              0
 200  200  200  0.00431625     3706.92     0.00121122     13209.9              0
 300  300  300   0.0153292     3522.69     0.00336464     16049.3    1.07937e-06
 400  400  400   0.0317138      4036.1     0.00712568     17963.2    4.06488e-06
 500  500  500    0.052809     4734.04      0.0121626     20554.9    9.09947e-06
 600  600  600   0.0828121     5216.63       0.020657       20913    1.65243e-05
 700  700  700    0.131053     5234.51      0.0318276     21553.6    2.71365e-05
 800  800  800    0.196825     5202.58      0.0482679     21214.9    4.12109e-05
 900  900  900    0.281006     5188.51      0.0671323     21718.3    5.93971e-05
 1000 1000 1000    0.386332     5176.89      0.0906054     22073.7    8.23438e-05
 1100 1100 1100     0.51667     5152.22       0.124346     21408.1    0.000109566
 1200 1200 1200    0.668425     5170.37       0.159701     21640.5    0.000142817
 1300 1300 1300    0.860445     5106.66       0.203472     21595.1    0.000182219
 1400 1400 1400     1.08691     5049.18       0.249427     22002.4    0.000226999
 1500 1500 1500     1.38244     4882.67       0.307519     21949.9    0.000280338
This is on Haswell
On Wednesday, 27 January 2016, 1:39, Michael Lehn <michael.lehn_at_[hidden]> wrote:
On 23 Jan 2016, at 18:53, palik imre <imre_palik_at_[hidden]> wrote:
Hi All,
what's next? I mean what is the development process for ublas?
Now we have a C-like implementation that outperforms both the mainline, and the branch version (axpy_prod). What will we do with that?
As far as I see we have the following options:
1) Create a C++ template magic implementation out of it. But for this, at the least we would need compile-time access to the target instruction set. Any idea how to do that?
2) Create a compiled library implementation out of it, and choose the implementation run-time based on the CPU capabilities.
3) Include some good defaults/defines, and hope the user will use them.
4) Don't include it, and do something completely different.
What do you think?
At the moment I am not sure, but pretty sure, that you donât have to rewrite uBLAS to support good performance. Â uBLAS is a pretty finelibrary and already has what you need. Â So just extend it. At least for 383 lines of code :-)
As I was pretty busy the last days, so I could not continue until today. Â I made some minimal modifications to adopt the GEMM implementationto take advantage of uBLAS:
- The GEMM frame algorithm and the pack functions now work with any uBLAS matrix that supports element access through the notation A(i,j)- Only for matrix C in the operation C <- beta*C + alpha*A*B it requires that the matrix is stored row-major or col-major. Â The other matrices canbe expressions. Here I need some help to get rid of this ugly lines of code:
  TC *C_ = &C(0,0);  const size_type incRowC = &C(1,0) - &C(0,0);  const size_type incColC = &C(0,1) - &C(0,0);
- Things like the following should work without performance penalty (and in particularly without creating temporaries):   blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate) - And matrices A and B can be symmetric or whatever.  Of course if A or B is triangular a specialized implementation can take advantage- The storage format does not matter.  You can mix row-major and col-major, packed, â¦
Here is the page for this:
http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session4/page01.html
Please note, creating the benchmarks for the symmetric matrix-matrix product really takes long because the current uBLAS implementation seemsto be much slower than for general matrices. Â So I reduced the problem size for now. Â It would be helpful if some people could reproduce the benchmarksand also check different cases:
- different expressions- different element types, e.g. A with floats, B with double etc. Â At the moment there is only a micro kernel for double. Â The performance depends on thecommon type of A and B. Â So with complex<..> the reference implementation. Â But the performance should at least be constant in this case.
It would in particular be nice to have benchmarks from people with an Intel Sandybridge or Haswell as the micro kernels are optimized for thesearchitectures. Â If interested I can extend the benchmarks to compare with Intel MKL. Â For Linux there is a free non-commercial version available.
Cheers,
Michael