Could you also compile with -DHAVE_FMA? However, I noticed that I messed up the file “fma.hpp” when_______________________________________________I created the page. So you should replace it withAt the moment I regenerate the pages on the websiteOn 28 Jan 2016, at 18:08, palik imre <imre_palik@yahoo.co.uk> wrote:Wow:$ cd session4
$ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
matprod.cc: In function ���double estimateGemmResidual(const MA&, const MB&, const MC0&, const MC1&)���:
matprod.cc:94:40: warning: typedef ���TC0��� locally defined but not used [-Wunused-local-typedefs]
typedef typename MC0::value_type TC0;
^
$ ./a.out
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
100 100 100 0.000465896 4292.8 0.00405376 493.369 0
200 200 200 0.0046712 3425.24 0.00244891 6533.51 0
300 300 300 0.0161551 3342.59 0.00314843 17151.4 1.94151e-16
400 400 400 0.032688 3915.81 0.00240584 53204 8.40477e-17
500 500 500 0.0544033 4595.31 0.00291218 85846.4 3.52062e-17
600 600 600 0.0838317 5153.18 0.00453539 95250.9 1.63452e-17
700 700 700 0.130899 5240.69 0.00585201 117225 8.44769e-18
800 800 800 0.201121 5091.46 0.010872 94186.5 4.72429e-18
900 900 900 0.286407 5090.65 0.0151735 96088.7 2.80443e-18
1000 1000 1000 0.432211 4627.37 0.0707567 28265.9 1.7626e-18
1100 1100 1100 0.511146 5207.9 0.0186911 142420 1.14521e-18
1200 1200 1200 0.666975 5181.6 0.025109 137640 7.79963e-19
1300 1300 1300 0.863769 5087.01 0.0283398 155047 5.45468e-19
1400 1400 1400 1.09638 5005.57 0.143209 38321.6 3.90302e-19
1500 1500 1500 1.40352 4809.33 0.120096 56204.9 2.8667e-19
$ cd ../session2
$ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
[ec2-user@ip-10-0-46-255 session2]$ ./a.out
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
100 100 100 0.000471888 4238.29 0.000231317 8646.14 0
200 200 200 0.00431625 3706.92 0.00121122 13209.9 0
300 300 300 0.0153292 3522.69 0.00336464 16049.3 1.07937e-06
400 400 400 0.0317138 4036.1 0.00712568 17963.2 4.06488e-06
500 500 500 0.052809 4734.04 0.0121626 20554.9 9.09947e-06
600 600 600 0.0828121 5216.63 0.020657 20913 1.65243e-05
700 700 700 0.131053 5234.51 0.0318276 21553.6 2.71365e-05
800 800 800 0.196825 5202.58 0.0482679 21214.9 4.12109e-05
900 900 900 0.281006 5188.51 0.0671323 21718.3 5.93971e-05
1000 1000 1000 0.386332 5176.89 0.0906054 22073.7 8.23438e-05
1100 1100 1100 0.51667 5152.22 0.124346 21408.1 0.000109566
1200 1200 1200 0.668425 5170.37 0.159701 21640.5 0.000142817
1300 1300 1300 0.860445 5106.66 0.203472 21595.1 0.000182219
1400 1400 1400 1.08691 5049.18 0.249427 22002.4 0.000226999
1500 1500 1500 1.38244 4882.67 0.307519 21949.9 0.000280338This is on HaswellOn Wednesday, 27 January 2016, 1:39, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
On 23 Jan 2016, at 18:53, palik imre <imre_palik@yahoo.co.uk> wrote:Hi All,what's next? I mean what is the development process for ublas?Now we have a C-like implementation that outperforms both the mainline, and the branch version (axpy_prod). What will we do with that?As far as I see we have the following options:1) Create a C++ template magic implementation out of it. But for this, at the least we would need compile-time access to the target instruction set. Any idea how to do that?2) Create a compiled library implementation out of it, and choose the implementation run-time based on the CPU capabilities.3) Include some good defaults/defines, and hope the user will use them.4) Don't include it, and do something completely different.What do you think?At the moment I am not sure, but pretty sure, that you don’t have to rewrite uBLAS to support good performance. uBLAS is a pretty finelibrary and already has what you need. So just extend it. At least for 383 lines of code :-)As I was pretty busy the last days, so I could not continue until today. I made some minimal modifications to adopt the GEMM implementationto take advantage of uBLAS:- The GEMM frame algorithm and the pack functions now work with any uBLAS matrix that supports element access through the notation A(i,j)- Only for matrix C in the operation C <- beta*C + alpha*A*B it requires that the matrix is stored row-major or col-major. The other matrices canbe expressions. Here I need some help to get rid of this ugly lines of code:TC *C_ = &C(0,0);const size_type incRowC = &C(1,0) - &C(0,0);const size_type incColC = &C(0,1) - &C(0,0);- Things like the following should work without performance penalty (and in particularly without creating temporaries):blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate)- And matrices A and B can be symmetric or whatever. Of course if A or B is triangular a specialized implementation can take advantage- The storage format does not matter. You can mix row-major and col-major, packed, …Here is the page for this:Please note, creating the benchmarks for the symmetric matrix-matrix product really takes long because the current uBLAS implementation seemsto be much slower than for general matrices. So I reduced the problem size for now. It would be helpful if some people could reproduce the benchmarksand also check different cases:- different expressions- different element types, e.g. A with floats, B with double etc. At the moment there is only a micro kernel for double. The performance depends on thecommon type of A and B. So with complex<..> the reference implementation. But the performance should at least be constant in this case.It would in particular be nice to have benchmarks from people with an Intel Sandybridge or Haswell as the micro kernels are optimized for thesearchitectures. If interested I can extend the benchmarks to compare with Intel MKL. For Linux there is a free non-commercial version available.Cheers,Michael
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: michael.lehn@uni-ulm.de