Wow:

$ cd session4
$ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
matprod.cc: In function ���double estimateGemmResidual(const MA&, const MB&, const MC0&, const MC1&)���:
matprod.cc:94:40: warning: typedef ���TC0��� locally defined but not used [-Wunused-local-typedefs]
     typedef typename MC0::value_type   TC0;
                                        ^
$ ./a.out
#   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
  100   100   100  0.000465896       4292.8      0.00405376      493.369               0
  200   200   200    0.0046712      3425.24      0.00244891      6533.51               0
  300   300   300    0.0161551      3342.59      0.00314843      17151.4     1.94151e-16
  400   400   400     0.032688      3915.81      0.00240584        53204     8.40477e-17
  500   500   500    0.0544033      4595.31      0.00291218      85846.4     3.52062e-17
  600   600   600    0.0838317      5153.18      0.00453539      95250.9     1.63452e-17
  700   700   700     0.130899      5240.69      0.00585201       117225     8.44769e-18
  800   800   800     0.201121      5091.46        0.010872      94186.5     4.72429e-18
  900   900   900     0.286407      5090.65       0.0151735      96088.7     2.80443e-18
 1000  1000  1000     0.432211      4627.37       0.0707567      28265.9      1.7626e-18
 1100  1100  1100     0.511146       5207.9       0.0186911       142420     1.14521e-18
 1200  1200  1200     0.666975       5181.6        0.025109       137640     7.79963e-19
 1300  1300  1300     0.863769      5087.01       0.0283398       155047     5.45468e-19
 1400  1400  1400      1.09638      5005.57        0.143209      38321.6     3.90302e-19
 1500  1500  1500      1.40352      4809.33        0.120096      56204.9      2.8667e-19
$ cd ../session2
$ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp -DM_MAX=1500 matprod.cc
[ec2-user@ip-10-0-46-255 session2]$ ./a.out
#   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
  100   100   100  0.000471888      4238.29     0.000231317      8646.14               0
  200   200   200   0.00431625      3706.92      0.00121122      13209.9               0
  300   300   300    0.0153292      3522.69      0.00336464      16049.3     1.07937e-06
  400   400   400    0.0317138       4036.1      0.00712568      17963.2     4.06488e-06
  500   500   500     0.052809      4734.04       0.0121626      20554.9     9.09947e-06
  600   600   600    0.0828121      5216.63        0.020657        20913     1.65243e-05
  700   700   700     0.131053      5234.51       0.0318276      21553.6     2.71365e-05
  800   800   800     0.196825      5202.58       0.0482679      21214.9     4.12109e-05
  900   900   900     0.281006      5188.51       0.0671323      21718.3     5.93971e-05
 1000  1000  1000     0.386332      5176.89       0.0906054      22073.7     8.23438e-05
 1100  1100  1100      0.51667      5152.22        0.124346      21408.1     0.000109566
 1200  1200  1200     0.668425      5170.37        0.159701      21640.5     0.000142817
 1300  1300  1300     0.860445      5106.66        0.203472      21595.1     0.000182219
 1400  1400  1400      1.08691      5049.18        0.249427      22002.4     0.000226999
 1500  1500  1500      1.38244      4882.67        0.307519      21949.9     0.000280338

This is on Haswell


On Wednesday, 27 January 2016, 1:39, Michael Lehn <michael.lehn@uni-ulm.de> wrote:



On 23 Jan 2016, at 18:53, palik imre <imre_palik@yahoo.co.uk> wrote:

Hi All,

what's next?  I mean what is the development process for ublas?

Now we have a C-like implementation that outperforms both the mainline, and the branch version (axpy_prod).  What will we do with that?

As far as I see we have the following options:

1) Create a C++ template magic implementation out of it.  But for this, at the least we would need compile-time access to the target instruction set.  Any idea how to do that?

2) Create a compiled library implementation out of it, and choose the implementation run-time based on the CPU capabilities.

3) Include some good defaults/defines, and hope the user will use them.

4) Don't include it, and do something completely different.


What do you think?


At the moment I am not sure, but pretty sure, that you don’t have to rewrite uBLAS to support good performance.  uBLAS is a pretty fine
library and already has what you need.  So just extend it. At least for 383 lines of code :-)

As I was pretty busy the last days, so I could not continue until today.  I made some minimal modifications to adopt the GEMM implementation
to take advantage of uBLAS:

- The GEMM frame algorithm and the pack functions now work with any uBLAS matrix that supports element access through the notation A(i,j)
- Only for matrix C in the operation C <- beta*C + alpha*A*B it requires that the matrix is stored row-major or col-major.  The other matrices can
be expressions. Here I need some help to get rid of this ugly lines of code:

    TC *C_ = &C(0,0);
    const size_type incRowC = &C(1,0) - &C(0,0);
    const size_type incColC = &C(0,1) - &C(0,0);

- Things like the following should work without performance penalty (and in particularly without creating temporaries):
  blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate)
 
- And matrices A and B can be symmetric or whatever.  Of course if A or B is triangular a specialized implementation can take advantage
- The storage format does not matter.  You can mix row-major and col-major, packed, …

Here is the page for this:


Please note, creating the benchmarks for the symmetric matrix-matrix product really takes long because the current uBLAS implementation seems
to be much slower than for general matrices.  So I reduced the problem size for now.  It would be helpful if some people could reproduce the benchmarks
and also check different cases:

- different expressions
- different element types, e.g. A with floats, B with double etc.  At the moment there is only a micro kernel for double.  The performance depends on the
common type of A and B.  So with complex<..> the reference implementation.  But the performance should at least be constant in this case.

It would in particular be nice to have benchmarks from people with an Intel Sandybridge or Haswell as the micro kernels are optimized for these
architectures.  If interested I can extend the benchmarks to compare with Intel MKL.  For Linux there is a free non-commercial version available.


Cheers,

Michael