On Sun, Jan 24, 2016 at 8:50 AM, palik imre <imre_palik@yahoo.co.uk> wrote:

Hi Michael,

I had a look on your AVX microkernel on my old AMD box. Congratulations, it is twice as fast as what I managed to get out of my optimised C kernel.

I wonder if I can catch up with that using newer gcc + SIMD arrays or intrinsics.

Cheers,

Imre

--------------------------------------------
On Fri, 22/1/16, Michael Lehn <michael.lehn@uni-ulm.de> wrote:

Subject: Re: [ublas] Matrix multiplication performance
To: "palik imre" <imre_palik@yahoo.co.uk>
Cc: "ublas mailing list" <ublas@lists.boost.org>
Date: Friday, 22 January, 2016, 21:27

Wow Imre!

Ok, that is actually a
significant difference :-)

I have just added a new version to my site.
Unfortunately the computer I used for creating
the page does not have posix_memalign(). So I
had to hack my own function for that. But
I think for a proof of concept it will do.

But most of all I also added
micro-kernel that are (fairly) optimised for AVX and FMA.
The
micro-kernel for AVX require MR=4,
NR=8. For FMA it requires MR=4, NR=16. Otherwise
the reference implementation gets selected.
For all parameters default values now can be
overwritten when compiling, e.g.

     g++ -O3 -Wall
-std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc

The optimised micro kernels
are only included when compiled with -DHAVE_AVX or
-DHAVE_FMA

I put all this
stuff here

     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html

also the tar-ball

     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz

contains all required
files:


session2/avx.hpp
     session2/fma.hpp
     session2/gemm.hpp

session2/matprod.cc

Of
course I would be interested how all this performs on other
platforms.

Cheers,

Michael

On 22
Jan 2016, at 16:20, palik imre <imre_palik@yahoo.co.uk>
wrote:

> Hi All,
>
> In the meantime I
enabled avx2 ...
>
>
Theoretical CPU performance maximum: clock * vector-size *
ALUs/Core * 2 ~ 69.6GFLOPS
> So it is at
~25%
>
> Compiler is
gcc 4.8.3
>
> vanilla
run:
> $ g++ -Wall -W -std=gnu++11 -Ofast
-march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o
matprod matprod.cc
> $ ./matprod
> #   m    n
    k uBLAS:   t1
    MFLOPS   Blocked:   t2
   MFLOPS Diff nrm1
>   100   100   100
0.000350402 5707.73 0.00116104
1722.59    0
>   200   200   200   0.00445094
   3594.74 0.00819996 1951.23
      0
>   300   300   300
   0.0138515 3898.49    0.0266515
   2026.15    1.06987e-06
>   400   400   400
   0.0266447 4803.96    0.0613506
   2086.37    4.01475e-06
>   500   500   500
   0.0424372 5891.06 0.119345
2094.77    8.99605e-06
>   600   600   600
   0.0724648 5961.51 0.203187
2126.12    1.63618e-05
>   700   700   700
    0.115464 5941.25 0.325834
   2105.36    2.69547e-05
>   800   800   800
    0.173003 5918.98 0.480655
   2130.43    4.09449e-05
>   900   900   900
    0.248077    5877.2
0.689972 2113.13    5.87376e-05
> 1000 1000 1000 0.33781
   5920.49 0.930591 2149.17
    8.16264e-05
> 1100
1100 1100    0.5149 5169.93
      1.25507    2121
    0.000108883
> 1200
1200 1200    0.670628 5153.38
    1.62732 2123.74
    0.000141876
> 1300
1300 1300    0.852692 5153.09
    2.06708 2125.71
    0.000180926
> 1400
1400 1400 1.06695 5143.65
    2.56183 2142.22
    0.000225975
> 1500
1500 1500    1.3874
    4865.2    3.16532
2132.49    0.000278553
>
1600 1600 1600 1.77623 4612.03
   3.8137 2148.05    0.000338106
> 1700 1700 1700
    2.3773 4133.26
    4.56665 2151.69
    0.000404458
> 1800
1800 1800 3.06381 3807.03
    5.40317 2158.73 0.00048119
> 1900 1900 1900
    3.9039 3513.92
    6.37295 2152.53
    0.000564692
> 2000
2000 2000 4.79166 3339.13
    7.43399 2152.28
    0.000659714
> 2100
2100 2100 6.04946 3061.76
    8.62429 2147.65
    0.000762223
> 2200
2200 2200 7.39085    2881.4
      9.86237 2159.32
    0.000875624
> 2300
2300 2300 9.00453 2702.42
    11.2513 2162.78 0.00100184
> 2400 2400 2400 10.3952
   2659.68    12.7491 2168.62
   0.00113563
> 2500 2500 2500
   12.2283 2555.55
    14.4615 2160.92 0.00128336
> 2600 2600 2600 13.8912
   2530.51    16.1965 2170.34
   0.00144304
> 2700 2700 2700
      15.391 2557.72
    18.1998 2162.99 0.00161411
> 2800 2800 2800 17.5673
   2499.19    20.2171 2171.63
   0.00180035
> 2900 2900 2900
   19.4621 2506.31
    22.5482 2163.28 0.00199765
> 3000 3000 3000 21.4506
   2517.42    24.9477 2164.53
   0.00221028
> 3100 3100 3100
   23.71 2512.95
    27.5144 2165.48 0.00243877
> 3200 3200 3200 25.9051
   2529.85    30.2816 2164.22
   0.00267766
> 3300 3300 3300
   28.1949 2549.18 33.176
2166.45 0.00293379
> 3400
3400 3400 30.7235 2558.56
    36.0156 2182.61
    0.0032087
> 3500 3500
3500 34.0419 2518.95
    39.3929 2176.79 0.00349827
> 3600 3600 3600 37.0562
   2518.12    42.7524 2182.62
   0.00380447
> 3700 3700 3700
   39.7885 2546.11
    46.4748 2179.81 0.00412621
> 3800 3800 3800 43.6607
   2513.56    50.2119 2185.62
      0.0044694
> 3900
3900 3900 46.5104 2550.78
    54.4822 2177.56 0.00482355
> 4000 4000 4000 50.6098
   2529.15    58.7686 2178.03
   0.00520289
>
>
tuned run:
>
> $ g++
-Wall -W -std=gnu++11 -Ofast -march=core-avx2
-mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2
matprod2.cc
> $ ./matprod2
> #   m    n
    k uBLAS:   t1
    MFLOPS   Blocked:   t2
   MFLOPS Diff nrm1
>   100   100   100
0.000351671 5687.13    0.000316612
   6316.88    0
>   200   200   200   0.00419531
   3813.78 0.00159044 10060.1
      0
>   300   300   300
   0.0141153 3825.62 0.00421113
12823.2    1.07645e-06
>   400   400   400
   0.0291599 4389.59 0.00858138
14916    4.00614e-06
>   500   500   500
   0.0483492 5170.72    0.0166519
   15013.3    8.96808e-06
>   600   600   600
   0.0725783 5952.19    0.0279634
   15448.7    1.63386e-05
>   700   700   700
    0.113891 6023.29 0.043077
   15925    2.69191e-05
>   800   800   800
    0.171416 5973.79
    0.0627796 16311
    4.09782e-05
>   900   900   900
    0.243677 5983.32
    0.0922766 15800.3
    5.88092e-05
> 1000
1000 1000    0.335158 5967.33
   0.123339 16215.5    8.15988e-05
> 1100 1100 1100
    0.515776 5161.15
    0.16578 16057.5
    0.000108991
> 1200
1200 1200    0.662706 5214.98
   0.205989 16777.6    0.000141824
> 1300 1300 1300
    0.845952 5194.15
    0.27637 15899 0.00018111
> 1400 1400 1400 1.06712
   5142.82 0.332118 16524.2
    0.000225958
> 1500
1500 1500 1.38147 4886.11
0.409224 16494.6    0.000278265
> 1600 1600 1600 1.72238
   4756.21 0.492314 16639.8
    0.000338095
> 1700
1700 1700 2.38508 4119.77
0.603566 16279.9    0.000404362
> 1800 1800 1800 3.12034
   3738.05 0.717409 16258.5
    0.000481575
> 1900
1900 1900 3.93668 3484.66
0.824933 16629.2    0.000564727
> 2000 2000 2000 4.76038
   3361.07 0.941643 16991.6
0.00065862
> 2100 2100 2100
5.90627 3135.99    1.12226
   16504.2    0.000762307
> 2200 2200 2200 7.26419
   2931.64    1.28213 16609.9
    0.000876699
> 2300
2300 2300 8.88171 2739.79
    1.45247 16753.5 0.00100222
> 2400 2400 2400 10.4956
   2634.26    1.62705 16992.7
   0.00113566
> 2500 2500 2500
      11.913 2623.18
    1.87499 16666.7 0.00128371
> 2600 2600 2600 13.7057
   2564.77 2.1156 16615.6
0.00144259
> 2700 2700 2700
15.5959 2524.13    2.33957
   16826.1 0.00161501
> 2800
2800 2800 17.1121 2565.67
    2.57445 17053.8 0.00179901
> 2900 2900 2900 19.4167
   2512.16    2.92445 16679.4
   0.00199764
> 3000 3000 3000
   21.3239 2532.37
    3.18891 16933.7 0.00220999
> 3100 3100 3100 23.5049
   2534.88 3.5305 16876.4
0.00243845
> 3200 3200 3200
25.7362 2546.45    3.81708
   17169.1 0.00267581
> 3300
3300 3300 28.4467 2526.62
    4.25869 16877 0.00293513
> 3400 3400 3400 30.4607
   2580.63    4.67999 16796.6
   0.00320688
> 3500 3500 3500
   33.7737 2538.96
    5.04289 17004.1 0.00349667
> 3600 3600 3600 36.9633
   2524.45    5.414 17235.3
   0.00380237
> 3700 3700 3700
   39.5153 2563.71
    6.04875 16748.2 0.00412583
> 3800 3800 3800 42.9412
   2555.68    6.48985 16910.1
   0.00446785
> 3900 3900 3900
   46.5282 2549.81
    7.05844 16808 0.00482701
> 4000 4000 4000 50.2218
   2548.69    7.42442 17240.4
   0.00520272
>
>
As the generated machine code is completely different, I
guess gcc notices the aligned alloc, and uses the alignment
information for optimisation.
>
> Cheers,
>
> Imre
>
>
> On Friday, 22
January 2016, 15:09, Michael Lehn <michael.lehn@uni-ulm.de>
wrote:
>
>
> Hi Imre,
>
> thanks for running the benchmarks. Of
course you are right that using aligned memory for the
buffers improves
> performance. I also
did not really put any effort in optimising the parameters
MC, NC, KC, MR and NR. I will
> compare
different variants and report them on the website
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>
> I modified my
benchmark program such that it also computes the FLOPS as
>
>     FLOPS =
2*m*n*k/time_elpased
>
> See
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
>
> Could you re-run
your benchmarks and post the different MFLOPS you get? That
is important for actually tuning thing.
>
On my machine my code only reaches 20% of the peak
performance (about 5 GFLOPS instead of 25.6
GFLOPS).   So
> a speedup of
2.5 would be impressive but still far from peak
performance.
>
>
Cheers,
>
>
Michael
>
>
> On 22 Jan 2016, at 11:03, palik imre
<imre_palik@yahoo.co.uk>
wrote:
>
>> Sorry
for posting twice more or less the same thing. I got
confused with javascript interfaces.
>>
>> It seems I
also forgot to enable avx for my last measurements. With
that + my blocking and alignment changes, performance
according to my tests is something like 250% higher than
running Michael's original code (with avx).
>>
>> Cheers,
>>
>> Imre
>>
>>
>> On Friday, 22 January 2016, 10:33,
palik imre <imre_palik@yahoo.co.uk>
wrote:
>>
>>

>> Hi Michael,
>>
>> your
blocksizes are far from optimal. MR & NR should be
multiples of the L1 cache line size (i.e. 16 for double on
Intel). Also, the blocks should be allocated aligned to L1
cache lines (e.g., via posix_memalign()).
>>
>> This alone
brought something like 50% speedup for my square matrix
test.
>>
>> I
will have a look at the other parameters + the whole thing
via perf during the weekend.
>>
>> Cheers,
>>
>> Imre
>>
>>
>>
>> On Friday, 22 January 2016, 0:28,
"ublas-request@lists.boost.org"
<ublas-request@lists.boost.org>
wrote:
>>
>>

>> Subject: Re: [ublas] Matrix
multiplication performance
>>
Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B@uni-ulm.de>
>> Content-Type: text/plain;
charset="windows-1252"
>>

>> Hi Nasos,
>>

>> first of all I don?t want to take
wrong credits and want to point out that this is not my
algorithm. It is based on
>>
>>    http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>
>>
    https://github.com/flame/blis
>>
>> For a few
cores (4-8) it can easily made multithreaded. For
many-cores like Intel Xeon Phi this is a bit more
>> sophisticated but still not too
hard. The demo I posted does not use micro kernels that
exploit SSE, AVX or
>> FMA
instructions. With that the matrix product is on par with
Intel MKL. Just like BLIS. For my platforms I wrote
>> my own micro-kernels but the interface
of function ugemm is compatible to BLIS.
>>
>> Maybe you
could help me to integrate your code in the benchmark
example I posted above.
>>
>> About Blaze: Do they have their own
implementation of a matrix-matrix product? It seems to
require a
>> tuned BLAS implementation
(?Otherwise you get only poor performance?) for the
matrix-matrix product.
>> IMHO they
only have tuned the ?easy? stuff like BLAS Level1 and
Level2. In that case it makes more
>> sense to compare the performance with
the actual underlying GEMM implementation. But if I am
wrong,
>> let me know.
>>
>> About the
block size: In my experience you get better performance if
you chose them dynamically at runtime
>> depending on the problem size.
Depending on the architecture you can just specify ranges
like 256 - 384 for
>> blocking factor
MC. In my code it then also needs to satisfy the
restriction that it can be divided by factor MR.
>> I know that doing things at compile
time is a C++ fetish. But the runtime overhead is
negligible and having
>> blocks of
similar sizes easily pays of.
>>
>> Cheers,
>>
>> Michael
>>
*************************************
>>
>>
>>
>>
>>
_______________________________________________
>> ublas mailing list
>> ublas@lists.boost.org
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to: michael.lehn@uni-ulm.de
>
>
>
_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: rrossi@cimne.upc.edu

Riccardo Rossi

PhD, Civil Engineer

member of the Kratos Team: www.cimne.com/kratos

lecturer at Universitat Politècnica de Catalunya, BarcelonaTech (UPC)

Research fellow at International Center for Numerical Methods in Engineering (CIMNE)

C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9

08034 – Barcelona – Spain – www.cimne.com -

T.(+34) 93 401 56 96 skype: rougered4

Les dades personals contingudes en aquest missatge són tractades amb la finalitat de mantenir el contacte professional entre CIMNE i voste. Podra exercir els drets d'accés, rectificació, cancel·lació i oposició, dirigint-se a cimne@cimne.upc.edu. La utilització de la seva adreça de correu electronic per part de CIMNE queda subjecte a les disposicions de la Llei 34/2002, de Serveis de la Societat de la Informació i el Comerç Electronic.

Imprimiu aquest missatge, només si és estrictament necessari.