
Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-23 07:24:03


Hi Michael,

I have yet to check the machine code, but as I understand it, your gemm approach is optimised for fmadd.

On my old AMD box, setting MR to 4 and NR to 16, and flipping the loop on l with the loop on j in ugemm (using cache-aligned blocks), gave a more than two-fold speedup compared to the vanilla MR = 16, NR = 16 case:
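The loop order in question is roughly the following. This is only a sketch, not the actual ugemm from Michael's code: the packed-panel layout and the names are assumptions, and the real kernel works on cache-aligned packing buffers.

    // Sketch only -- not the actual ugemm.  MR/NR and the packed-panel layout
    // (A_: MR x kc column-wise, B_: kc x NR row-wise, both cache-aligned) are
    // assumptions; the point is the loop order.
    template <typename Index, typename T>
    void
    ugemm_flipped(Index kc, T alpha, const T *A_, const T *B_,
                  T beta, T *C, Index incRowC, Index incColC)
    {
        constexpr int MR = 4, NR = 16;
        T P[MR*NR] = {};                    // micro-block accumulator

        // j outermost, l in the middle, i innermost: the 4-wide column
        // P[0..MR) stays live in a register across the whole l loop, which
        // maps directly onto fmadd with 4-double AVX registers.
        for (Index j=0; j<NR; ++j) {
            for (Index l=0; l<kc; ++l) {
                for (Index i=0; i<MR; ++i) {
                    P[i+j*MR] += A_[i+l*MR] * B_[j+l*NR];
                }
            }
        }

        // C <- beta*C + alpha*P
        for (Index j=0; j<NR; ++j) {
            for (Index i=0; i<MR; ++i) {
                C[i*incRowC+j*incColC] = beta*C[i*incRowC+j*incColC]
                                       + alpha*P[i+j*MR];
            }
        }
    }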

$ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod2 matprod2.cc
$ ./matprod2
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
  100 100 100 0.0581846 34.3734 0.00188958 1058.44 0
  200 200 200 0.414865 38.5667 0.0123334 1297.29 0
  300 300 300 1.37691 39.2183 0.0387481 1393.62 1.0845e-06
  400 400 400 3.26061 39.2565 0.0883939 1448.06 4.04763e-06
  500 500 500 6.36875 39.2542 0.1813 1378.93 9.09805e-06
  600 600 600 10.8398 39.853 0.309016 1397.98 1.65132e-05
  700 700 700 17.2165 39.8454 0.484826 1414.94 2.72393e-05
  800 800 800 25.6953 39.8516 0.712666 1436.86 4.1317e-05
  900 900 900 36.655 39.7763 1.03119 1413.9 5.92538e-05
 1000 1000 1000 50.6028 39.5235 1.39929 1429.3 8.21395e-05
$ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod3 matprod3.cc
$ ./matprod3
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
  100 100 100 0.0528992 37.8078 0.000628359 3182.89 0
  200 200 200 0.414929 38.5609 0.00468125 3417.89 0
  300 300 300 1.37729 39.2073 0.014938 3614.94 1.08932e-06
  400 400 400 3.26166 39.2438 0.0340622 3757.83 4.06563e-06
  500 500 500 6.37056 39.243 0.0670238 3730.02 9.1488e-06
  600 600 600 10.8375 39.8615 0.112301 3846.8 1.65072e-05
  700 700 700 17.2216 39.8336 0.181384 3782.02 2.71309e-05
  800 800 800 25.7673 39.7402 0.280321 3652.95 4.14585e-05
  900 900 900 36.6848 39.7439 0.386776 3769.63 5.91959e-05
 1000 1000 1000 50.5229 39.586 0.525621 3805.03 8.21298e-05

--------------------------------------------
On Sat, 23/1/16, Michael Lehn <michael.lehn_at_[hidden]> wrote:

 Subject: Re: [ublas] Matrix multiplication performance
 To: "ublas mailing list" <ublas_at_[hidden]>
 Cc: "palik imre" <imre_palik_at_[hidden]>
 Date: Saturday, 23 January, 2016, 1:41
 
 I re-ran all the benchmarks with “-Ofast -mavx” (as the hardware on which I
 generate the benchmarks does not have AVX2).  So now the original
 ublas::axpy_prod() is actually doing way better ...
 
 
 On 22 Jan 2016, at 21:27, Michael Lehn <michael.lehn_at_[hidden]> wrote:
 
> Wow Imre!
>
> Ok, that is actually a significant difference :-)
>
> I have just added a new version to my site.  Unfortunately the computer I used
> for creating the page does not have posix_memalign().  So I had to hack my own
> function for that.  But I think for a proof of concept it will do.
>
> But most of all I also added micro-kernels that are (fairly) optimised for AVX
> and FMA.  The micro-kernel for AVX requires MR=4, NR=8.  For FMA it requires
> MR=4, NR=16.  Otherwise the reference implementation gets selected.  All
> parameters' default values can now be overwritten when compiling, e.g.
>
>     g++ -O3 -Wall -std=c++11 -DHAVE_AVX  -DBS_D_MC=512  matprod.cc
>
> The optimised micro-kernels are only included when compiled with -DHAVE_AVX
> or -DHAVE_FMA.
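(For readers wiring this up themselves: overridable compile-time defaults of this kind are usually arranged roughly as below. This is a sketch; the actual layout of gemm.hpp in the tar-ball may differ.)

    // Sketch: how a default like BS_D_MC can be overridden from the command
    // line (g++ ... -DBS_D_MC=512).  Names follow the -D flags above; the
    // actual gemm.hpp may be organised differently.
    #ifndef BS_D_MC
    #define BS_D_MC 256               // fallback blocking factor for M
    #endif

    namespace config {
        constexpr int MC = BS_D_MC;   // picked up by the blocked gemm driver
    }

    #if defined(HAVE_FMA)
    //  FMA micro-kernel selected here (needs MR=4, NR=16)
    #elif defined(HAVE_AVX)
    //  AVX micro-kernel selected here (needs MR=4, NR=8)
    #else
    //  generic reference micro-kernel
    #endif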
>
> I put all this stuff here
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html
>
> also the tar-ball
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz
>
> contains all required files:
>
>     session2/avx.hpp
>     session2/fma.hpp
>     session2/gemm.hpp
>     session2/matprod.cc
>
> Of course I would be interested how all this performs on other platforms.
>
>
> Cheers,
>
> Michael
>
>
> On 22 Jan 2016, at 16:20, palik imre <imre_palik_at_[hidden]> wrote:
>
>> Hi All,
>>
>> In the meantime I enabled avx2 ...
>>
>> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6 GFLOPS
>> So it is at ~25%
>>
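(For reference, the utilisation figure is simply the measured rate over the theoretical peak; the 17.2 GFLOPS value used below is read off the ~17000 MFLOPS plateau of the tuned run later in this message.)

    // Sketch: how the ~25% figure follows from the numbers in this thread.
    #include <cstdio>

    int main()
    {
        const double peak_gflops     = 69.6;   // theoretical maximum quoted above
        const double measured_gflops = 17.2;   // ~17000 MFLOPS from the tuned run
        std::printf("utilisation: %.0f%%\n",
                    100.0 * measured_gflops / peak_gflops);   // prints ~25%
    }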
>> Compiler is gcc 4.8.3
>>
>> vanilla run:
>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
>> $ ./matprod
>> #    m     n     k  uBLAS: t1    MFLOPS    Blocked: t2  MFLOPS    Diff nrm1
>>    100   100   100  0.000350402  5707.73   0.00116104   1722.59   0
>>    200   200   200  0.00445094   3594.74   0.00819996   1951.23   0
>>    300   300   300  0.0138515    3898.49   0.0266515    2026.15   1.06987e-06
>>    400   400   400  0.0266447    4803.96   0.0613506    2086.37   4.01475e-06
>>    500   500   500  0.0424372    5891.06   0.119345     2094.77   8.99605e-06
>>    600   600   600  0.0724648    5961.51   0.203187     2126.12   1.63618e-05
>>    700   700   700  0.115464     5941.25   0.325834     2105.36   2.69547e-05
>>    800   800   800  0.173003     5918.98   0.480655     2130.43   4.09449e-05
>>    900   900   900  0.248077     5877.2    0.689972     2113.13   5.87376e-05
>>   1000  1000  1000  0.33781      5920.49   0.930591     2149.17   8.16264e-05
>>   1100  1100  1100  0.5149       5169.93   1.25507      2121      0.000108883
>>   1200  1200  1200  0.670628     5153.38   1.62732      2123.74   0.000141876
>>   1300  1300  1300  0.852692     5153.09   2.06708      2125.71   0.000180926
>>   1400  1400  1400  1.06695      5143.65   2.56183      2142.22   0.000225975
>>   1500  1500  1500  1.3874       4865.2    3.16532      2132.49   0.000278553
>>   1600  1600  1600  1.77623      4612.03   3.8137       2148.05   0.000338106
>>   1700  1700  1700  2.3773       4133.26   4.56665      2151.69   0.000404458
>>   1800  1800  1800  3.06381      3807.03   5.40317      2158.73   0.00048119
>>   1900  1900  1900  3.9039       3513.92   6.37295      2152.53   0.000564692
>>   2000  2000  2000  4.79166      3339.13   7.43399      2152.28   0.000659714
>>   2100  2100  2100  6.04946      3061.76   8.62429      2147.65   0.000762223
>>   2200  2200  2200  7.39085      2881.4    9.86237      2159.32   0.000875624
>>   2300  2300  2300  9.00453      2702.42   11.2513      2162.78   0.00100184
>>   2400  2400  2400  10.3952      2659.68   12.7491      2168.62   0.00113563
>>   2500  2500  2500  12.2283      2555.55   14.4615      2160.92   0.00128336
>>   2600  2600  2600  13.8912      2530.51   16.1965      2170.34   0.00144304
>>   2700  2700  2700  15.391       2557.72   18.1998      2162.99   0.00161411
>>   2800  2800  2800  17.5673      2499.19   20.2171      2171.63   0.00180035
>>   2900  2900  2900  19.4621      2506.31   22.5482      2163.28   0.00199765
>>   3000  3000  3000  21.4506      2517.42   24.9477      2164.53   0.00221028
>>   3100  3100  3100  23.71        2512.95   27.5144      2165.48   0.00243877
>>   3200  3200  3200  25.9051      2529.85   30.2816      2164.22   0.00267766
>>   3300  3300  3300  28.1949      2549.18   33.176       2166.45   0.00293379
>>   3400  3400  3400  30.7235      2558.56   36.0156      2182.61   0.0032087
>>   3500  3500  3500  34.0419      2518.95   39.3929      2176.79   0.00349827
>>   3600  3600  3600  37.0562      2518.12   42.7524      2182.62   0.00380447
>>   3700  3700  3700  39.7885      2546.11   46.4748      2179.81   0.00412621
>>   3800  3800  3800  43.6607      2513.56   50.2119      2185.62   0.0044694
>>   3900  3900  3900  46.5104      2550.78   54.4822      2177.56   0.00482355
>>   4000  4000  4000  50.6098      2529.15   58.7686      2178.03   0.00520289
>>
>> tuned run:
>>
>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
>> $ ./matprod2
>> #    m     n     k  uBLAS: t1    MFLOPS    Blocked: t2  MFLOPS    Diff nrm1
>>    100   100   100  0.000351671  5687.13   0.000316612  6316.88   0
>>    200   200   200  0.00419531   3813.78   0.00159044   10060.1   0
>>    300   300   300  0.0141153    3825.62   0.00421113   12823.2   1.07645e-06
>>    400   400   400  0.0291599    4389.59   0.00858138   14916     4.00614e-06
>>    500   500   500  0.0483492    5170.72   0.0166519    15013.3   8.96808e-06
>>    600   600   600  0.0725783    5952.19   0.0279634    15448.7   1.63386e-05
>>    700   700   700  0.113891     6023.29   0.043077     15925     2.69191e-05
>>    800   800   800  0.171416     5973.79   0.0627796    16311     4.09782e-05
>>    900   900   900  0.243677     5983.32   0.0922766    15800.3   5.88092e-05
>>   1000  1000  1000  0.335158     5967.33   0.123339     16215.5   8.15988e-05
>>   1100  1100  1100  0.515776     5161.15   0.16578      16057.5   0.000108991
>>   1200  1200  1200  0.662706     5214.98   0.205989     16777.6   0.000141824
>>   1300  1300  1300  0.845952     5194.15   0.27637      15899     0.00018111
>>   1400  1400  1400  1.06712      5142.82   0.332118     16524.2   0.000225958
>>   1500  1500  1500  1.38147      4886.11   0.409224     16494.6   0.000278265
>>   1600  1600  1600  1.72238      4756.21   0.492314     16639.8   0.000338095
>>   1700  1700  1700  2.38508      4119.77   0.603566     16279.9   0.000404362
>>   1800  1800  1800  3.12034      3738.05   0.717409     16258.5   0.000481575
>>   1900  1900  1900  3.93668      3484.66   0.824933     16629.2   0.000564727
>>   2000  2000  2000  4.76038      3361.07   0.941643     16991.6   0.00065862
>>   2100  2100  2100  5.90627      3135.99   1.12226      16504.2   0.000762307
>>   2200  2200  2200  7.26419      2931.64   1.28213      16609.9   0.000876699
>>   2300  2300  2300  8.88171      2739.79   1.45247      16753.5   0.00100222
>>   2400  2400  2400  10.4956      2634.26   1.62705      16992.7   0.00113566
>>   2500  2500  2500  11.913       2623.18   1.87499      16666.7   0.00128371
>>   2600  2600  2600  13.7057      2564.77   2.1156       16615.6   0.00144259
>>   2700  2700  2700  15.5959      2524.13   2.33957      16826.1   0.00161501
>>   2800  2800  2800  17.1121      2565.67   2.57445      17053.8   0.00179901
>>   2900  2900  2900  19.4167      2512.16   2.92445      16679.4   0.00199764
>>   3000  3000  3000  21.3239      2532.37   3.18891      16933.7   0.00220999
>>   3100  3100  3100  23.5049      2534.88   3.5305       16876.4   0.00243845
>>   3200  3200  3200  25.7362      2546.45   3.81708      17169.1   0.00267581
>>   3300  3300  3300  28.4467      2526.62   4.25869      16877     0.00293513
>>   3400  3400  3400  30.4607      2580.63   4.67999      16796.6   0.00320688
>>   3500  3500  3500  33.7737      2538.96   5.04289      17004.1   0.00349667
>>   3600  3600  3600  36.9633      2524.45   5.414        17235.3   0.00380237
>>   3700  3700  3700  39.5153      2563.71   6.04875      16748.2   0.00412583
>>   3800  3800  3800  42.9412      2555.68   6.48985      16910.1   0.00446785
>>   3900  3900  3900  46.5282      2549.81   7.05844      16808     0.00482701
>>   4000  4000  4000  50.2218      2548.69   7.42442      17240.4   0.00520272
>>
>> As the generated machine code is completely different, I guess gcc notices
>> the aligned alloc and uses the alignment information for optimisation.
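(If that is what is happening, the alignment can also be made explicit to the compiler instead of inferred. A minimal sketch using a GCC/Clang builtin, not taken from the posted code:)

    // Sketch: state the alignment explicitly rather than hoping gcc infers it.
    // 32 bytes matches the AVX register width.  Illustrative only.
    #include <cstddef>

    void scale(double *p, std::size_t n, double alpha)
    {
        double *q = static_cast<double *>(__builtin_assume_aligned(p, 32));
        for (std::size_t i = 0; i < n; ++i) {
            q[i] *= alpha;        // compiler may now emit aligned vector ops
        }
    }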
>>
>> Cheers,
>>
>> Imre
>>
>>
>> On Friday, 22 January 2016, 15:09, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>>
>>
>> Hi Imre,
>>
>> thanks for running the benchmarks.  Of course you are right that using aligned
>> memory for the buffers improves performance.  I also did not really put any
>> effort in optimising the parameters MC, NC, KC, MR and NR.  I will compare
>> different variants and report them on the website
>>
>>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>>
>> I modified my benchmark program such that it also computes the FLOPS as
>>
>>     FLOPS = 2*m*n*k/time_elapsed
>>
>> See
>>
>>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
>>
>> Could you re-run your benchmarks and post the different MFLOPS you get?  That
>> is important for actually tuning things.
>> On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS
>> instead of 25.6 GFLOPS).  So a speedup of 2.5 would be impressive but still
>> far from peak performance.
>>
>> Cheers,
>>
>> Michael
>>
>>
>> On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:
>>
>>> Sorry for posting twice more or less the same thing.  I got confused with
>>> javascript interfaces.
>>>
>>> It seems I also forgot to enable avx for my last measurements.  With that +
>>> my blocking and alignment changes, performance according to my tests is
>>> something like 250% higher than running Michael's original code (with avx).
>>>
>>> Cheers,
>>>
>>> Imre
>>>
>>>
>>> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
>>>
>>>
>>> Hi Michael,
>>>
>>> your blocksizes are far from optimal.  MR & NR should be multiples of the L1
>>> cache line size (i.e. 16 for double on Intel).  Also, the blocks should be
>>> allocated aligned to L1 cache lines (e.g., via posix_memalign()).
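(A minimal sketch of the kind of allocation meant here; the 64-byte alignment and the helper name are illustrative, not from the posted code:)

    // Sketch: allocate a packing buffer aligned to a cache-line boundary.
    // 64 bytes is the usual Intel L1 line size.
    #include <cstdlib>

    double *alloc_aligned(std::size_t n)
    {
        void *p = nullptr;
        if (posix_memalign(&p, 64, n * sizeof(double)) != 0) {
            return nullptr;               // allocation failed
        }
        return static_cast<double *>(p);  // release later with free()
    }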
>>>
>>> This alone brought something like 50% speedup for my square matrix test.
>>>
>>> I will have a look at the other parameters + the whole thing via perf during
>>> the weekend.
>>>
>>> Cheers,
>>>
>>> Imre
>>>
>>>
>>> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]"
>>> <ublas-request_at_[hidden]> wrote:
>>>
>>> Subject: Re: [ublas] Matrix multiplication performance
>>> Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B_at_[hidden]>
>>> Content-Type: text/plain; charset="windows-1252"
>>>
>>> Hi Nasos,
>>>
>>> first of all I don't want to take wrong credit and want to point out that
>>> this is not my algorithm.  It is based on
>>>
>>>     http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>>
>>>     https://github.com/flame/blis
>>>
>>> For a few cores (4-8) it can easily be made multithreaded.  For many-cores
>>> like Intel Xeon Phi this is a bit more sophisticated but still not too hard.
>>> The demo I posted does not use micro-kernels that exploit SSE, AVX or FMA
>>> instructions.  With those the matrix product is on par with Intel MKL, just
>>> like BLIS.  For my platforms I wrote my own micro-kernels, but the interface
>>> of the function ugemm is compatible with BLIS.
>>>
>>> Maybe you could help me to integrate your code in the benchmark example I
>>> posted above.
>>>
>>> About Blaze:  Do they have their own implementation of a matrix-matrix
>>> product?  It seems to require a tuned BLAS implementation ("Otherwise you
>>> get only poor performance") for the matrix-matrix product.
>>> IMHO they only have tuned the "easy" stuff like BLAS Level 1 and Level 2.
>>> In that case it makes more sense to compare the performance with the actual
>>> underlying GEMM implementation.  But if I am wrong, let me know.
>>>
>>> About the block size: In my experience you get better performance if you
>>> choose them dynamically at runtime depending on the problem size.  Depending
>>> on the architecture you can just specify ranges like 256 - 384 for the
>>> blocking factor MC.  In my code it then also needs to satisfy the restriction
>>> that it is divisible by the factor MR.
>>> I know that doing things at compile time is a C++ fetish.  But the runtime
>>> overhead is negligible and having blocks of similar sizes easily pays off.
>>>
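(A sketch of the kind of runtime choice described here: the 256-384 range and the divisibility by MR come from the paragraph above; the names and the clamping rule are illustrative.)

    // Sketch: pick the blocking factor MC at runtime -- clamp it into a
    // preferred range and round down to a multiple of MR.
    #include <algorithm>
    #include <cstddef>

    std::size_t choose_mc(std::size_t m, std::size_t mr,
                          std::size_t mc_min = 256, std::size_t mc_max = 384)
    {
        std::size_t mc = std::min(std::max(m, mc_min), mc_max);
        mc -= mc % mr;                    // must be divisible by MR
        return std::max(mc, mr);          // never below one micro-panel
    }

    // e.g. choose_mc(1000, 4) == 384, choose_mc(300, 4) == 300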
 
>>> Cheers,
>>>
>>> Michael
>>>
>>> *************************************
>>>
>>
>>
>>
>
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: michael.lehn_at_[hidden]
>