Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-23 07:24:03
Hi Michael,
I have yet to check the machine code, but according to my understanding, your gemm approach is optimised for fmadd.
On my old AMD box, setting MR to 4, NR to 16, and flipping the loop on l with the loop on j in ugemm (using cache-aligned blocks) caused a more than two-fold speedup compared to the vanilla MR = 16, NR = 16 case (a sketch of the interchange follows the numbers below):
$ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod2 matprod2.cc
$ ./matprod2
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
100 100 100 0.0581846 34.3734 0.00188958 1058.44 0
200 200 200 0.414865 38.5667 0.0123334 1297.29 0
300 300 300 1.37691 39.2183 0.0387481 1393.62 1.0845e-06
400 400 400 3.26061 39.2565 0.0883939 1448.06 4.04763e-06
500 500 500 6.36875 39.2542 0.1813 1378.93 9.09805e-06
600 600 600 10.8398 39.853 0.309016 1397.98 1.65132e-05
700 700 700 17.2165 39.8454 0.484826 1414.94 2.72393e-05
800 800 800 25.6953 39.8516 0.712666 1436.86 4.1317e-05
900 900 900 36.655 39.7763 1.03119 1413.9 5.92538e-05
1000 1000 1000 50.6028 39.5235 1.39929 1429.3 8.21395e-05
$ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod3 matprod3.cc
$ ./matprod3
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
100 100 100 0.0528992 37.8078 0.000628359 3182.89 0
200 200 200 0.414929 38.5609 0.00468125 3417.89 0
300 300 300 1.37729 39.2073 0.014938 3614.94 1.08932e-06
400 400 400 3.26166 39.2438 0.0340622 3757.83 4.06563e-06
500 500 500 6.37056 39.243 0.0670238 3730.02 9.1488e-06
600 600 600 10.8375 39.8615 0.112301 3846.8 1.65072e-05
700 700 700 17.2216 39.8336 0.181384 3782.02 2.71309e-05
800 800 800 25.7673 39.7402 0.280321 3652.95 4.14585e-05
900 900 900 36.6848 39.7439 0.386776 3769.63 5.91959e-05
1000 1000 1000 50.5229 39.586 0.525621 3805.03 8.21298e-05
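
For reference, the interchange I mean is roughly the following (a simplified sketch, not
the actual ugemm from gemm.hpp; the packing layout and indexing are only illustrative):

    #include <cstddef>

    // Sketch of the micro-kernel: accumulate an MR x NR block AB (column-major)
    // from packed panels A (MR x kc) and B (kc x NR).  The reference order is
    // l (depth) outermost, then j, then i; here the loop on j is flipped with
    // the loop on l.
    template <std::size_t MR, std::size_t NR>
    void ugemm_sketch(std::size_t kc, const double *A, const double *B, double *AB)
    {
        for (std::size_t j = 0; j < NR; ++j)
            for (std::size_t l = 0; l < kc; ++l)
                for (std::size_t i = 0; i < MR; ++i)
                    AB[i + j * MR] += A[i + l * MR] * B[j + l * NR];
    }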
--------------------------------------------
On Sat, 23/1/16, Michael Lehn <michael.lehn_at_[hidden]> wrote:
Subject: Re: [ublas] Matrix multiplication performance
To: "ublas mailing list" <ublas_at_[hidden]>
Cc: "palik imre" <imre_palik_at_[hidden]>
Date: Saturday, 23 January, 2016, 1:41
I re-ran all the benchmarks with "-Ofast -mavx" (as the hardware on which I generate the
benchmarks does not have AVX2). So now the original ublas::axpy_prod() is actually
doing way better ...
On 22 Jan 2016, at 21:27, Michael Lehn <michael.lehn_at_[hidden]> wrote:
> Wow Imre!
>
> Ok, that is actually a significant difference :-)
>
> I have just added a new version to my site. Unfortunately the computer I used for creating
> the page does not have posix_memalign(). So I had to hack my own function for that. But
> I think for a proof of concept it will do.
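>
> One way to do that without posix_memalign() is the usual trick: over-allocate, round the
> returned pointer up to the requested alignment, and stash the original pointer in front of
> the block so it can be freed again.  A rough sketch of that idea (not necessarily the exact
> function from the page):
>
>     #include <cstdlib>
>     #include <cstdint>
>
>     // Allocate 'size' bytes aligned to 'alignment' (a power of two).
>     void *aligned_malloc(std::size_t alignment, std::size_t size)
>     {
>         void *raw = std::malloc(size + alignment + sizeof(void *));
>         if (!raw) return nullptr;
>         std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
>         p = (p + alignment - 1) & ~(std::uintptr_t(alignment) - 1);
>         reinterpret_cast<void **>(p)[-1] = raw;      // remember the real block
>         return reinterpret_cast<void *>(p);
>     }
>
>     void aligned_free(void *ptr)
>     {
>         if (ptr) std::free(reinterpret_cast<void **>(ptr)[-1]);
>     }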
>
> But most of all I also added micro-kernels that are (fairly) optimised for AVX and FMA.
> The micro-kernel for AVX requires MR=4, NR=8. For FMA it requires MR=4, NR=16. Otherwise
> the reference implementation gets selected. Default values for all parameters can now be
> overwritten when compiling, e.g.
>
>     g++ -O3 -Wall -std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc
>
> The optimised micro-kernels are only included when compiled with -DHAVE_AVX or -DHAVE_FMA.
>
> I put all this stuff here
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html
>
> also the tar-ball
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz
>
> contains all required files:
>
>     session2/avx.hpp
>     session2/fma.hpp
>     session2/gemm.hpp
>     session2/matprod.cc
>
> Of course I would be interested in how all this performs on other platforms.
>
>
> Cheers,
>
> Michael
>
>
> On 22 Jan 2016, at 16:20, palik imre <imre_palik_at_[hidden]> wrote:
>
>> Hi All,
>>
>> In the meantime I enabled avx2 ...
>>
>> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6 GFLOPS
>> So it is at ~25% of that (the tuned run below peaks at about 17.2 GFLOPS)
>>
>> Compiler is gcc 4.8.3
>>
>> vanilla run:
>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
>> $ ./matprod
>> #    m     n     k       uBLAS: t1    MFLOPS    Blocked: t2     MFLOPS      Diff nrm1
>>    100   100   100     0.000350402   5707.73     0.00116104    1722.59              0
>>    200   200   200      0.00445094   3594.74     0.00819996    1951.23              0
>>    300   300   300       0.0138515   3898.49      0.0266515    2026.15    1.06987e-06
>>    400   400   400       0.0266447   4803.96      0.0613506    2086.37    4.01475e-06
>>    500   500   500       0.0424372   5891.06       0.119345    2094.77    8.99605e-06
>>    600   600   600       0.0724648   5961.51       0.203187    2126.12    1.63618e-05
>>    700   700   700        0.115464   5941.25       0.325834    2105.36    2.69547e-05
>>    800   800   800        0.173003   5918.98       0.480655    2130.43    4.09449e-05
>>    900   900   900        0.248077    5877.2       0.689972    2113.13    5.87376e-05
>>   1000  1000  1000         0.33781   5920.49       0.930591    2149.17    8.16264e-05
>>   1100  1100  1100          0.5149   5169.93        1.25507       2121    0.000108883
>>   1200  1200  1200        0.670628   5153.38        1.62732    2123.74    0.000141876
>>   1300  1300  1300        0.852692   5153.09        2.06708    2125.71    0.000180926
>>   1400  1400  1400         1.06695   5143.65        2.56183    2142.22    0.000225975
>>   1500  1500  1500          1.3874    4865.2        3.16532    2132.49    0.000278553
>>   1600  1600  1600         1.77623   4612.03         3.8137    2148.05    0.000338106
>>   1700  1700  1700          2.3773   4133.26        4.56665    2151.69    0.000404458
>>   1800  1800  1800         3.06381   3807.03        5.40317    2158.73     0.00048119
>>   1900  1900  1900          3.9039   3513.92        6.37295    2152.53    0.000564692
>>   2000  2000  2000         4.79166   3339.13        7.43399    2152.28    0.000659714
>>   2100  2100  2100         6.04946   3061.76        8.62429    2147.65    0.000762223
>>   2200  2200  2200         7.39085    2881.4        9.86237    2159.32    0.000875624
>>   2300  2300  2300         9.00453   2702.42        11.2513    2162.78     0.00100184
>>   2400  2400  2400         10.3952   2659.68        12.7491    2168.62     0.00113563
>>   2500  2500  2500         12.2283   2555.55        14.4615    2160.92     0.00128336
>>   2600  2600  2600         13.8912   2530.51        16.1965    2170.34     0.00144304
>>   2700  2700  2700          15.391   2557.72        18.1998    2162.99     0.00161411
>>   2800  2800  2800         17.5673   2499.19        20.2171    2171.63     0.00180035
>>   2900  2900  2900         19.4621   2506.31        22.5482    2163.28     0.00199765
>>   3000  3000  3000         21.4506   2517.42        24.9477    2164.53     0.00221028
>>   3100  3100  3100           23.71   2512.95        27.5144    2165.48     0.00243877
>>   3200  3200  3200         25.9051   2529.85        30.2816    2164.22     0.00267766
>>   3300  3300  3300         28.1949   2549.18         33.176    2166.45     0.00293379
>>   3400  3400  3400         30.7235   2558.56        36.0156    2182.61      0.0032087
>>   3500  3500  3500         34.0419   2518.95        39.3929    2176.79     0.00349827
>>   3600  3600  3600         37.0562   2518.12        42.7524    2182.62     0.00380447
>>   3700  3700  3700         39.7885   2546.11        46.4748    2179.81     0.00412621
>>   3800  3800  3800         43.6607   2513.56        50.2119    2185.62      0.0044694
>>   3900  3900  3900         46.5104   2550.78        54.4822    2177.56     0.00482355
>>   4000  4000  4000         50.6098   2529.15        58.7686    2178.03     0.00520289
>>
>> tuned run:
>>
>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
>> $ ./matprod2
>> #    m     n     k       uBLAS: t1    MFLOPS    Blocked: t2     MFLOPS      Diff nrm1
>>    100   100   100     0.000351671   5687.13    0.000316612    6316.88              0
>>    200   200   200      0.00419531   3813.78     0.00159044    10060.1              0
>>    300   300   300       0.0141153   3825.62     0.00421113    12823.2    1.07645e-06
>>    400   400   400       0.0291599   4389.59     0.00858138      14916    4.00614e-06
>>    500   500   500       0.0483492   5170.72      0.0166519    15013.3    8.96808e-06
>>    600   600   600       0.0725783   5952.19      0.0279634    15448.7    1.63386e-05
>>    700   700   700        0.113891   6023.29       0.043077      15925    2.69191e-05
>>    800   800   800        0.171416   5973.79      0.0627796      16311    4.09782e-05
>>    900   900   900        0.243677   5983.32      0.0922766    15800.3    5.88092e-05
>>   1000  1000  1000        0.335158   5967.33       0.123339    16215.5    8.15988e-05
>>   1100  1100  1100        0.515776   5161.15        0.16578    16057.5    0.000108991
>>   1200  1200  1200        0.662706   5214.98       0.205989    16777.6    0.000141824
>>   1300  1300  1300        0.845952   5194.15        0.27637      15899     0.00018111
>>   1400  1400  1400         1.06712   5142.82       0.332118    16524.2    0.000225958
>>   1500  1500  1500         1.38147   4886.11       0.409224    16494.6    0.000278265
>>   1600  1600  1600         1.72238   4756.21       0.492314    16639.8    0.000338095
>>   1700  1700  1700         2.38508   4119.77       0.603566    16279.9    0.000404362
>>   1800  1800  1800         3.12034   3738.05       0.717409    16258.5    0.000481575
>>   1900  1900  1900         3.93668   3484.66       0.824933    16629.2    0.000564727
>>   2000  2000  2000         4.76038   3361.07       0.941643    16991.6     0.00065862
>>   2100  2100  2100         5.90627   3135.99        1.12226    16504.2    0.000762307
>>   2200  2200  2200         7.26419   2931.64        1.28213    16609.9    0.000876699
>>   2300  2300  2300         8.88171   2739.79        1.45247    16753.5     0.00100222
>>   2400  2400  2400         10.4956   2634.26        1.62705    16992.7     0.00113566
>>   2500  2500  2500          11.913   2623.18        1.87499    16666.7     0.00128371
>>   2600  2600  2600         13.7057   2564.77         2.1156    16615.6     0.00144259
>>   2700  2700  2700         15.5959   2524.13        2.33957    16826.1     0.00161501
>>   2800  2800  2800         17.1121   2565.67        2.57445    17053.8     0.00179901
>>   2900  2900  2900         19.4167   2512.16        2.92445    16679.4     0.00199764
>>   3000  3000  3000         21.3239   2532.37        3.18891    16933.7     0.00220999
>>   3100  3100  3100         23.5049   2534.88         3.5305    16876.4     0.00243845
>>   3200  3200  3200         25.7362   2546.45        3.81708    17169.1     0.00267581
>>   3300  3300  3300         28.4467   2526.62        4.25869      16877     0.00293513
>>   3400  3400  3400         30.4607   2580.63        4.67999    16796.6     0.00320688
>>   3500  3500  3500         33.7737   2538.96        5.04289    17004.1     0.00349667
>>   3600  3600  3600         36.9633   2524.45          5.414    17235.3     0.00380237
>>   3700  3700  3700         39.5153   2563.71        6.04875    16748.2     0.00412583
>>   3800  3800  3800         42.9412   2555.68        6.48985    16910.1     0.00446785
>>   3900  3900  3900         46.5282   2549.81        7.05844      16808     0.00482701
>>   4000  4000  4000         50.2218   2548.69        7.42442    17240.4     0.00520272
>>
>> As the generated machine code is completely different, I guess gcc notices the aligned alloc,
>> and uses the alignment information for optimisation.
>>
>> Cheers,
>>
>> Imre
>>
>>
>> On Friday, 22 January 2016, 15:09, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>>
>>
>> Hi Imre,
>>
>> thanks for running the benchmarks. Of course you are right that using aligned memory
>> for the buffers improves performance. I also did not really put any effort in optimising
>> the parameters MC, NC, KC, MR and NR. I will compare different variants and report
>> them on the website
>>
>>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>>
>> I modified my benchmark program such that it also computes the FLOPS as
>>
>>     FLOPS = 2*m*n*k/time_elapsed
>>
>> See
>>
>>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
>>
>> Could you re-run your benchmarks and post the different MFLOPS you get? That is
>> important for actually tuning things.
>> On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS
>> instead of 25.6 GFLOPS). So a speedup of 2.5 would be impressive but still far from
>> peak performance.
>>
>> Cheers,
>>
>> Michael
>>
>>
>> On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:
>>
>>> Sorry for posting twice more or less the same thing. I got confused with javascript interfaces.
>>>
>>> It seems I also forgot to enable avx for my last measurements. With that + my blocking and
>>> alignment changes, performance according to my tests is something like 250% higher than
>>> running Michael's original code (with avx).
>>>
>>> Cheers,
>>>
>>> Imre
>>>
>>>
>>> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
>>>
>>>
>>> Hi Michael,
>>>
>>> your blocksizes are far from optimal. MR & NR should be multiples of the L1 cache
>>> line size (i.e. 16 for double on Intel). Also, the blocks should be allocated aligned
>>> to L1 cache lines (e.g., via posix_memalign()).
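>>>
>>> For illustration, allocating the packing buffers cache-line aligned can be as simple as
>>> this (a sketch only, not my actual change; the 64-byte line size is assumed):
>>>
>>>     #include <cstdlib>
>>>
>>>     // Allocate 'count' doubles on a 64-byte (L1 cache line) boundary.
>>>     double *alloc_block(std::size_t count)
>>>     {
>>>         void *p = nullptr;
>>>         if (posix_memalign(&p, 64, count * sizeof(double)) != 0)
>>>             return nullptr;                       // allocation failed
>>>         return static_cast<double *>(p);
>>>     }
>>>
>>>     // blocks obtained this way are released with std::free()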
>>>
>>> This alone brought something like 50% speedup for my square matrix test.
>>>
>>> I will have a look at the other parameters + the whole thing via perf during the weekend.
>>>
>>> Cheers,
>>>
>>> Imre
>>>
>>>
>>>
>>> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]" <ublas-request_at_[hidden]> wrote:
>>>
>>>
>>>
>>> Subject: Re: [ublas] Matrix multiplication performance
>>> Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B_at_[hidden]>
>>> Content-Type: text/plain; charset="windows-1252"
>>>
>>> Hi Nasos,
>>>
>>> first of all I don't want to take wrong credit and want to point out that this is not my
>>> algorithm. It is based on
>>>
>>>     http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>>
>>>     https://github.com/flame/blis
>>>
>>> For a few cores (4-8) it can easily be made multithreaded. For many-cores like Intel Xeon
>>> Phi this is a bit more sophisticated but still not too hard. The demo I posted does not use
>>> micro-kernels that exploit SSE, AVX or FMA instructions. With those the matrix product is
>>> on par with Intel MKL. Just like BLIS. For my platforms I wrote my own micro-kernels but
>>> the interface of function ugemm is compatible to BLIS.
>>>
>>> Maybe you could help me to integrate your code in the benchmark example I posted above.
>>>
>>> About Blaze: Do they have their own implementation of a matrix-matrix product? It seems
>>> to require a tuned BLAS implementation ("Otherwise you get only poor performance") for
>>> the matrix-matrix product. IMHO they only have tuned the "easy" stuff like BLAS Level 1
>>> and Level 2. In that case it makes more sense to compare the performance with the actual
>>> underlying GEMM implementation. But if I am wrong, let me know.
>>>
>>> About the block size: In my experience you get better performance if you choose them
>>> dynamically at runtime depending on the problem size. Depending on the architecture you
>>> can just specify ranges like 256 - 384 for the blocking factor MC. In my code it then also
>>> needs to satisfy the restriction that it can be divided by the factor MR. I know that doing
>>> things at compile time is a C++ fetish. But the runtime overhead is negligible and having
>>> blocks of similar sizes easily pays off.
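>>>
>>> For instance, picking MC at runtime could look roughly like this (only a sketch with
>>> made-up names, not the code from gemm.hpp):
>>>
>>>     #include <cstddef>
>>>
>>>     // Choose a blocking factor MC in roughly [256, 384] that is a multiple of MR
>>>     // and splits the m dimension into panels of (nearly) equal size.
>>>     std::size_t chooseMC(std::size_t m, std::size_t MR,
>>>                          std::size_t mcMin = 256, std::size_t mcMax = 384)
>>>     {
>>>         if (m <= mcMax)                               // a single panel is enough
>>>             return ((m + MR - 1) / MR) * MR;          // round up to a multiple of MR
>>>         std::size_t panels = (m + mcMax - 1) / mcMax; // fewest panels that fit
>>>         std::size_t mc     = (m + panels - 1) / panels;
>>>         mc = ((mc + MR - 1) / MR) * MR;               // make it a multiple of MR
>>>         return mc < mcMin ? mcMin : mc;
>>>     }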
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
_______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: michael.lehn_at_[hidden]
>