Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-24 02:50:30
Hi Michael,
I had a look at your AVX micro-kernel on my old AMD box. Congratulations, it is twice as fast as what I managed to get out of my optimised C kernel.
I wonder if I can catch up with that using newer gcc + SIMD arrays or intrinsics.
Cheers,
Imre
--------------------------------------------
On Fri, 22/1/16, Michael Lehn <michael.lehn_at_[hidden]> wrote:
Subject: Re: [ublas] Matrix multiplication performance
To: "palik imre" <imre_palik_at_[hidden]>
Cc: "ublas mailing list" <ublas_at_[hidden]>
Date: Friday, 22 January, 2016, 21:27
Wow Imre!
Ok, that is actually a significant difference :-)
I have just added a new version to my site. Unfortunately the computer I used for creating the page does not have posix_memalign(). So I had to hack my own function for that. But I think for a proof of concept it will do.
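For anyone curious, a hand-rolled replacement for posix_memalign() typically over-allocates and adjusts the pointer. The sketch below only illustrates that approach, under the assumption that the alignment is a power of two; it is not the function from the tar-ball:

    #include <cstdlib>
    #include <cstdint>

    // Stand-in for posix_memalign(): over-allocate, round the pointer up to the
    // requested alignment (must be a power of two), and stash the original pointer
    // just in front of the aligned block so it can be freed again.
    void *
    malloc_aligned(std::size_t alignment, std::size_t size)
    {
        void *raw = std::malloc(size + alignment + sizeof(void *));
        if (!raw) {
            return nullptr;
        }
        std::uintptr_t start   = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
        std::uintptr_t aligned = (start + alignment - 1) & ~static_cast<std::uintptr_t>(alignment - 1);
        reinterpret_cast<void **>(aligned)[-1] = raw;
        return reinterpret_cast<void *>(aligned);
    }

    void
    free_aligned(void *ptr)
    {
        if (ptr) {
            std::free(reinterpret_cast<void **>(ptr)[-1]);
        }
    }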
But most of all I also added micro-kernels that are (fairly) optimised for AVX and FMA. The AVX micro-kernel requires MR=4, NR=8; the FMA micro-kernel requires MR=4, NR=16. Otherwise the reference implementation gets selected.
Default values for all parameters can now be overridden when compiling, e.g.

    g++ -O3 -Wall -std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc

The optimised micro-kernels are only included when compiled with -DHAVE_AVX or -DHAVE_FMA.
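The usual mechanism behind such -D overridable defaults is an #ifndef guard per parameter. A minimal sketch (only BS_D_MC appears on the command line above; BS_D_NR and the default values here are made up for illustration):

    // Compile-time defaults that a -D flag can override, e.g. -DBS_D_MC=512.
    #ifndef BS_D_MC
    #define BS_D_MC 256        // cache-blocking factor MC for double
    #endif

    #ifndef BS_D_NR
    #define BS_D_NR 8          // register-blocking factor NR, must match the chosen micro-kernel
    #endif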
I put all this stuff here

    http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html

also the tar-ball

    http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz

contains all required files:

    session2/avx.hpp
    session2/fma.hpp
    session2/gemm.hpp
    session2/matprod.cc

Of course I would be interested in how all this performs on other platforms.
Cheers,
Michael
On 22 Jan 2016, at 16:20, palik imre <imre_palik_at_[hidden]> wrote:
> Hi All,
>
> In the meantime I enabled avx2 ...
>
> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6 GFLOPS
> So it is at ~25%
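(One illustrative way to read that formula, with assumed numbers rather than the actual specs of Imre's box: AVX2 processes 4 doubles per vector, so with 2 FMA-capable ALUs per core and 2 flops per FMA, a clock of about 4.35 GHz gives 4.35 x 4 x 2 x 2 ≈ 69.6 GFLOPS per core. The ~17 GFLOPS of the tuned run below is then roughly 25% of that peak.)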
>
> Compiler is gcc 4.8.3
>
> vanilla run:
> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
> $ ./matprod
> #     m     n     k   uBLAS: t1      MFLOPS   Blocked: t2    MFLOPS    Diff nrm1
>     100   100   100   0.000350402    5707.73   0.00116104    1722.59   0
>     200   200   200   0.00445094     3594.74   0.00819996    1951.23   0
>     300   300   300   0.0138515      3898.49   0.0266515     2026.15   1.06987e-06
>     400   400   400   0.0266447      4803.96   0.0613506     2086.37   4.01475e-06
>     500   500   500   0.0424372      5891.06   0.119345      2094.77   8.99605e-06
>     600   600   600   0.0724648      5961.51   0.203187      2126.12   1.63618e-05
>     700   700   700   0.115464       5941.25   0.325834      2105.36   2.69547e-05
>     800   800   800   0.173003       5918.98   0.480655      2130.43   4.09449e-05
>     900   900   900   0.248077       5877.2    0.689972      2113.13   5.87376e-05
>    1000  1000  1000   0.33781        5920.49   0.930591      2149.17   8.16264e-05
>    1100  1100  1100   0.5149         5169.93   1.25507       2121      0.000108883
>    1200  1200  1200   0.670628       5153.38   1.62732       2123.74   0.000141876
>    1300  1300  1300   0.852692       5153.09   2.06708       2125.71   0.000180926
>    1400  1400  1400   1.06695        5143.65   2.56183       2142.22   0.000225975
>    1500  1500  1500   1.3874         4865.2    3.16532       2132.49   0.000278553
>    1600  1600  1600   1.77623        4612.03   3.8137        2148.05   0.000338106
>    1700  1700  1700   2.3773         4133.26   4.56665       2151.69   0.000404458
>    1800  1800  1800   3.06381        3807.03   5.40317       2158.73   0.00048119
>    1900  1900  1900   3.9039         3513.92   6.37295       2152.53   0.000564692
>    2000  2000  2000   4.79166        3339.13   7.43399       2152.28   0.000659714
>    2100  2100  2100   6.04946        3061.76   8.62429       2147.65   0.000762223
>    2200  2200  2200   7.39085        2881.4    9.86237       2159.32   0.000875624
>    2300  2300  2300   9.00453        2702.42   11.2513       2162.78   0.00100184
>    2400  2400  2400   10.3952        2659.68   12.7491       2168.62   0.00113563
>    2500  2500  2500   12.2283        2555.55   14.4615       2160.92   0.00128336
>    2600  2600  2600   13.8912        2530.51   16.1965       2170.34   0.00144304
>    2700  2700  2700   15.391         2557.72   18.1998       2162.99   0.00161411
>    2800  2800  2800   17.5673        2499.19   20.2171       2171.63   0.00180035
>    2900  2900  2900   19.4621        2506.31   22.5482       2163.28   0.00199765
>    3000  3000  3000   21.4506        2517.42   24.9477       2164.53   0.00221028
>    3100  3100  3100   23.71          2512.95   27.5144       2165.48   0.00243877
>    3200  3200  3200   25.9051        2529.85   30.2816       2164.22   0.00267766
>    3300  3300  3300   28.1949        2549.18   33.176        2166.45   0.00293379
>    3400  3400  3400   30.7235        2558.56   36.0156       2182.61   0.0032087
>    3500  3500  3500   34.0419        2518.95   39.3929       2176.79   0.00349827
>    3600  3600  3600   37.0562        2518.12   42.7524       2182.62   0.00380447
>    3700  3700  3700   39.7885        2546.11   46.4748       2179.81   0.00412621
>    3800  3800  3800   43.6607        2513.56   50.2119       2185.62   0.0044694
>    3900  3900  3900   46.5104        2550.78   54.4822       2177.56   0.00482355
>    4000  4000  4000   50.6098        2529.15   58.7686       2178.03   0.00520289
>
>
> tuned run:
>
> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
> $ ./matprod2
> #     m     n     k   uBLAS: t1      MFLOPS   Blocked: t2    MFLOPS    Diff nrm1
>     100   100   100   0.000351671    5687.13   0.000316612   6316.88   0
>     200   200   200   0.00419531     3813.78   0.00159044    10060.1   0
>     300   300   300   0.0141153      3825.62   0.00421113    12823.2   1.07645e-06
>     400   400   400   0.0291599      4389.59   0.00858138    14916     4.00614e-06
>     500   500   500   0.0483492      5170.72   0.0166519     15013.3   8.96808e-06
>     600   600   600   0.0725783      5952.19   0.0279634     15448.7   1.63386e-05
>     700   700   700   0.113891       6023.29   0.043077      15925     2.69191e-05
>     800   800   800   0.171416       5973.79   0.0627796     16311     4.09782e-05
>     900   900   900   0.243677       5983.32   0.0922766     15800.3   5.88092e-05
>    1000  1000  1000   0.335158       5967.33   0.123339      16215.5   8.15988e-05
>    1100  1100  1100   0.515776       5161.15   0.16578       16057.5   0.000108991
>    1200  1200  1200   0.662706       5214.98   0.205989      16777.6   0.000141824
>    1300  1300  1300   0.845952       5194.15   0.27637       15899     0.00018111
>    1400  1400  1400   1.06712        5142.82   0.332118      16524.2   0.000225958
>    1500  1500  1500   1.38147        4886.11   0.409224      16494.6   0.000278265
>    1600  1600  1600   1.72238        4756.21   0.492314      16639.8   0.000338095
>    1700  1700  1700   2.38508        4119.77   0.603566      16279.9   0.000404362
>    1800  1800  1800   3.12034        3738.05   0.717409      16258.5   0.000481575
>    1900  1900  1900   3.93668        3484.66   0.824933      16629.2   0.000564727
>    2000  2000  2000   4.76038        3361.07   0.941643      16991.6   0.00065862
>    2100  2100  2100   5.90627        3135.99   1.12226       16504.2   0.000762307
>    2200  2200  2200   7.26419        2931.64   1.28213       16609.9   0.000876699
>    2300  2300  2300   8.88171        2739.79   1.45247       16753.5   0.00100222
>    2400  2400  2400   10.4956        2634.26   1.62705       16992.7   0.00113566
>    2500  2500  2500   11.913         2623.18   1.87499       16666.7   0.00128371
>    2600  2600  2600   13.7057        2564.77   2.1156        16615.6   0.00144259
>    2700  2700  2700   15.5959        2524.13   2.33957       16826.1   0.00161501
>    2800  2800  2800   17.1121        2565.67   2.57445       17053.8   0.00179901
>    2900  2900  2900   19.4167        2512.16   2.92445       16679.4   0.00199764
>    3000  3000  3000   21.3239        2532.37   3.18891       16933.7   0.00220999
>    3100  3100  3100   23.5049        2534.88   3.5305        16876.4   0.00243845
>    3200  3200  3200   25.7362        2546.45   3.81708       17169.1   0.00267581
>    3300  3300  3300   28.4467        2526.62   4.25869       16877     0.00293513
>    3400  3400  3400   30.4607        2580.63   4.67999       16796.6   0.00320688
>    3500  3500  3500   33.7737        2538.96   5.04289       17004.1   0.00349667
>    3600  3600  3600   36.9633        2524.45   5.414         17235.3   0.00380237
>    3700  3700  3700   39.5153        2563.71   6.04875       16748.2   0.00412583
>    3800  3800  3800   42.9412        2555.68   6.48985       16910.1   0.00446785
>    3900  3900  3900   46.5282        2549.81   7.05844       16808     0.00482701
>    4000  4000  4000   50.2218        2548.69   7.42442       17240.4   0.00520272
>
>
> As the generated machine code is completely different, I guess gcc notices the aligned alloc, and uses the alignment information for optimisation.
>
> Cheers,
>
> Imre
>
>
> On Friday, 22 January 2016, 15:09, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>
>
> Hi Imre,
>
> thanks for running the benchmarks. Of course you are right that using aligned memory for the buffers improves
> performance. I also did not really put any effort in optimising the parameters MC, NC, KC, MR and NR. I will
> compare different variants and report them on the website
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>
> I modified my benchmark program such that it also computes the FLOPS as
>
>     FLOPS = 2*m*n*k/time_elapsed
>
> See
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
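Not the code from matprod.cc, just a minimal sketch of timing a product call and deriving MFLOPS the same way (the helper name and its callable parameter are made up for illustration):

    #include <chrono>
    #include <cstddef>

    // Time an arbitrary product call and report MFLOPS as 2*m*n*k / elapsed seconds.
    template <typename Prod>
    double
    mflops(std::size_t m, std::size_t n, std::size_t k, Prod prod)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        prod();     // e.g. a lambda wrapping uBLAS axpy_prod or the blocked gemm
        auto t1 = std::chrono::high_resolution_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        return 2.0*m*n*k/secs/1e6;
    }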
>
> Could you re-run your benchmarks and post the different MFLOPS you get? That is important for actually tuning things.
> On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS instead of 25.6 GFLOPS). So
> a speedup of 2.5 would be impressive but still far from peak performance.
>
> Cheers,
>
> Michael
>
>
> On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:
>
>> Sorry for posting twice more or less the same thing. I got confused with javascript interfaces.
>>
>> It seems I also forgot to enable avx for my last measurements. With that + my blocking and alignment changes,
>> performance according to my tests is something like 250% higher than running Michael's original code (with avx).
>>
>> Cheers,
>>
>> Imre
>>
>>
>> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
>>
>>
>> Hi Michael,
>>
>> your blocksizes are far from optimal. MR & NR should be multiples of the L1 cache line size (i.e. 16 for double on
>> Intel). Also, the blocks should be allocated aligned to L1 cache lines (e.g., via posix_memalign()).
>>
>> This alone brought something like 50% speedup for my square matrix test.
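As an illustration of that suggestion, allocating a packing buffer aligned to a 64-byte cache line with posix_memalign() could look like the following sketch (the helper name is made up, not something from gemm.hpp):

    #include <cstdlib>
    #include <cstddef>

    // Allocate an MC x KC packing buffer of doubles aligned to a 64-byte cache line.
    // posix_memalign() is POSIX; the returned memory is released with free().
    double *
    alloc_packing_buffer(std::size_t MC, std::size_t KC)
    {
        void *p = nullptr;
        if (posix_memalign(&p, 64, MC*KC*sizeof(double)) != 0) {
            return nullptr;
        }
        return static_cast<double *>(p);
    }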
>>
>> I will have a look at the other parameters + the whole thing via perf during the weekend.
>>
>> Cheers,
>>
>> Imre
>>
>>
>>
>> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]" <ublas-request_at_[hidden]> wrote:
>>
>>
>> Subject: Re: [ublas] Matrix multiplication performance
>> Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B_at_[hidden]>
>> Content-Type: text/plain; charset="windows-1252"
>>
>> Hi Nasos,
>>
>> first of all I don't want to take wrong credits and want to point out that this is not my algorithm. It is based on
>>
>>     http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>
>>     https://github.com/flame/blis
>>
>> For a few cores (4-8) it can easily be made multithreaded. For many-cores like Intel Xeon Phi this is a bit more
>> sophisticated but still not too hard. The demo I posted does not use micro-kernels that exploit SSE, AVX or
>> FMA instructions. With those the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote
>> my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
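To make the shape of that interface concrete: a BLIS-style micro-kernel updates one MR x NR block of C as C <- beta*C + alpha*A*B, with the A and B panels already packed contiguously. The following reference version is only a sketch of what such a ugemm can look like, not Michael's actual code:

    // Reference micro-kernel: A holds kc packed columns of length MR, B holds
    // kc packed rows of length NR; C is addressed through its row/column strides.
    template <int MR, int NR, typename Index, typename T>
    void
    ugemm(Index kc, T alpha, const T *A, const T *B, T beta, T *C,
          Index incRowC, Index incColC)
    {
        T P[MR*NR] = {};                          // MR x NR accumulator ("registers")

        for (Index l=0; l<kc; ++l) {              // one rank-1 update per packed column/row
            for (Index i=0; i<MR; ++i) {
                for (Index j=0; j<NR; ++j) {
                    P[i*NR+j] += A[l*MR+i]*B[l*NR+j];
                }
            }
        }
        for (Index i=0; i<MR; ++i) {              // scale C and add the accumulated block
            for (Index j=0; j<NR; ++j) {
                C[i*incRowC+j*incColC] = beta*C[i*incRowC+j*incColC] + alpha*P[i*NR+j];
            }
        }
    }

The optimised AVX/FMA kernels replace these loops with vector intrinsics for fixed MR/NR, which is where the MR=4, NR=8 (or NR=16) restriction mentioned earlier comes from.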
>>
>> Maybe you could help me to integrate your code in the benchmark example I posted above.
>>
>> About Blaze: Do they have their own implementation of a matrix-matrix product? It seems to require a
>> tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product.
>> IMHO they only have tuned the "easy" stuff like BLAS Level 1 and Level 2. In that case it makes more
>> sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong,
>> let me know.
>>
>> About the block size: In my experience you get better performance if you choose them dynamically at runtime
>> depending on the problem size. Depending on the architecture you can just specify ranges like 256 - 384 for
>> blocking factor MC. In my code it then also needs to satisfy the restriction that it can be divided by factor MR.
>> I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible and having
>> blocks of similar sizes easily pays off.
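A minimal sketch of that idea (the function name and the default maximum are made up for illustration): cut the m dimension into slabs of similar size, no larger than an architecture-dependent maximum, and round each up to a multiple of MR.

    #include <cstddef>

    // Choose the cache-blocking factor MC at runtime so the m dimension splits
    // into slabs of similar size, each at most MC_max and divisible by MR.
    std::size_t
    choose_MC(std::size_t m, std::size_t MR, std::size_t MC_max = 384)
    {
        std::size_t numBlocks = (m + MC_max - 1)/MC_max;        // slabs needed if each had size MC_max
        std::size_t MC        = (m + numBlocks - 1)/numBlocks;  // equalise the slab sizes
        return ((MC + MR - 1)/MR)*MR;                           // keep MC divisible by MR
    }

Called once per product, this costs a handful of integer operations, in line with the negligible runtime overhead mentioned above.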
>>
>> Cheers,
>>
>> Michael
>>
*************************************
>>
>>
>>
>>
>>
>> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden]
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to: michael.lehn_at_[hidden]
>
>
>