
Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-23 07:47:12


> I have yet to check the machine code, but according to my understanding, your gemm approach is optimised for fmadd.

Only the micro-kernel enabled with -DHAVE_FMA uses fused multiply-add instructions. It should work if
cat /proc/cpuinfo lists the “fma” flag. Otherwise the micro-kernel enabled with -DHAVE_AVX should work.
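
(Side note: the same check can also be done programmatically. A minimal sketch using GCC's x86 CPU-feature builtins; this is only illustrative and not part of the posted sources:)

#include <cstdio>

int main()
{
    // Programmatic equivalent of grepping /proc/cpuinfo for the flags
    // that decide between -DHAVE_FMA, -DHAVE_AVX and the reference kernel.
    __builtin_cpu_init();
    std::printf("avx: %s\n", __builtin_cpu_supports("avx") ? "yes" : "no");
    std::printf("fma: %s\n", __builtin_cpu_supports("fma") ? "yes" : "no");
}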

>
> On my old AMD box, setting MR to 4, NR to 16, and swapping the loop over l with the loop over j in ugemm (using cache-aligned blocks) caused a more than 2-times speedup compared to the vanilla MR = 16, NR = 16 case:

Ok, that can be. I never really went into optimising the plain C++ micro-kernel. I just used it as
a reference implementation. But I guess it is not easy to find constants that are good in general. In a real-world
library it might be good to have some config-benchmarks that set these empirically, similar to ATLAS.
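
To sketch what I mean: the blocking constants could simply have preprocessor-overridable defaults, so a config benchmark only needs to emit -D flags. Only BS_D_MC actually appears in the example later in the thread; the other macro names and the default values here are made up for illustration:

#include <cstddef>
#include <cstdio>

// Defaults that a config benchmark could overwrite, e.g. -DBS_D_MC=512
#ifndef BS_D_MC
#define BS_D_MC 256     // rows of the packed A-block
#endif
#ifndef BS_D_KC
#define BS_D_KC 256     // depth of the packed blocks
#endif
#ifndef BS_D_MR
#define BS_D_MR 4       // micro-kernel rows
#endif
#ifndef BS_D_NR
#define BS_D_NR 8       // micro-kernel columns
#endif

constexpr std::size_t MC = BS_D_MC;
constexpr std::size_t KC = BS_D_KC;
constexpr std::size_t MR = BS_D_MR;
constexpr std::size_t NR = BS_D_NR;

static_assert(MC % MR == 0, "MC must be divisible by MR");

int main()
{
    std::printf("MC=%zu KC=%zu MR=%zu NR=%zu\n", MC, KC, MR, NR);
}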

Also, the optimality of the assembly micro-kernels is tricky. E.g. I have two older machines that only have
SSE (an Intel Core Duo and a Xeon). Looking only at the SSE3 entry in /proc/cpuinfo would suggest that
the same micro-kernel should be used. But they actually require different asm micro-kernels that realise
different algorithms: one works with register broadcasts, the other with register permutations. Choosing
the “wrong” micro-kernel gives terrible results; choosing the “right” one gets close to MKL.
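
In principle the micro-kernel could also be picked at runtime instead of at compile time. A rough sketch below; the kernel names are placeholders and the selection policy is deliberately naive (as said above, the right choice really depends on the concrete micro-architecture, not just on the CPU flags):

#include <cstdio>

// Placeholder stand-ins for the two SSE micro-kernel variants.
void ugemm_sse_broadcast() { std::puts("SSE kernel using register broadcasts"); }
void ugemm_sse_permute()   { std::puts("SSE kernel using register permutations"); }

using UKernel = void (*)();

UKernel selectKernel()
{
    __builtin_cpu_init();
    // Example policy only (GCC-specific builtins): distinguish by vendor
    // and ISA level; a real dispatcher would key on the micro-architecture.
    if (__builtin_cpu_is("intel") && !__builtin_cpu_supports("avx")) {
        return ugemm_sse_broadcast;
    }
    return ugemm_sse_permute;
}

int main()
{
    selectKernel()();
}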

>
>
> $ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod2 matprod2.cc
> $ ./matprod2
> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
> 100 100 100 0.0581846 34.3734 0.00188958 1058.44 0
> 200 200 200 0.414865 38.5667 0.0123334 1297.29 0
> 300 300 300 1.37691 39.2183 0.0387481 1393.62 1.0845e-06
> 400 400 400 3.26061 39.2565 0.0883939 1448.06 4.04763e-06
> 500 500 500 6.36875 39.2542 0.1813 1378.93 9.09805e-06
> 600 600 600 10.8398 39.853 0.309016 1397.98 1.65132e-05
> 700 700 700 17.2165 39.8454 0.484826 1414.94 2.72393e-05
> 800 800 800 25.6953 39.8516 0.712666 1436.86 4.1317e-05
> 900 900 900 36.655 39.7763 1.03119 1413.9 5.92538e-05
> 1000 1000 1000 50.6028 39.5235 1.39929 1429.3 8.21395e-05
> $ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod3 matprod3.cc
> $ ./matprod3
> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
> 100 100 100 0.0528992 37.8078 0.000628359 3182.89 0
> 200 200 200 0.414929 38.5609 0.00468125 3417.89 0
> 300 300 300 1.37729 39.2073 0.014938 3614.94 1.08932e-06
> 400 400 400 3.26166 39.2438 0.0340622 3757.83 4.06563e-06
> 500 500 500 6.37056 39.243 0.0670238 3730.02 9.1488e-06
> 600 600 600 10.8375 39.8615 0.112301 3846.8 1.65072e-05
> 700 700 700 17.2216 39.8336 0.181384 3782.02 2.71309e-05
> 800 800 800 25.7673 39.7402 0.280321 3652.95 4.14585e-05
> 900 900 900 36.6848 39.7439 0.386776 3769.63 5.91959e-05
> 1000 1000 1000 50.5229 39.586 0.525621 3805.03 8.21298e-05
>
>
>
> --------------------------------------------
> On Sat, 23/1/16, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>
> Subject: Re: [ublas] Matrix multiplication performance
> To: "ublas mailing list" <ublas_at_[hidden]>
> Cc: "palik imre" <imre_palik_at_[hidden]>
> Date: Saturday, 23 January, 2016, 1:41
>
> I re-ran all the benchmarks with “-Ofast -mavx” (as the hardware on which
> I generate the benchmarks does not have AVX2). So now the original
> ublas::axpy_prod() is actually doing way better ...
>
>
> On 22 Jan 2016, at 21:27, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>
>> Wow Imre!
>>
>> Ok, that is actually a significant difference :-)
>>
>> I have just added a new version to my site. Unfortunately the computer I used for creating
>> the page does not have posix_memalign(). So I had to hack my own function for that. But
>> I think for a proof of concept it will do.
>>
>> But most of all I also added micro-kernels that are (fairly) optimised for AVX and FMA. The
>> micro-kernel for AVX requires MR=4, NR=8. For FMA it requires MR=4, NR=16. Otherwise
>> the reference implementation gets selected. All parameter default values can now be
>> overwritten when compiling, e.g.
>>
>> g++ -O3 -Wall -std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc
>>
>> The optimised micro-kernels are only included when compiled with -DHAVE_AVX or -DHAVE_FMA
>>
>> I put all this stuff here
>>
>> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html
>>
>> also the tar-ball
>>
>> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz
>>
>> contains all required files:
>>
>> session2/avx.hpp
>> session2/fma.hpp
>> session2/gemm.hpp
>> session2/matprod.cc
>>
>> Of course I would be interested in how all this performs on other platforms.
>>
>>
>> Cheers,
>>
>> Michael
>>
>>
>> On 22 Jan 2016, at 16:20, palik imre <imre_palik_at_[hidden]> wrote:
>>
>>> Hi All,
>>>
>>> In the meantime I enabled avx2 ...
>>>
>>> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6 GFLOPS
>>> So it is at ~25%
>>>
>>> Compiler is gcc 4.8.3
>>>
>>> vanilla run:
>>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
>>> $ ./matprod
>>>
>>> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
>>> 100 100 100 0.000350402 5707.73 0.00116104 1722.59 0
>>> 200 200 200 0.00445094 3594.74 0.00819996 1951.23 0
>>> 300 300 300 0.0138515 3898.49 0.0266515 2026.15 1.06987e-06
>>> 400 400 400 0.0266447 4803.96 0.0613506 2086.37 4.01475e-06
>>> 500 500 500 0.0424372 5891.06 0.119345 2094.77 8.99605e-06
>>> 600 600 600 0.0724648 5961.51 0.203187 2126.12 1.63618e-05
>>> 700 700 700 0.115464 5941.25 0.325834 2105.36 2.69547e-05
>>> 800 800 800 0.173003 5918.98 0.480655 2130.43 4.09449e-05
>>> 900 900 900 0.248077 5877.2 0.689972 2113.13 5.87376e-05
>>> 1000 1000 1000 0.33781 5920.49 0.930591 2149.17 8.16264e-05
>>> 1100 1100 1100 0.5149 5169.93 1.25507 2121 0.000108883
>>> 1200 1200 1200 0.670628 5153.38 1.62732 2123.74 0.000141876
>>> 1300 1300 1300 0.852692 5153.09 2.06708 2125.71 0.000180926
>>> 1400 1400 1400 1.06695 5143.65 2.56183 2142.22 0.000225975
>>> 1500 1500 1500 1.3874 4865.2 3.16532 2132.49 0.000278553
>>> 1600 1600 1600 1.77623 4612.03 3.8137 2148.05 0.000338106
>>> 1700 1700 1700 2.3773 4133.26 4.56665 2151.69 0.000404458
>>> 1800 1800 1800 3.06381 3807.03 5.40317 2158.73 0.00048119
>>> 1900 1900 1900 3.9039 3513.92 6.37295 2152.53 0.000564692
>>> 2000 2000 2000 4.79166 3339.13 7.43399 2152.28 0.000659714
>>> 2100 2100 2100 6.04946 3061.76 8.62429 2147.65 0.000762223
>>> 2200 2200 2200 7.39085 2881.4 9.86237 2159.32 0.000875624
>>> 2300 2300 2300 9.00453 2702.42 11.2513 2162.78 0.00100184
>>> 2400 2400 2400 10.3952 2659.68 12.7491 2168.62 0.00113563
>>> 2500 2500 2500 12.2283 2555.55 14.4615 2160.92 0.00128336
>>> 2600 2600 2600 13.8912 2530.51 16.1965 2170.34 0.00144304
>>> 2700 2700 2700 15.391 2557.72 18.1998 2162.99 0.00161411
>>> 2800 2800 2800 17.5673 2499.19 20.2171 2171.63 0.00180035
>>> 2900 2900 2900 19.4621 2506.31 22.5482 2163.28 0.00199765
>>> 3000 3000 3000 21.4506 2517.42 24.9477 2164.53 0.00221028
>>> 3100 3100 3100 23.71 2512.95 27.5144 2165.48 0.00243877
>>> 3200 3200 3200 25.9051 2529.85 30.2816 2164.22 0.00267766
>>> 3300 3300 3300 28.1949 2549.18 33.176 2166.45 0.00293379
>>> 3400 3400 3400 30.7235 2558.56 36.0156 2182.61 0.0032087
>>> 3500 3500 3500 34.0419 2518.95 39.3929 2176.79 0.00349827
>>> 3600 3600 3600 37.0562 2518.12 42.7524 2182.62 0.00380447
>>> 3700 3700 3700 39.7885 2546.11 46.4748 2179.81 0.00412621
>>> 3800 3800 3800 43.6607 2513.56 50.2119 2185.62 0.0044694
>>> 3900 3900 3900 46.5104 2550.78 54.4822 2177.56 0.00482355
>>> 4000 4000 4000 50.6098 2529.15 58.7686 2178.03 0.00520289
>>>
>>> tuned run:
>>>
>>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
>>> $ ./matprod2
>>> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
>>> 100 100 100 0.000351671 5687.13 0.000316612 6316.88 0
>>> 200 200 200 0.00419531 3813.78 0.00159044 10060.1 0
>>> 300 300 300 0.0141153 3825.62 0.00421113 12823.2 1.07645e-06
>>> 400 400 400 0.0291599 4389.59 0.00858138 14916 4.00614e-06
>>> 500 500 500 0.0483492 5170.72 0.0166519 15013.3 8.96808e-06
>>> 600 600 600 0.0725783 5952.19 0.0279634 15448.7 1.63386e-05
>>> 700 700 700 0.113891 6023.29 0.043077 15925 2.69191e-05
>>> 800 800 800 0.171416 5973.79 0.0627796 16311 4.09782e-05
>>> 900 900 900 0.243677 5983.32 0.0922766 15800.3 5.88092e-05
>>> 1000 1000 1000 0.335158 5967.33 0.123339 16215.5 8.15988e-05
>>> 1100 1100 1100 0.515776 5161.15 0.16578 16057.5 0.000108991
>>> 1200 1200 1200 0.662706 5214.98 0.205989 16777.6 0.000141824
>>> 1300 1300 1300 0.845952 5194.15 0.27637 15899 0.00018111
>>> 1400 1400 1400 1.06712 5142.82 0.332118 16524.2 0.000225958
>>> 1500 1500 1500 1.38147 4886.11 0.409224 16494.6 0.000278265
>>> 1600 1600 1600 1.72238 4756.21 0.492314 16639.8 0.000338095
>>> 1700 1700 1700 2.38508 4119.77 0.603566 16279.9 0.000404362
>>> 1800 1800 1800 3.12034 3738.05 0.717409 16258.5 0.000481575
>>> 1900 1900 1900 3.93668 3484.66 0.824933 16629.2 0.000564727
>>> 2000 2000 2000 4.76038 3361.07 0.941643 16991.6 0.00065862
>>> 2100 2100 2100 5.90627 3135.99 1.12226 16504.2 0.000762307
>>> 2200 2200 2200 7.26419 2931.64 1.28213 16609.9 0.000876699
>>> 2300 2300 2300 8.88171 2739.79 1.45247 16753.5 0.00100222
>>> 2400 2400 2400 10.4956 2634.26 1.62705 16992.7 0.00113566
>>> 2500 2500 2500 11.913 2623.18 1.87499 16666.7 0.00128371
>>> 2600 2600 2600 13.7057 2564.77 2.1156 16615.6 0.00144259
>>> 2700 2700 2700 15.5959 2524.13 2.33957 16826.1 0.00161501
>>> 2800 2800 2800 17.1121 2565.67 2.57445 17053.8 0.00179901
>>> 2900 2900 2900 19.4167 2512.16 2.92445 16679.4 0.00199764
>>> 3000 3000 3000 21.3239 2532.37 3.18891 16933.7 0.00220999
>>> 3100 3100 3100 23.5049 2534.88 3.5305 16876.4 0.00243845
>>> 3200 3200 3200 25.7362 2546.45 3.81708 17169.1 0.00267581
>>> 3300 3300 3300 28.4467 2526.62 4.25869 16877 0.00293513
>>> 3400 3400 3400 30.4607 2580.63 4.67999 16796.6 0.00320688
>>> 3500 3500 3500 33.7737 2538.96 5.04289 17004.1 0.00349667
>>> 3600 3600 3600 36.9633 2524.45 5.414 17235.3 0.00380237
>>> 3700 3700 3700 39.5153 2563.71 6.04875 16748.2 0.00412583
>>> 3800 3800 3800 42.9412 2555.68 6.48985 16910.1 0.00446785
>>> 3900 3900 3900 46.5282 2549.81 7.05844 16808 0.00482701
>>> 4000 4000 4000 50.2218 2548.69 7.42442 17240.4 0.00520272
>>>
>>> As the generated machine code is completely different, I guess gcc notices the aligned alloc, and uses the alignment information for optimisation.
>>>
>>> Cheers,
>>>
>>> Imre
>>>
>>>
>>> On Friday, 22 January 2016, 15:09, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>>>
>>>
>>> Hi Imre,
>>>
>>> thanks for running the benchmarks. Of course you are right that using aligned memory for the buffers improves
>>> performance. I also did not really put any effort into optimising the parameters MC, NC, KC, MR and NR. I will
>>> compare different variants and report them on the website
>>>
>>> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>>>
>>> I modified my benchmark program such that it also computes the FLOPS as
>>>
>>> FLOPS = 2*m*n*k/time_elapsed
>>>
>>> See
>>>
>>> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
>>>
>>> Could you re-run your benchmarks and post the different MFLOPS you get? That is important for actually tuning things.
>>> On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS instead of 25.6 GFLOPS). So
>>> a speedup of 2.5 would be impressive but still far from peak performance.
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>>
>>> On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:
>>>
>>>> Sorry for posting more or less the same thing twice. I got confused with javascript interfaces.
>>>>
>>>> It seems I also forgot to enable avx for my last measurements. With that + my blocking and alignment changes, performance according to my tests is something like 250% higher than running Michael's original code (with avx).
>>>>
>>>> Cheers,
>>>>
>>>> Imre
>>>>
>>>>
>>>> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
>>>>
>>>>
>>>> Hi Michael,
>>>>
>>>> your blocksizes are far from optimal. MR & NR should be multiples of the L1 cache line size (i.e. 16 for double on Intel). Also, the blocks should be allocated aligned to L1 cache lines (e.g., via posix_memalign()).
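
(For illustration, such a cache-line-aligned allocation of a packing buffer could look roughly like this; a minimal sketch, not the code used for the measurements in this thread:)

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdlib.h>

// Allocate n doubles aligned to a 64-byte cache line via posix_memalign().
double *allocAligned(std::size_t n)
{
    void *p = nullptr;
    if (posix_memalign(&p, 64, n*sizeof(double)) != 0) {
        return nullptr;
    }
    return static_cast<double *>(p);
}

int main()
{
    double *A = allocAligned(384*384);   // e.g. one packed block
    std::printf("64-byte aligned: %s\n",
                reinterpret_cast<std::uintptr_t>(A) % 64 == 0 ? "yes" : "no");
    free(A);
}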
>>>>
>>>> This alone brought something like 50% speedup for my square matrix test.
>>>>
>>>> I will have a look at the other parameters + the whole thing via perf during the weekend.
>>>>
>>>> Cheers,
>>>>
>>>> Imre
>>>>
>>>>
>>>> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]" <ublas-request_at_[hidden]> wrote:
>>>>
>>>> Subject: Re: [ublas] Matrix multiplication performance
>>>> Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B_at_[hidden]>
>>>> Content-Type: text/plain; charset="windows-1252"
>>>>
>>>> Hi Nasos,
>>>>
>>>> first of all I don't want to take wrong credit and want to point out that this is not my algorithm. It is based on
>>>>
>>>> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>>>
>>>> https://github.com/flame/blis
>>>>
>>>> For a few cores (4-8) it can easily be made multithreaded. For many-cores like the Intel Xeon Phi this is a bit more
>>>> sophisticated but still not too hard. The demo I posted does not use micro-kernels that exploit SSE, AVX or FMA
>>>> instructions. With those the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote
>>>> my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
>>>>
>>>> Maybe you could help me to integrate your code into the benchmark example I posted above.
>>>>
>>>> About Blaze: Do they have their own implementation of a matrix-matrix product? It seems to require a
>>>> tuned BLAS implementation (“Otherwise you get only poor performance”) for the matrix-matrix product.
>>>> IMHO they only have tuned the “easy” stuff like BLAS Level 1 and Level 2. In that case it makes more
>>>> sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong,
>>>> let me know.
>>>>
>>>> About the block size: In my experience you get better performance if you choose them dynamically at runtime
>>>> depending on the problem size. Depending on the architecture you can just specify ranges like 256 - 384 for
>>>> the blocking factor MC. In my code it then also needs to satisfy the restriction that it can be divided by the factor MR.
>>>> I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible and having
>>>> blocks of similar sizes easily pays off.
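
(To illustrate that runtime choice: picking MC from such a range, rounded down to a multiple of MR, could look roughly like this. The range and the policy here are only an example:)

#include <algorithm>
#include <cstddef>
#include <cstdio>

// Pick the blocking factor MC from [mcMin, mcMax] so that MR divides it.
std::size_t chooseMC(std::size_t m, std::size_t MR,
                     std::size_t mcMin = 256, std::size_t mcMax = 384)
{
    std::size_t mc = std::min(std::max(m, mcMin), mcMax);
    return mc - mc % MR;    // round down to a multiple of MR
}

int main()
{
    std::printf("MC for m=1000, MR=4: %zu\n", chooseMC(1000, 4));
    std::printf("MC for m=300,  MR=6: %zu\n", chooseMC(300, 6));
}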
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> *************************************
>>>>
>>>>
>>>>
>>>>
>>>>
> _______________________________________________
>>>> ublas mailing list
>>>> ublas_at_[hidden]
>>>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>>>> Sent to: michael.lehn_at_[hidden]
>>>
>>>
>>>
>>
>>
> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden]
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to: michael.lehn_at_[hidden]
>>
>