
Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-22 19:41:40


I re-ran all the benchmarks with “-Ofast -mavx” (as the hardware on which I generate the
benchmarks does not have AVX2). So now the original ublas::axpy_prod() is actually
doing way better ...

On 22 Jan 2016, at 21:27, Michael Lehn <michael.lehn_at_[hidden]> wrote:

> Wow Imre!
>
> Ok, that is actually a significant difference :-)
>
> I have just added a new version to my site. Unfortunately the computer I used for creating
> the page does not have posix_memalign(). So I had to hack my own function for that. But
> I think for a proof of concept it will do.
>
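> The hack is more or less the usual trick, something along these lines (just a sketch here,
> not literally the code in the tar-ball): over-allocate, move the pointer up to the next
> aligned address, and stash the original pointer right in front of it so it can be freed again.
>
> // needs <cstdlib>, <cstdint>, <cstddef>
> void *
> malloc_(std::size_t alignment, std::size_t size)
> {
>     char *ptr = static_cast<char *>(std::malloc(size + alignment + sizeof(void *)));
>     if (!ptr) {
>         return nullptr;
>     }
>     char *aligned = ptr + sizeof(void *);
>     aligned += alignment - reinterpret_cast<std::uintptr_t>(aligned) % alignment;
>     reinterpret_cast<void **>(aligned)[-1] = ptr;   // remember what to free later
>     return aligned;
> }
>
> void
> free_(void *block)
> {
>     if (block) {
>         std::free(reinterpret_cast<void **>(block)[-1]);
>     }
> }
>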
> But most of all I also added micro-kernels that are (fairly) optimised for AVX and FMA. The
> micro-kernel for AVX requires MR=4, NR=8; for FMA it requires MR=4, NR=16. Otherwise
> the reference implementation gets selected. All parameter defaults can now be
> overridden when compiling, e.g.
>
> g++ -O3 -Wall -std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc
>
> The optimised micro-kernels are only included when compiled with -DHAVE_AVX or -DHAVE_FMA.
>
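> The mechanism behind the defaults is nothing fancy, roughly like this (a sketch only; the
> actual default values, and any macro name other than BS_D_MC, should be checked in gemm.hpp):
>
> // every blocking parameter has a default that a -D option on the command line overrides
> #ifndef BS_D_MC
> #define BS_D_MC 256        // made-up default for the sketch
> #endif
>
> // the optimised micro-kernels are only compiled in when the flag is set and MR/NR
> // have the values the kernel was written for; otherwise the reference kernel is used
> #if defined(HAVE_FMA)      // requires MR=4, NR=16
> //   ... FMA micro-kernel ...
> #elif defined(HAVE_AVX)    // requires MR=4, NR=8
> //   ... AVX micro-kernel ...
> #else
> //   ... reference micro-kernel ...
> #endif
>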
> I put all this stuff here
>
> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html
>
> also the tar-ball
>
> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz
>
> contains all required files:
>
> session2/avx.hpp
> session2/fma.hpp
> session2/gemm.hpp
> session2/matprod.cc
>
> Of course I would be interested in how all this performs on other platforms.
>
>
> Cheers,
>
> Michael
>
>
> On 22 Jan 2016, at 16:20, palik imre <imre_palik_at_[hidden]> wrote:
>
>> Hi All,
>>
>> In the meantime I enabled avx2 ...
>>
>> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6 GFLOPS
>> So the tuned version below runs at ~25% of that.
>>
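>> (E.g. the best result in the tuned run below is 17240.4 MFLOPS at m = n = k = 4000;
>> 17.24 / 69.6 ~ 0.25.)
>>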
>> Compiler is gcc 4.8.3
>>
>> vanilla run:
>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
>> $ ./matprod
>> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
>> 100 100 100 0.000350402 5707.73 0.00116104 1722.59 0
>> 200 200 200 0.00445094 3594.74 0.00819996 1951.23 0
>> 300 300 300 0.0138515 3898.49 0.0266515 2026.15 1.06987e-06
>> 400 400 400 0.0266447 4803.96 0.0613506 2086.37 4.01475e-06
>> 500 500 500 0.0424372 5891.06 0.119345 2094.77 8.99605e-06
>> 600 600 600 0.0724648 5961.51 0.203187 2126.12 1.63618e-05
>> 700 700 700 0.115464 5941.25 0.325834 2105.36 2.69547e-05
>> 800 800 800 0.173003 5918.98 0.480655 2130.43 4.09449e-05
>> 900 900 900 0.248077 5877.2 0.689972 2113.13 5.87376e-05
>> 1000 1000 1000 0.33781 5920.49 0.930591 2149.17 8.16264e-05
>> 1100 1100 1100 0.5149 5169.93 1.25507 2121 0.000108883
>> 1200 1200 1200 0.670628 5153.38 1.62732 2123.74 0.000141876
>> 1300 1300 1300 0.852692 5153.09 2.06708 2125.71 0.000180926
>> 1400 1400 1400 1.06695 5143.65 2.56183 2142.22 0.000225975
>> 1500 1500 1500 1.3874 4865.2 3.16532 2132.49 0.000278553
>> 1600 1600 1600 1.77623 4612.03 3.8137 2148.05 0.000338106
>> 1700 1700 1700 2.3773 4133.26 4.56665 2151.69 0.000404458
>> 1800 1800 1800 3.06381 3807.03 5.40317 2158.73 0.00048119
>> 1900 1900 1900 3.9039 3513.92 6.37295 2152.53 0.000564692
>> 2000 2000 2000 4.79166 3339.13 7.43399 2152.28 0.000659714
>> 2100 2100 2100 6.04946 3061.76 8.62429 2147.65 0.000762223
>> 2200 2200 2200 7.39085 2881.4 9.86237 2159.32 0.000875624
>> 2300 2300 2300 9.00453 2702.42 11.2513 2162.78 0.00100184
>> 2400 2400 2400 10.3952 2659.68 12.7491 2168.62 0.00113563
>> 2500 2500 2500 12.2283 2555.55 14.4615 2160.92 0.00128336
>> 2600 2600 2600 13.8912 2530.51 16.1965 2170.34 0.00144304
>> 2700 2700 2700 15.391 2557.72 18.1998 2162.99 0.00161411
>> 2800 2800 2800 17.5673 2499.19 20.2171 2171.63 0.00180035
>> 2900 2900 2900 19.4621 2506.31 22.5482 2163.28 0.00199765
>> 3000 3000 3000 21.4506 2517.42 24.9477 2164.53 0.00221028
>> 3100 3100 3100 23.71 2512.95 27.5144 2165.48 0.00243877
>> 3200 3200 3200 25.9051 2529.85 30.2816 2164.22 0.00267766
>> 3300 3300 3300 28.1949 2549.18 33.176 2166.45 0.00293379
>> 3400 3400 3400 30.7235 2558.56 36.0156 2182.61 0.0032087
>> 3500 3500 3500 34.0419 2518.95 39.3929 2176.79 0.00349827
>> 3600 3600 3600 37.0562 2518.12 42.7524 2182.62 0.00380447
>> 3700 3700 3700 39.7885 2546.11 46.4748 2179.81 0.00412621
>> 3800 3800 3800 43.6607 2513.56 50.2119 2185.62 0.0044694
>> 3900 3900 3900 46.5104 2550.78 54.4822 2177.56 0.00482355
>> 4000 4000 4000 50.6098 2529.15 58.7686 2178.03 0.00520289
>>
>> tuned run:
>>
>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
>> $ ./matprod2
>> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
>> 100 100 100 0.000351671 5687.13 0.000316612 6316.88 0
>> 200 200 200 0.00419531 3813.78 0.00159044 10060.1 0
>> 300 300 300 0.0141153 3825.62 0.00421113 12823.2 1.07645e-06
>> 400 400 400 0.0291599 4389.59 0.00858138 14916 4.00614e-06
>> 500 500 500 0.0483492 5170.72 0.0166519 15013.3 8.96808e-06
>> 600 600 600 0.0725783 5952.19 0.0279634 15448.7 1.63386e-05
>> 700 700 700 0.113891 6023.29 0.043077 15925 2.69191e-05
>> 800 800 800 0.171416 5973.79 0.0627796 16311 4.09782e-05
>> 900 900 900 0.243677 5983.32 0.0922766 15800.3 5.88092e-05
>> 1000 1000 1000 0.335158 5967.33 0.123339 16215.5 8.15988e-05
>> 1100 1100 1100 0.515776 5161.15 0.16578 16057.5 0.000108991
>> 1200 1200 1200 0.662706 5214.98 0.205989 16777.6 0.000141824
>> 1300 1300 1300 0.845952 5194.15 0.27637 15899 0.00018111
>> 1400 1400 1400 1.06712 5142.82 0.332118 16524.2 0.000225958
>> 1500 1500 1500 1.38147 4886.11 0.409224 16494.6 0.000278265
>> 1600 1600 1600 1.72238 4756.21 0.492314 16639.8 0.000338095
>> 1700 1700 1700 2.38508 4119.77 0.603566 16279.9 0.000404362
>> 1800 1800 1800 3.12034 3738.05 0.717409 16258.5 0.000481575
>> 1900 1900 1900 3.93668 3484.66 0.824933 16629.2 0.000564727
>> 2000 2000 2000 4.76038 3361.07 0.941643 16991.6 0.00065862
>> 2100 2100 2100 5.90627 3135.99 1.12226 16504.2 0.000762307
>> 2200 2200 2200 7.26419 2931.64 1.28213 16609.9 0.000876699
>> 2300 2300 2300 8.88171 2739.79 1.45247 16753.5 0.00100222
>> 2400 2400 2400 10.4956 2634.26 1.62705 16992.7 0.00113566
>> 2500 2500 2500 11.913 2623.18 1.87499 16666.7 0.00128371
>> 2600 2600 2600 13.7057 2564.77 2.1156 16615.6 0.00144259
>> 2700 2700 2700 15.5959 2524.13 2.33957 16826.1 0.00161501
>> 2800 2800 2800 17.1121 2565.67 2.57445 17053.8 0.00179901
>> 2900 2900 2900 19.4167 2512.16 2.92445 16679.4 0.00199764
>> 3000 3000 3000 21.3239 2532.37 3.18891 16933.7 0.00220999
>> 3100 3100 3100 23.5049 2534.88 3.5305 16876.4 0.00243845
>> 3200 3200 3200 25.7362 2546.45 3.81708 17169.1 0.00267581
>> 3300 3300 3300 28.4467 2526.62 4.25869 16877 0.00293513
>> 3400 3400 3400 30.4607 2580.63 4.67999 16796.6 0.00320688
>> 3500 3500 3500 33.7737 2538.96 5.04289 17004.1 0.00349667
>> 3600 3600 3600 36.9633 2524.45 5.414 17235.3 0.00380237
>> 3700 3700 3700 39.5153 2563.71 6.04875 16748.2 0.00412583
>> 3800 3800 3800 42.9412 2555.68 6.48985 16910.1 0.00446785
>> 3900 3900 3900 46.5282 2549.81 7.05844 16808 0.00482701
>> 4000 4000 4000 50.2218 2548.69 7.42442 17240.4 0.00520272
>>
>> As the generated machine code is completely different, I guess gcc notices the aligned alloc, and uses the alignment information for optimisation.
>>
>> Cheers,
>>
>> Imre
>>
>>
>> On Friday, 22 January 2016, 15:09, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>>
>>
>> Hi Imre,
>>
>> thanks for running the benchmarks. Of course you are right that using aligned memory for the buffers improves
>> performance. I also did not really put any effort into optimising the parameters MC, NC, KC, MR and NR. I will
>> compare different variants and report the results on the website
>>
>> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>>
>> I modified my benchmark program such that it also computes the FLOPS as
>>
>> FLOPS = 2*m*n*k/time_elapsed
>>
>> See
>>
>> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
>>
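>> In the tables this shows up as the MFLOPS columns, i.e. something like the following around the
>> timed call (a sketch; the variable names are not necessarily the ones used in matprod.cc):
>>
>> // needs <chrono>
>> auto t0 = std::chrono::steady_clock::now();
>> // ... C = beta*C + alpha*A*B ...
>> auto t1 = std::chrono::steady_clock::now();
>> double time_elapsed = std::chrono::duration<double>(t1 - t0).count();   // seconds
>> double mflops       = 2.0*m*n*k / time_elapsed / 1e6;
>>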
>> Could you re-run your benchmarks and post the different MFLOPS you get? That is important for actually tuning things.
>> On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS instead of 25.6 GFLOPS). So
>> a speedup of 2.5 would be impressive but still far from peak performance.
>>
>> Cheers,
>>
>> Michael
>>
>>
>> On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:
>>
>>> Sorry for posting more or less the same thing twice. I got confused by the JavaScript interface.
>>>
>>> It seems I also forgot to enable AVX for my last measurements. With that plus my blocking and alignment changes, performance according to my tests is something like 250% higher than with Michael's original code (with AVX).
>>>
>>> Cheers,
>>>
>>> Imre
>>>
>>>
>>> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
>>>
>>>
>>> Hi Michael,
>>>
>>> your block sizes are far from optimal. MR & NR should be multiples of the L1 cache line size (i.e. 16 for double on Intel). Also, the blocks should be allocated aligned to L1 cache lines (e.g., via posix_memalign()).
>>>
>>> This alone brought something like 50% speedup for my square matrix test.
>>>
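>>> In code the aligned allocation is just something like this (a sketch; MC and KC as in gemm.hpp,
>>> needs <cstdlib>):
>>>
>>> // 64 bytes covers the L1 cache line on current Intel CPUs
>>> double *A_buffer = nullptr;
>>> if (posix_memalign(reinterpret_cast<void **>(&A_buffer), 64, MC*KC*sizeof(double)) != 0) {
>>>     A_buffer = nullptr;                 // allocation failed
>>> }
>>> // ... pack the A-blocks into A_buffer, run the micro-kernel ...
>>> free(A_buffer);
>>>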
>>> I will have a look at the other parameters + the whole thing via perf during the weekend.
>>>
>>> Cheers,
>>>
>>> Imre
>>>
>>>
>>>
>>> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]" <ublas-request_at_[hidden]> wrote:
>>>
>>>
>>> Subject: Re: [ublas] Matrix multiplication performance
>>> Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B_at_[hidden]>
>>> Content-Type: text/plain; charset="windows-1252"
>>>
>>> Hi Nasos,
>>>
>>> first of all I don’t want to take credit that isn’t mine, and want to point out that this is not my algorithm. It is based on
>>>
>>> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>>
>>> https://github.com/flame/blis
>>>
>>> For a few cores (4-8) it can easily be made multithreaded. For many-core processors like the Intel Xeon Phi this is a bit more
>>> sophisticated, but still not too hard. The demo I posted does not use micro-kernels that exploit SSE, AVX or
>>> FMA instructions. With such micro-kernels the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote
>>> my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
>>>
>>> Maybe you could help me integrate your code into the benchmark example I posted above.
>>>
>>> About Blaze: Do they have their own implementation of a matrix-matrix product? It seems to require a
>>> tuned BLAS implementation (“Otherwise you get only poor performance”) for the matrix-matrix product.
>>> IMHO they only have tuned the “easy” stuff like BLAS Level 1 and Level 2. In that case it makes more
>>> sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong,
>>> let me know.
>>>
>>> About the block sizes: In my experience you get better performance if you choose them dynamically at runtime
>>> depending on the problem size. Depending on the architecture you can just specify ranges like 256 - 384 for the
>>> blocking factor MC. In my code it then also needs to satisfy the restriction that it is divisible by the factor MR.
>>> I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible and having
>>> blocks of similar sizes easily pays off.
>>>
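>>> Picking MC at runtime can be as simple as the following (a sketch of the idea, not the exact
>>> code I use):
>>>
>>> // needs <cstddef>; assumes m > 0
>>> // choose MC around the 256 - 384 range so that the row panels have similar heights
>>> // and MC is divisible by MR
>>> std::size_t
>>> chooseMC(std::size_t m, std::size_t MR)
>>> {
>>>     const std::size_t MC_max = 384;
>>>     std::size_t panels = (m + MC_max - 1) / MC_max;   // fewest panels of height <= MC_max
>>>     std::size_t mc     = (m + panels - 1) / panels;   // panels of similar height
>>>     return ((mc + MR - 1) / MR) * MR;                 // round up to a multiple of MR
>>> }
>>>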
>>> Cheers,
>>>
>>> Michael
>>> *************************************
>>>
>>>
>>>
>>>
>>
>>
>>
>