Subject: Re: [ublas] Matrix multiplication performance
From: Riccardo Rossi (rrossi_at_[hidden])
Date: 2016-01-24 04:46:07


Isn't the C++11 operator

alignof(...)

what you are looking for?
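
For illustration, a minimal sketch of alignof together with its companion alignas (the 32-byte alignment and the buffer are just an example):

#include <cstdint>
#include <iostream>

int main()
{
    // alignof(T) yields the required alignment of a type (C++11)
    std::cout << "alignof(double) = " << alignof(double) << "\n";

    // alignas requests a stronger alignment for a declaration,
    // e.g. a buffer suitable for 32-byte AVX loads
    alignas(32) double buffer[32];

    std::cout << "buffer % 32 = "
              << reinterpret_cast<std::uintptr_t>(buffer) % 32 << "\n"; // 0
}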

Cheers,
Riccardo

On Sun, Jan 24, 2016 at 8:50 AM, palik imre <imre_palik_at_[hidden]> wrote:

> Hi Michael,
>
> I had a look at your AVX micro-kernel on my old AMD box. Congratulations,
> it is twice as fast as what I managed to get out of my optimised C kernel.
>
> I wonder if I can catch up with that using newer gcc + SIMD arrays or
> intrinsics.
>
> Cheers,
>
> Imre
>
> --------------------------------------------
> On Fri, 22/1/16, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>
> Subject: Re: [ublas] Matrix multiplication performance
> To: "palik imre" <imre_palik_at_[hidden]>
> Cc: "ublas mailing list" <ublas_at_[hidden]>
> Date: Friday, 22 January, 2016, 21:27
>
> Wow Imre!
>
> Ok, that is actually a significant difference :-)
>
> I have just added a new version to my site. Unfortunately the computer I used
> for creating the page does not have posix_memalign(), so I had to hack my own
> function for that. But I think for a proof of concept it will do.
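>
> (Such a hack typically over-allocates and aligns by hand; a minimal sketch of
> the idea, not the actual code from the page:)
>
> #include <cstdint>
> #include <cstdlib>
>
> // Minimal aligned-malloc fallback: over-allocate, round the address up,
> // and stash the original pointer just before the returned block.
> void *aligned_malloc(std::size_t alignment, std::size_t size)
> {
>     // assumes alignment is a power of two
>     void *raw = std::malloc(size + alignment + sizeof(void *));
>     if (!raw)
>         return nullptr;
>     std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
>     addr = (addr + alignment - 1) & ~(std::uintptr_t(alignment) - 1);
>     void *aligned = reinterpret_cast<void *>(addr);
>     static_cast<void **>(aligned)[-1] = raw;   // remember the raw pointer
>     return aligned;
> }
>
> void aligned_free(void *p)
> {
>     if (p)
>         std::free(static_cast<void **>(p)[-1]);
> }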
>
> But most of all I also added micro-kernels that are (fairly) optimised for
> AVX and FMA. The micro-kernel for AVX requires MR=4, NR=8; the one for FMA
> requires MR=4, NR=16. Otherwise the reference implementation gets selected.
> Default values for all parameters can now be overridden when compiling, e.g.
>
> g++ -O3 -Wall -std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc
>
> The optimised micro-kernels are only included when compiled with -DHAVE_AVX
> or -DHAVE_FMA.
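>
> (A sketch of the usual pattern for such overridable defaults; the fallback
> numbers and the BS_D_NR name are placeholders, not the actual defaults:)
>
> // compile-time defaults that -D flags on the command line can override
> #ifndef BS_D_MC
> #define BS_D_MC 256        // placeholder default, overridable via -DBS_D_MC=...
> #endif
>
> #ifndef BS_D_NR            // hypothetical macro name for the NR parameter
> #define BS_D_NR 8          // must be 8 for the AVX kernel, 16 for FMA
> #endif
>
> #if defined(HAVE_AVX) || defined(HAVE_FMA)
> //   ... optimised micro-kernels get compiled in ...
> #else
> //   ... reference micro-kernel gets selected ...
> #endif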
>
> I put all this stuff here:
>
> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html
>
> The tar-ball
>
> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz
>
> contains all required files:
>
> session2/avx.hpp
> session2/fma.hpp
> session2/gemm.hpp
> session2/matprod.cc
>
> Of course I would be interested in how all this performs on other platforms.
>
>
> Cheers,
>
> Michael
>
>
> On 22 Jan 2016, at 16:20, palik imre <imre_palik_at_[hidden]> wrote:
>
> > Hi All,
> >
> > In the meantime I enabled avx2 ...
> >
> > Theoretical CPU performance maximum: clock * vector-size * ALUs/core * 2 ~ 69.6 GFLOPS.
> > So the tuned run below is at ~25% of that.
> >
> > Compiler is gcc 4.8.3.
> >
> > vanilla run:
> > $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
> > $ ./matprod
> > #    m     n     k    uBLAS: t1     MFLOPS    Blocked: t2   MFLOPS    Diff nrm1
> >    100   100   100   0.000350402   5707.73   0.00116104    1722.59   0
> >    200   200   200   0.00445094    3594.74   0.00819996    1951.23   0
> >    300   300   300   0.0138515     3898.49   0.0266515     2026.15   1.06987e-06
> >    400   400   400   0.0266447     4803.96   0.0613506     2086.37   4.01475e-06
> >    500   500   500   0.0424372     5891.06   0.119345      2094.77   8.99605e-06
> >    600   600   600   0.0724648     5961.51   0.203187      2126.12   1.63618e-05
> >    700   700   700   0.115464      5941.25   0.325834      2105.36   2.69547e-05
> >    800   800   800   0.173003      5918.98   0.480655      2130.43   4.09449e-05
> >    900   900   900   0.248077      5877.2    0.689972      2113.13   5.87376e-05
> >   1000  1000  1000   0.33781       5920.49   0.930591      2149.17   8.16264e-05
> >   1100  1100  1100   0.5149        5169.93   1.25507       2121      0.000108883
> >   1200  1200  1200   0.670628      5153.38   1.62732       2123.74   0.000141876
> >   1300  1300  1300   0.852692      5153.09   2.06708       2125.71   0.000180926
> >   1400  1400  1400   1.06695       5143.65   2.56183       2142.22   0.000225975
> >   1500  1500  1500   1.3874        4865.2    3.16532       2132.49   0.000278553
> >   1600  1600  1600   1.77623       4612.03   3.8137        2148.05   0.000338106
> >   1700  1700  1700   2.3773        4133.26   4.56665       2151.69   0.000404458
> >   1800  1800  1800   3.06381       3807.03   5.40317       2158.73   0.00048119
> >   1900  1900  1900   3.9039        3513.92   6.37295       2152.53   0.000564692
> >   2000  2000  2000   4.79166       3339.13   7.43399       2152.28   0.000659714
> >   2100  2100  2100   6.04946       3061.76   8.62429       2147.65   0.000762223
> >   2200  2200  2200   7.39085       2881.4    9.86237       2159.32   0.000875624
> >   2300  2300  2300   9.00453       2702.42   11.2513       2162.78   0.00100184
> >   2400  2400  2400   10.3952       2659.68   12.7491       2168.62   0.00113563
> >   2500  2500  2500   12.2283       2555.55   14.4615       2160.92   0.00128336
> >   2600  2600  2600   13.8912       2530.51   16.1965       2170.34   0.00144304
> >   2700  2700  2700   15.391        2557.72   18.1998       2162.99   0.00161411
> >   2800  2800  2800   17.5673       2499.19   20.2171       2171.63   0.00180035
> >   2900  2900  2900   19.4621       2506.31   22.5482       2163.28   0.00199765
> >   3000  3000  3000   21.4506       2517.42   24.9477       2164.53   0.00221028
> >   3100  3100  3100   23.71         2512.95   27.5144       2165.48   0.00243877
> >   3200  3200  3200   25.9051       2529.85   30.2816       2164.22   0.00267766
> >   3300  3300  3300   28.1949       2549.18   33.176        2166.45   0.00293379
> >   3400  3400  3400   30.7235       2558.56   36.0156       2182.61   0.0032087
> >   3500  3500  3500   34.0419       2518.95   39.3929       2176.79   0.00349827
> >   3600  3600  3600   37.0562       2518.12   42.7524       2182.62   0.00380447
> >   3700  3700  3700   39.7885       2546.11   46.4748       2179.81   0.00412621
> >   3800  3800  3800   43.6607       2513.56   50.2119       2185.62   0.0044694
> >   3900  3900  3900   46.5104       2550.78   54.4822       2177.56   0.00482355
> >   4000  4000  4000   50.6098       2529.15   58.7686       2178.03   0.00520289
> >
> >
> > tuned run:
> >
> > $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
> > $ ./matprod2
> > #    m     n     k    uBLAS: t1     MFLOPS    Blocked: t2   MFLOPS    Diff nrm1
> >    100   100   100   0.000351671   5687.13   0.000316612   6316.88   0
> >    200   200   200   0.00419531    3813.78   0.00159044    10060.1   0
> >    300   300   300   0.0141153     3825.62   0.00421113    12823.2   1.07645e-06
> >    400   400   400   0.0291599     4389.59   0.00858138    14916     4.00614e-06
> >    500   500   500   0.0483492     5170.72   0.0166519     15013.3   8.96808e-06
> >    600   600   600   0.0725783     5952.19   0.0279634     15448.7   1.63386e-05
> >    700   700   700   0.113891      6023.29   0.043077      15925     2.69191e-05
> >    800   800   800   0.171416      5973.79   0.0627796     16311     4.09782e-05
> >    900   900   900   0.243677      5983.32   0.0922766     15800.3   5.88092e-05
> >   1000  1000  1000   0.335158      5967.33   0.123339      16215.5   8.15988e-05
> >   1100  1100  1100   0.515776      5161.15   0.16578       16057.5   0.000108991
> >   1200  1200  1200   0.662706      5214.98   0.205989      16777.6   0.000141824
> >   1300  1300  1300   0.845952      5194.15   0.27637       15899     0.00018111
> >   1400  1400  1400   1.06712       5142.82   0.332118      16524.2   0.000225958
> >   1500  1500  1500   1.38147       4886.11   0.409224      16494.6   0.000278265
> >   1600  1600  1600   1.72238       4756.21   0.492314      16639.8   0.000338095
> >   1700  1700  1700   2.38508       4119.77   0.603566      16279.9   0.000404362
> >   1800  1800  1800   3.12034       3738.05   0.717409      16258.5   0.000481575
> >   1900  1900  1900   3.93668       3484.66   0.824933      16629.2   0.000564727
> >   2000  2000  2000   4.76038       3361.07   0.941643      16991.6   0.00065862
> >   2100  2100  2100   5.90627       3135.99   1.12226       16504.2   0.000762307
> >   2200  2200  2200   7.26419       2931.64   1.28213       16609.9   0.000876699
> >   2300  2300  2300   8.88171       2739.79   1.45247       16753.5   0.00100222
> >   2400  2400  2400   10.4956       2634.26   1.62705       16992.7   0.00113566
> >   2500  2500  2500   11.913        2623.18   1.87499       16666.7   0.00128371
> >   2600  2600  2600   13.7057       2564.77   2.1156        16615.6   0.00144259
> >   2700  2700  2700   15.5959       2524.13   2.33957       16826.1   0.00161501
> >   2800  2800  2800   17.1121       2565.67   2.57445       17053.8   0.00179901
> >   2900  2900  2900   19.4167       2512.16   2.92445       16679.4   0.00199764
> >   3000  3000  3000   21.3239       2532.37   3.18891       16933.7   0.00220999
> >   3100  3100  3100   23.5049       2534.88   3.5305        16876.4   0.00243845
> >   3200  3200  3200   25.7362       2546.45   3.81708       17169.1   0.00267581
> >   3300  3300  3300   28.4467       2526.62   4.25869       16877     0.00293513
> >   3400  3400  3400   30.4607       2580.63   4.67999       16796.6   0.00320688
> >   3500  3500  3500   33.7737       2538.96   5.04289       17004.1   0.00349667
> >   3600  3600  3600   36.9633       2524.45   5.414         17235.3   0.00380237
> >   3700  3700  3700   39.5153       2563.71   6.04875       16748.2   0.00412583
> >   3800  3800  3800   42.9412       2555.68   6.48985       16910.1   0.00446785
> >   3900  3900  3900   46.5282       2549.81   7.05844       16808     0.00482701
> >   4000  4000  4000   50.2218       2548.69   7.42442       17240.4   0.00520272
> >
> >
> > As the generated machine code is completely different, I guess gcc notices
> > the aligned alloc and uses the alignment information for optimisation.
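> >
> > (If that is the case, the same hint can also be given explicitly; a minimal
> > GCC-specific sketch, not taken from the benchmark code:)
> >
> > #include <cstddef>
> >
> > // Tell GCC outright that x is 32-byte aligned, so the loop body is
> > // eligible for aligned vector loads/stores without a runtime prologue.
> > void scale(double *x, std::size_t n, double alpha)
> > {
> >     double *xa = static_cast<double *>(__builtin_assume_aligned(x, 32));
> >     for (std::size_t i = 0; i < n; ++i)
> >         xa[i] *= alpha;
> > }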
> >
> > Cheers,
> >
> > Imre
> >
> >
> > On Friday, 22 January 2016, 15:09, Michael Lehn <michael.lehn_at_[hidden]> wrote:
> >
> >
> > Hi Imre,
> >
> > thanks for running the benchmarks. Of course you are right that using
> > aligned memory for the buffers improves performance. I also did not really
> > put any effort into optimising the parameters MC, NC, KC, MR and NR. I will
> > compare different variants and report them on the website
> >
> > http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
> >
> > I modified my benchmark program such that it also computes the FLOPS as
> >
> > FLOPS = 2*m*n*k / time_elapsed
> >
> > See
> >
> > http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
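> >
> > (For reference, a minimal sketch of such a measurement using std::chrono;
> > the actual benchmark may time things differently:)
> >
> > #include <chrono>
> > #include <cstddef>
> >
> > // run a matrix product once and report MFLOPS = 2*m*n*k / time_elapsed / 1e6
> > template <typename F>
> > double mflops(std::size_t m, std::size_t n, std::size_t k, F &&matprod)
> > {
> >     auto t0 = std::chrono::high_resolution_clock::now();
> >     matprod();
> >     std::chrono::duration<double> dt =
> >         std::chrono::high_resolution_clock::now() - t0;
> >     return 2.0 * m * n * k / dt.count() / 1e6;
> > }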
> >
> > Could you re-run your benchmarks and post the different MFLOPS you get?
> > That is important for actually tuning things. On my machine my code only
> > reaches 20% of the peak performance (about 5 GFLOPS instead of 25.6 GFLOPS).
> > So a speedup of 2.5 would be impressive but still far from peak performance.
> >
> >
> > Cheers,
> >
> > Michael
> >
> >
> > On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:
> >
> >> Sorry for posting more or less the same thing twice. I got confused with
> >> the javascript interfaces.
> >>
> >> It seems I also forgot to enable AVX for my last measurements. With that
> >> plus my blocking and alignment changes, performance according to my tests
> >> is something like 250% higher than running Michael's original code (with AVX).
> >>
> >> Cheers,
> >>
> >> Imre
> >>
> >>
> >> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
> >>
> >>
>
> >> Hi Michael,
> >>
> >> your block sizes are far from optimal. MR & NR should be multiples of the
> >> L1 cache line size (i.e. 16 for double on Intel). Also, the blocks should
> >> be allocated aligned to L1 cache lines (e.g., via posix_memalign()), as in
> >> the sketch below.
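> >>
> >> (i.e. something along these lines for the packing buffers; the 64-byte
> >> figure is the usual Intel cache line size:)
> >>
> >> #include <cstdlib>
> >>
> >> // allocate a buffer of n doubles on a 64-byte (cache line) boundary
> >> double *alloc_buffer(std::size_t n)
> >> {
> >>     void *p = nullptr;
> >>     if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
> >>         return nullptr;                 // allocation failed
> >>     return static_cast<double *>(p);
> >> }
> >>
> >> // usage: double *A_buf = alloc_buffer(MC * KC);  ...  free(A_buf);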
> >>
> >> This alone brought something like 50% speedup for my square matrix test.
> >>
> >> I will have a look at the other parameters and at the whole thing via perf
> >> during the weekend.
> >>
> >> Cheers,
> >>
> >> Imre
> >>
> >>
> >>
> >> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]" <ublas-request_at_[hidden]> wrote:
> >>
> >>
>
> >> Subject: Re: [ublas] Matrix multiplication performance
> >>
>
> >> Hi Nasos,
> >>
>
> >> first of all I don't want to take credit that isn't mine, and want to
> >> point out that this is not my algorithm. It is based on
> >>
> >> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
> >>
> >> https://github.com/flame/blis
> >>
> >> For a few cores (4-8) it can easily be made multithreaded. For many-cores
> >> like the Intel Xeon Phi this is a bit more sophisticated, but still not too
> >> hard. The demo I posted does not use micro-kernels that exploit SSE, AVX or
> >> FMA instructions. With those, the matrix product is on par with Intel MKL,
> >> just like BLIS. For my platforms I wrote my own micro-kernels, but the
> >> interface of the function ugemm is compatible with BLIS.
> >>
> >> Maybe you could help me to integrate your code into the benchmark example
> >> I posted above.
> >>
> >> About Blaze: Do they have their own implementation of a matrix-matrix
> >> product? It seems to require a tuned BLAS implementation ("Otherwise you
> >> get only poor performance") for the matrix-matrix product. IMHO they have
> >> only tuned the "easy" stuff like BLAS Level 1 and Level 2. In that case it
> >> makes more sense to compare the performance with the actual underlying
> >> GEMM implementation. But if I am wrong, let me know.
> >>
> >> About the block size: In my experience you get better performance if you
> >> choose the blocks dynamically at runtime depending on the problem size.
> >> Depending on the architecture you can just specify ranges like 256-384 for
> >> the blocking factor MC. In my code it then also needs to satisfy the
> >> restriction that it is divisible by the factor MR (see the sketch below).
> >> I know that doing things at compile time is a C++ fetish, but the runtime
> >> overhead is negligible and having blocks of similar sizes easily pays off.
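> >>
> >> (A sketch of that scheme; the 384 upper bound and the rounding rule are
> >> illustrative, not Michael's actual code:)
> >>
> >> #include <cstddef>
> >>
> >> // pick MC at runtime: roughly at most 384, balanced across the panels
> >> // of m rows, and rounded up so that MC is divisible by MR
> >> std::size_t chooseMC(std::size_t m, std::size_t MR)
> >> {
> >>     if (m <= 384)
> >>         return ((m + MR - 1) / MR) * MR;          // a single panel
> >>     std::size_t panels = (m + 383) / 384;         // fewest panels with MC <= 384
> >>     std::size_t mc = (m + panels - 1) / panels;   // balance the panel sizes
> >>     return ((mc + MR - 1) / MR) * MR;             // make it divisible by MR
> >> }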
> >>
> >> Cheers,
> >>
> >> Michael
> >>

-- 
*Riccardo Rossi*
PhD, Civil Engineer
Member of the Kratos Team: www.cimne.com/kratos
Lecturer at Universitat Politècnica de Catalunya, BarcelonaTech (UPC)
Research fellow at the International Center for Numerical Methods in Engineering (CIMNE)
C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
08034 – Barcelona – Spain – www.cimne.com
T. (+34) 93 401 56 96, skype: *rougered4*