Subject: Re: [ublas] Matrix multiplication performance
From: Oswin Krause (Oswin.Krause_at_[hidden])
Date: 2016-01-29 02:50:44
Hi,
I would like to contribute some benchmarks as well. Is the code
available for testing?
Best,
Oswin
On 2016-01-28 19:49, Michael Lehn wrote:
> In the meantime, here are some results from my Haswell machine. It has
> 4 quad cores, but there are other jobs running, so I went up to 8
> threads. Anyway, the parallelisation is simple; for the maximal matrix
> dimension N=M=K=4000 it reaches
>
> 1) 32.9 GFLOPS with 1 thread
> 2) 63 GFLOPS with 2 threads
> 3) 104.6 GFLOPS with 4 threads
> 4) 180.5 GFLOPS with 8 threads
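
To put the four figures above in perspective, speedup and parallel efficiency can be worked out from them directly. The sketch below is only an illustration: the names Sample, speedup, efficiency and quoted_samples are made up here and are not part of matprod.cc. With the 1-thread rate of 32.9 GFLOPS as the baseline, efficiency comes out at roughly 96% for 2 threads, 79% for 4 and 69% for 8.

```cpp
#include <cassert>
#include <vector>

// One measurement: thread count and the rate reached at N=M=K=4000.
struct Sample {
    int threads;
    double gflops;
};

// Speedup relative to the single-threaded rate.
double speedup(const Sample& s, double single_thread_gflops) {
    assert(single_thread_gflops > 0.0);
    return s.gflops / single_thread_gflops;
}

// Parallel efficiency: speedup divided by thread count (1.0 would be ideal).
double efficiency(const Sample& s, double single_thread_gflops) {
    assert(s.threads > 0);
    return speedup(s, single_thread_gflops) / s.threads;
}

// The four figures quoted in the message.
std::vector<Sample> quoted_samples() {
    return {{1, 32.9}, {2, 63.0}, {4, 104.6}, {8, 180.5}};
}
```

Calling efficiency(quoted_samples()[3], 32.9) gives the 8-thread value of about 0.69.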
>
> These numbers are OK for a simple implementation, but it can be done
> better. Above all, it takes much too long (that is, much too big
> problem sizes) to scale well. But for the moment we should focus on a
> good single-threaded implementation and do the parallel stuff the
> right way later, as this will require more than just a single
> #pragma omp parallel for
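
For readers who have not used OpenMP, this is roughly what "a single #pragma omp parallel for" looks like on a matrix product. It is a minimal sketch on a naive triple loop, assuming row-major storage in std::vector; naive_gemm is a hypothetical helper and not the blocked kernel from matprod.cc, whose source is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major matrices: A is m x k, B is k x n, C is m x n.
// Hypothetical helper for illustration only.
void naive_gemm(std::size_t m, std::size_t n, std::size_t k,
                const std::vector<double>& A,
                const std::vector<double>& B,
                std::vector<double>& C)
{
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);

    // Rows of C are independent, so they can be divided among the threads.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(m); ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t l = 0; l < k; ++l) {
                sum += A[row * k + l] * B[l * n + j];
            }
            C[row * n + j] = sum;
        }
    }
}
```

Splitting only the outer loop like this is the cheap variant; it does nothing about how the threads share caches or packed buffers, which is presumably part of what doing it "the right way" would have to address.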
>
>
>
> [lehn_at_node042 session4]$ g++ -Ofast -Wall -std=c++11 -DNDEBUG -DHAVE_FMA -I ../boost_1_60_0/ -fopenmp matprod.cc
> [lehn_at_node042 session4]$ export OMP_NUM_THREADS=1; ./a.out
> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
> 100 100 100 0.00119632 1671.79 0.00089036 2246.28 3.90562e-14
> 200 200 200 0.00322943 4954.44 0.00082579 19375.4 1.50135e-15
> 300 300 300 0.0108177 4991.81 0.00221283 24403.1 2.18434e-16
> 400 400 400 0.0247278 5176.35 0.00429661 29790.9 5.58593e-17
> 500 500 500 0.053677 4657.49 0.00822185 30406.8 1.94899e-17
> 600 600 600 0.0820133 5267.44 0.0136631 31617.9 1.16524e-17
> 700 700 700 0.129231 5308.34 0.0208619 32882.9 6.82385e-18
> 800 800 800 0.19206 5331.67 0.0309358 33100.8 4.08617e-18
> 900 900 900 0.272354 5353.34 0.0430091 33899.8 2.54117e-18
> 1000 1000 1000 0.372831 5364.36 0.0582482 34335.8 1.64011e-18
> 1100 1100 1100 0.494906 5378.8 0.0796676 33413.8 1.08587e-18
> 1200 1200 1200 0.642926 5375.43 0.098814 34974.8 7.43828e-19
> 1300 1300 1300 0.815164 5390.32 0.125541 35000.5 5.26152e-19
> 1400 1400 1400 1.04147 5269.48 0.154808 35450.4 3.81507e-19
> 1500 1500 1500 1.24516 5420.99 0.187327 36033.2 2.82388e-19
> 1600 1600 1600 1.5581 5257.68 0.236257 34674 2.12031e-19
> 1700 1700 1700 2.57574 3814.82 0.273446 35933.9 1.61384e-19
> 1800 1800 1800 3.24948 3589.5 0.319974 36453 1.25033e-19
> 1900 1900 1900 4.01719 3414.82 0.378235 36268.4 9.86666e-20
> 2000 2000 2000 4.82997 3312.65 0.438886 36456 7.86863e-20
> 2100 2100 2100 5.88206 3148.89 0.517821 35769.1 6.31726e-20
> 2200 2200 2200 6.87358 3098.24 0.590235 36080.6 5.1152e-20
> 2300 2300 2300 8.08021 3011.55 0.659934 36873.4 4.19219e-20
> 2400 2400 2400 9.31063 2969.51 0.748285 36948.5 3.46865e-20
> 2500 2500 2500 10.5343 2966.51 0.84448 37005 2.88942e-20
> 2600 2600 2600 11.8768 2959.71 0.984227 35715.3 2.42294e-20
> 2700 2700 2700 13.3378 2951.45 1.06838 36846.5 2.04036e-20
> 2800 2800 2800 14.9304 2940.57 1.18762 36968.2 1.73201e-20
> 2900 2900 2900 16.8965 2886.87 1.33445 36552.8 1.47904e-20
> 3000 3000 3000 18.7376 2881.9 1.49449 36132.7 1.27205e-20
> 3100 3100 3100 20.8439 2858.48 1.66163 35857.5 1.09759e-20
> 3200 3200 3200 22.9032 2861.44 1.82771 35856.9 9.49415e-21
> 3300 3300 3300 28.2407 2545.05 2.08438 34482.2 8.25868e-21
> 3400 3400 3400 27.5374 2854.6 2.18449 35984.7 7.22064e-21
> 3500 3500 3500 29.925 2865.5 2.34372 36587.1 6.34137e-21
> 3600 3600 3600 32.6588 2857.17 2.56586 36366.7 5.5874e-21
> 3700 3700 3700 34.5032 2936.14 2.77154 36552.2 4.92873e-21
> 3800 3800 3800 36.9099 2973.29 2.97732 36860.1 4.36811e-21
> 3900 3900 3900 44.6497 2657.09 3.24271 36586.1 3.88313e-21
> 4000 4000 4000 56.9767 2246.53 3.88046 32985.8 3.46672e-21
> [lehn_at_node042 session4]$ export OMP_NUM_THREADS=2; ./a.out
> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
> 100 100 100 0.00120386 1661.33 0.000876976 2280.56 3.95867e-14
> 200 200 200 0.00323702 4942.82 0.00099518 16077.5 1.50256e-15
> 300 300 300 0.0106352 5077.5 0.00286667 18837.2 2.19644e-16
> 400 400 400 0.0247765 5166.19 0.00610925 20951.8 5.61969e-17
> 500 500 500 0.0478359 5226.2 0.00707235 35348.9 1.94268e-17
> 600 600 600 0.082058 5264.57 0.0108406 39850.3 1.16982e-17
> 700 700 700 0.129637 5291.71 0.0170924 40134.8 6.8281e-18
> 800 800 800 0.1925 5319.48 0.0214161 47814.4 4.09348e-18
> 900 900 900 0.273022 5340.22 0.0298684 48814.2 2.54562e-18
> 1000 1000 1000 0.373113 5360.3 0.0417747 47875.9 1.64027e-18
> 1100 1100 1100 0.499034 5334.3 0.0527302 50483.4 1.08356e-18
> 1200 1200 1200 0.64351 5370.55 0.0624654 55326.6 7.44302e-19
> 1300 1300 1300 0.829601 5296.52 0.0793488 55375.8 5.25547e-19
> 1400 1400 1400 1.13615 4830.35 0.0937135 58561.5 3.8117e-19
> 1500 1500 1500 1.38215 4883.71 0.11078 60931.4 2.82628e-19
> 1600 1600 1600 2.34569 3492.37 0.148535 55152.1 2.11636e-19
> 1700 1700 1700 2.80764 3499.73 0.166754 58925.2 1.61617e-19
> 1800 1800 1800 3.65597 3190.4 0.183227 63658.6 1.25225e-19
> 1900 1900 1900 6.04791 2268.22 0.229272 59832.8 9.8624e-20
> 2000 2000 2000 5.41562 2954.41 0.244907 65331 7.8709e-20
> 2100 2100 2100 5.79329 3197.15 0.320638 57766.1 6.31124e-20
> 2200 2200 2200 10.1105 2106.32 0.348126 61173.2 5.11424e-20
> 2300 2300 2300 11.746 2071.68 0.385373 63144 4.18844e-20
> 2400 2400 2400 13.4099 2061.77 0.438608 63035.8 3.46829e-20
> 2500 2500 2500 14.8645 2102.32 0.491434 63589.4 2.88839e-20
> 2600 2600 2600 17.1602 2048.46 0.550163 63893.8 2.42378e-20
> 2700 2700 2700 19.24 2046.05 0.616314 63873.3 2.03993e-20
> 2800 2800 2800 14.8633 2953.85 0.675975 64949.2 1.73082e-20
> 2900 2900 2900 18.533 2631.96 0.72636 67154.1 1.47984e-20
> 3000 3000 3000 18.2701 2955.64 0.804625 67112 1.27211e-20
> 3100 3100 3100 20.2371 2944.19 0.938507 63485.9 1.09831e-20
> 3200 3200 3200 22.6838 2889.11 1.07581 60918.1 9.49232e-21
> 3300 3300 3300 25.0228 2872.33 1.06473 67504.6 8.25942e-21
> 3400 3400 3400 27.3561 2873.51 1.16247 67621.6 7.21511e-21
> 3500 3500 3500 29.7889 2878.59 1.32098 64913.8 6.3385e-21
> 3600 3600 3600 34.8098 2680.62 1.37908 67662.7 5.58738e-21
> 3700 3700 3700 37.6151 2693.23 1.52253 66538.1 4.92976e-21
> 3800 3800 3800 38.99 2814.67 1.63282 67211.5 4.36537e-21
> 3900 3900 3900 57.5765 2060.53 1.75221 67707.4 3.88246e-21
> 4000 4000 4000 51.1335 2503.25 2.03062 63035.1 3.46549e-21
> [lehn_at_node042 session4]$ export OMP_NUM_THREADS=4; ./a.out
> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
> 100 100 100 0.00119733 1670.39 0.00124331 1608.61 3.84618e-14
> 200 200 200 0.00427996 3738.35 0.000965206 16576.8 1.47604e-15
> 300 300 300 0.0146617 3683.06 0.00235442 22935.6 2.18643e-16
> 400 400 400 0.0301558 4244.62 0.00431311 29677 5.57089e-17
> 500 500 500 0.0509763 4904.24 0.00541684 46152.4 1.94817e-17
> 600 600 600 0.0823676 5244.78 0.00815973 52943 1.16851e-17
> 700 700 700 0.131064 5234.07 0.0133055 51557.7 6.81692e-18
> 800 800 800 0.198438 5160.3 0.0208701 49065.4 4.09087e-18
> 900 900 900 0.273346 5333.91 0.0244156 59716 2.53963e-18
> 1000 1000 1000 0.374021 5347.3 0.0252625 79168.7 1.64654e-18
> 1100 1100 1100 0.502426 5298.29 0.05022 53006.7 1.08395e-18
> 1200 1200 1200 0.865696 3992.16 0.0443738 77883.9 7.44661e-19
> 1300 1300 1300 1.00063 4391.23 0.0544683 80670.8 5.25559e-19
> 1400 1400 1400 1.26828 4327.13 0.0599685 91514.7 3.80933e-19
> 1500 1500 1500 1.3623 4954.86 0.0826977 81622.6 2.8281e-19
> 1600 1600 1600 2.14419 3820.56 0.0940622 87091.3 2.11718e-19
> 1700 1700 1700 2.98106 3296.14 0.104828 93734.3 1.61252e-19
> 1800 1800 1800 4.10679 2840.17 0.125856 92677.2 1.25247e-19
> 1900 1900 1900 7.25737 1890.22 0.137977 99422.2 9.85647e-20
> 2000 2000 2000 9.0378 1770.34 0.195959 81649.8 7.86877e-20
> 2100 2100 2100 7.43091 2492.56 0.205205 90261 6.31814e-20
> 2200 2200 2200 8.01552 2656.84 0.229878 92640.5 5.11206e-20
> 2300 2300 2300 11.3209 2149.47 0.242479 100355 4.19281e-20
> 2400 2400 2400 11.7655 2349.91 0.267819 103234 3.4696e-20
> 2500 2500 2500 14.75 2118.65 0.318302 98177.1 2.89065e-20
> 2600 2600 2600 16.1598 2175.27 0.349963 100445 2.42432e-20
> 2700 2700 2700 19.6465 2003.72 0.384713 102326 2.04284e-20
> 2800 2800 2800 18.5487 2366.95 0.422473 103922 1.73051e-20
> 2900 2900 2900 18.4844 2638.87 0.431616 113012 1.48037e-20
> 3000 3000 3000 18.3601 2941.16 0.487947 110668 1.27205e-20
> 3100 3100 3100 20.1449 2957.67 0.555138 107328 1.09745e-20
> 3200 3200 3200 22.2403 2946.72 0.597566 109672 9.49034e-21
> 3300 3300 3300 24.3526 2951.39 0.635492 113100 8.25459e-21
> 3400 3400 3400 26.5834 2957.04 0.693353 113374 7.22134e-21
> 3500 3500 3500 28.9996 2956.93 0.753307 113831 6.33808e-21
> 3600 3600 3600 31.4492 2967.07 0.793409 117609 5.58761e-21
> 3700 3700 3700 34.9533 2898.33 0.959263 105608 4.93129e-21
> 3800 3800 3800 38.2463 2869.4 1.01686 107924 4.36735e-21
> 3900 3900 3900 42.3957 2798.35 1.08582 109262 3.88282e-21
> 4000 4000 4000 44.7076 2863.05 1.22383 104590 3.469e-21
> [lehn_at_node042 session4]$ export OMP_NUM_THREADS=8; ./a.out
> # m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
> 100 100 100 0.00120762 1656.15 0.001279 1563.72 3.8463e-14
> 200 200 200 0.0036143 4426.86 0.000631185 25349.1 1.48858e-15
> 300 300 300 0.0108139 4993.56 0.00204664 26384.7 2.20015e-16
> 400 400 400 0.0251417 5091.13 0.00316074 40496.9 5.58204e-17
> 500 500 500 0.0482996 5176.03 0.00479854 52099.2 1.9429e-17
> 600 600 600 0.0830052 5204.49 0.0074349 58104.3 1.16567e-17
> 700 700 700 0.13281 5165.28 0.0134778 50898.6 6.82167e-18
> 800 800 800 0.19639 5214.12 0.0143988 71117.2 4.08235e-18
> 900 900 900 0.279542 5215.68 0.0186552 78155 2.54218e-18
> 1000 1000 1000 0.381906 5236.89 0.020541 97366.2 1.63963e-18
> 1100 1100 1100 0.509376 5226 0.0338259 78697.1 1.08399e-18
> 1200 1200 1200 0.760565 4543.99 0.0317094 108990 7.44215e-19
> 1300 1300 1300 1.04442 4207.14 0.0419104 104843 5.25101e-19
> 1400 1400 1400 1.47537 3719.75 0.0450985 121689 3.81236e-19
> 1500 1500 1500 1.90994 3534.15 0.0514728 131137 2.82394e-19
> 1600 1600 1600 1.56705 5227.67 0.0599189 136718 2.11847e-19
> 1700 1700 1700 2.62892 3737.66 0.0756787 129838 1.61316e-19
> 1800 1800 1800 3.29831 3536.35 0.0827417 140969 1.25087e-19
> 1900 1900 1900 4.03473 3399.98 0.0915113 149905 9.85857e-20
> 2000 2000 2000 4.87315 3283.3 0.105251 152017 7.86417e-20
> 2100 2100 2100 5.87975 3150.13 0.123634 149813 6.31281e-20
> 2200 2200 2200 7.06021 3016.34 0.134536 158293 5.11845e-20
> 2300 2300 2300 10.6045 2294.69 0.162671 149590 4.19035e-20
> 2400 2400 2400 9.31785 2967.21 0.160164 172623 3.46453e-20
> 2500 2500 2500 10.4852 2980.38 0.181067 172588 2.89024e-20
> 2600 2600 2600 11.8263 2972.35 0.208792 168359 2.42313e-20
> 2700 2700 2700 13.2755 2965.32 0.226646 173690 2.04063e-20
> 2800 2800 2800 14.8042 2965.65 0.24966 175855 1.73142e-20
> 2900 2900 2900 16.9983 2869.58 0.287892 169432 1.47875e-20
> 3000 3000 3000 19.7129 2739.32 0.330801 163240 1.27204e-20
> 3100 3100 3100 21.4476 2778.02 0.382773 155659 1.09704e-20
> 3200 3200 3200 22.7482 2880.93 0.440904 148640 9.49823e-21
> 3300 3300 3300 25.1449 2858.39 0.416183 172698 8.25712e-21
> 3400 3400 3400 27.4412 2864.6 0.497616 157969 7.2164e-21
> 3500 3500 3500 30.5976 2802.51 0.49974 171589 6.33781e-21
> 3600 3600 3600 32.5102 2870.24 0.558002 167225 5.59021e-21
> 3700 3700 3700 35.3566 2865.26 0.571003 177418 4.93007e-21
> 3800 3800 3800 37.321 2940.55 0.578064 189847 4.36563e-21
> 3900 3900 3900 40.1645 2953.8 0.623894 190157 3.8876e-21
> 4000 4000 4000 43.1753 2964.66 0.709328 180452 3.46575e-21
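
The MFLOPS columns in these tables appear to follow the usual operation count of 2*m*n*k for a matrix product. That is an inference from the numbers, not something the message states: for example, 2*100^3 operations in 0.00119632 s gives about 1671.79 MFLOPS, which matches the first row. A small sketch of that conversion, with gemm_mflops as a made-up helper name:

```cpp
#include <cassert>

// Rate in MFLOPS for an (m x k) * (k x n) product that took `seconds`.
// Assumes 2*m*n*k floating point operations (one multiply and one add
// per inner-loop step); this count is an assumption, see the lead-in.
double gemm_mflops(double m, double n, double k, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds / 1.0e6;
}
```

Under the same assumption, gemm_mflops(4000, 4000, 4000, 3.88046) comes to about 32986, in line with the 32985.8 reported for the blocked version on 1 thread.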
>
>
> On 28 Jan 2016, at 18:41, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>
>> Also, the parallelisation with OpenMP is done pretty cheaply and
>> simply at the moment. So you might also want to check how it scales by
>>
>> export OMP_NUM_THREADS=2; ./a.out
>> export OMP_NUM_THREADS=4; ./a.out
>> export OMP_NUM_THREADS=6; ./a.out
>> ...
>