Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-28 13:49:06


In the meantime, here are some results from my Haswell machine. It has 4 quad cores, but there are
other jobs running, so I only went up to 8 threads. But anyway, the parallelisation is simple. For
the maximal matrix dimension N=M=K=4000 it reaches

1) 33.0 GFLOPS with 1 thread
2) 63 GFLOPS with 2 threads
3) 104.6 GFLOPS with 4 threads
4) 180.5 GFLOPS with 8 threads

That is OK for a simple implementation, but it can be done better. Most of all, it takes much too long
(or much too big problem sizes) to scale well. But for the moment we should focus on a good single-threaded
implementation and do the parallel stuff the right way later, as this will require more than just a single
#pragma omp parallel for
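
For readers who have not seen it, this is roughly what the "single pragma" style of parallelisation looks like. It is only a sketch on a naive triple loop, not the blocked code in matprod.cc (which is not shown in this mail); the function name gemm_naive_omp and the row-major layout are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from matprod.cc.
// Computes C += A*B for row-major A (m x k), B (k x n), C (m x n).
void gemm_naive_omp(std::size_t m, std::size_t n, std::size_t k,
                    const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C)
{
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);

    // Each row of C is written by exactly one iteration of the outer loop,
    // so the iterations are independent and can be split across threads.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(m); ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        for (std::size_t l = 0; l < k; ++l) {
            const double a = A[row * k + l];
            for (std::size_t j = 0; j < n; ++j) {
                C[row * n + j] += a * B[l * n + j];
            }
        }
    }
}
```

The number of threads is then picked up from OMP_NUM_THREADS at run time, as in the runs below. Splitting only the outer loop is the simplest possible choice and says nothing about how the threads share the caches, which is one reason a well-scaling version needs more than this.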

[lehn_at_node042 session4]$ g++ -Ofast -Wall -std=c++11 -DNDEBUG -DHAVE_FMA -I ../boost_1_60_0/ -fopenmp matprod.cc
[lehn_at_node042 session4]$ export OMP_NUM_THREADS=1; ./a.out
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
  100 100 100 0.00119632 1671.79 0.00089036 2246.28 3.90562e-14
  200 200 200 0.00322943 4954.44 0.00082579 19375.4 1.50135e-15
  300 300 300 0.0108177 4991.81 0.00221283 24403.1 2.18434e-16
  400 400 400 0.0247278 5176.35 0.00429661 29790.9 5.58593e-17
  500 500 500 0.053677 4657.49 0.00822185 30406.8 1.94899e-17
  600 600 600 0.0820133 5267.44 0.0136631 31617.9 1.16524e-17
  700 700 700 0.129231 5308.34 0.0208619 32882.9 6.82385e-18
  800 800 800 0.19206 5331.67 0.0309358 33100.8 4.08617e-18
  900 900 900 0.272354 5353.34 0.0430091 33899.8 2.54117e-18
 1000 1000 1000 0.372831 5364.36 0.0582482 34335.8 1.64011e-18
 1100 1100 1100 0.494906 5378.8 0.0796676 33413.8 1.08587e-18
 1200 1200 1200 0.642926 5375.43 0.098814 34974.8 7.43828e-19
 1300 1300 1300 0.815164 5390.32 0.125541 35000.5 5.26152e-19
 1400 1400 1400 1.04147 5269.48 0.154808 35450.4 3.81507e-19
 1500 1500 1500 1.24516 5420.99 0.187327 36033.2 2.82388e-19
 1600 1600 1600 1.5581 5257.68 0.236257 34674 2.12031e-19
 1700 1700 1700 2.57574 3814.82 0.273446 35933.9 1.61384e-19
 1800 1800 1800 3.24948 3589.5 0.319974 36453 1.25033e-19
 1900 1900 1900 4.01719 3414.82 0.378235 36268.4 9.86666e-20
 2000 2000 2000 4.82997 3312.65 0.438886 36456 7.86863e-20
 2100 2100 2100 5.88206 3148.89 0.517821 35769.1 6.31726e-20
 2200 2200 2200 6.87358 3098.24 0.590235 36080.6 5.1152e-20
 2300 2300 2300 8.08021 3011.55 0.659934 36873.4 4.19219e-20
 2400 2400 2400 9.31063 2969.51 0.748285 36948.5 3.46865e-20
 2500 2500 2500 10.5343 2966.51 0.84448 37005 2.88942e-20
 2600 2600 2600 11.8768 2959.71 0.984227 35715.3 2.42294e-20
 2700 2700 2700 13.3378 2951.45 1.06838 36846.5 2.04036e-20
 2800 2800 2800 14.9304 2940.57 1.18762 36968.2 1.73201e-20
 2900 2900 2900 16.8965 2886.87 1.33445 36552.8 1.47904e-20
 3000 3000 3000 18.7376 2881.9 1.49449 36132.7 1.27205e-20
 3100 3100 3100 20.8439 2858.48 1.66163 35857.5 1.09759e-20
 3200 3200 3200 22.9032 2861.44 1.82771 35856.9 9.49415e-21
 3300 3300 3300 28.2407 2545.05 2.08438 34482.2 8.25868e-21
 3400 3400 3400 27.5374 2854.6 2.18449 35984.7 7.22064e-21
 3500 3500 3500 29.925 2865.5 2.34372 36587.1 6.34137e-21
 3600 3600 3600 32.6588 2857.17 2.56586 36366.7 5.5874e-21
 3700 3700 3700 34.5032 2936.14 2.77154 36552.2 4.92873e-21
 3800 3800 3800 36.9099 2973.29 2.97732 36860.1 4.36811e-21
 3900 3900 3900 44.6497 2657.09 3.24271 36586.1 3.88313e-21
 4000 4000 4000 56.9767 2246.53 3.88046 32985.8 3.46672e-21
[lehn_at_node042 session4]$ export OMP_NUM_THREADS=2; ./a.out
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
  100 100 100 0.00120386 1661.33 0.000876976 2280.56 3.95867e-14
  200 200 200 0.00323702 4942.82 0.00099518 16077.5 1.50256e-15
  300 300 300 0.0106352 5077.5 0.00286667 18837.2 2.19644e-16
  400 400 400 0.0247765 5166.19 0.00610925 20951.8 5.61969e-17
  500 500 500 0.0478359 5226.2 0.00707235 35348.9 1.94268e-17
  600 600 600 0.082058 5264.57 0.0108406 39850.3 1.16982e-17
  700 700 700 0.129637 5291.71 0.0170924 40134.8 6.8281e-18
  800 800 800 0.1925 5319.48 0.0214161 47814.4 4.09348e-18
  900 900 900 0.273022 5340.22 0.0298684 48814.2 2.54562e-18
 1000 1000 1000 0.373113 5360.3 0.0417747 47875.9 1.64027e-18
 1100 1100 1100 0.499034 5334.3 0.0527302 50483.4 1.08356e-18
 1200 1200 1200 0.64351 5370.55 0.0624654 55326.6 7.44302e-19
 1300 1300 1300 0.829601 5296.52 0.0793488 55375.8 5.25547e-19
 1400 1400 1400 1.13615 4830.35 0.0937135 58561.5 3.8117e-19
 1500 1500 1500 1.38215 4883.71 0.11078 60931.4 2.82628e-19
 1600 1600 1600 2.34569 3492.37 0.148535 55152.1 2.11636e-19
 1700 1700 1700 2.80764 3499.73 0.166754 58925.2 1.61617e-19
 1800 1800 1800 3.65597 3190.4 0.183227 63658.6 1.25225e-19
 1900 1900 1900 6.04791 2268.22 0.229272 59832.8 9.8624e-20
 2000 2000 2000 5.41562 2954.41 0.244907 65331 7.8709e-20
 2100 2100 2100 5.79329 3197.15 0.320638 57766.1 6.31124e-20
 2200 2200 2200 10.1105 2106.32 0.348126 61173.2 5.11424e-20
 2300 2300 2300 11.746 2071.68 0.385373 63144 4.18844e-20
 2400 2400 2400 13.4099 2061.77 0.438608 63035.8 3.46829e-20
 2500 2500 2500 14.8645 2102.32 0.491434 63589.4 2.88839e-20
 2600 2600 2600 17.1602 2048.46 0.550163 63893.8 2.42378e-20
 2700 2700 2700 19.24 2046.05 0.616314 63873.3 2.03993e-20
 2800 2800 2800 14.8633 2953.85 0.675975 64949.2 1.73082e-20
 2900 2900 2900 18.533 2631.96 0.72636 67154.1 1.47984e-20
 3000 3000 3000 18.2701 2955.64 0.804625 67112 1.27211e-20
 3100 3100 3100 20.2371 2944.19 0.938507 63485.9 1.09831e-20
 3200 3200 3200 22.6838 2889.11 1.07581 60918.1 9.49232e-21
 3300 3300 3300 25.0228 2872.33 1.06473 67504.6 8.25942e-21
 3400 3400 3400 27.3561 2873.51 1.16247 67621.6 7.21511e-21
 3500 3500 3500 29.7889 2878.59 1.32098 64913.8 6.3385e-21
 3600 3600 3600 34.8098 2680.62 1.37908 67662.7 5.58738e-21
 3700 3700 3700 37.6151 2693.23 1.52253 66538.1 4.92976e-21
 3800 3800 3800 38.99 2814.67 1.63282 67211.5 4.36537e-21
 3900 3900 3900 57.5765 2060.53 1.75221 67707.4 3.88246e-21
 4000 4000 4000 51.1335 2503.25 2.03062 63035.1 3.46549e-21
[lehn_at_node042 session4]$ export OMP_NUM_THREADS=4; ./a.out
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
  100 100 100 0.00119733 1670.39 0.00124331 1608.61 3.84618e-14
  200 200 200 0.00427996 3738.35 0.000965206 16576.8 1.47604e-15
  300 300 300 0.0146617 3683.06 0.00235442 22935.6 2.18643e-16
  400 400 400 0.0301558 4244.62 0.00431311 29677 5.57089e-17
  500 500 500 0.0509763 4904.24 0.00541684 46152.4 1.94817e-17
  600 600 600 0.0823676 5244.78 0.00815973 52943 1.16851e-17
  700 700 700 0.131064 5234.07 0.0133055 51557.7 6.81692e-18
  800 800 800 0.198438 5160.3 0.0208701 49065.4 4.09087e-18
  900 900 900 0.273346 5333.91 0.0244156 59716 2.53963e-18
 1000 1000 1000 0.374021 5347.3 0.0252625 79168.7 1.64654e-18
 1100 1100 1100 0.502426 5298.29 0.05022 53006.7 1.08395e-18
 1200 1200 1200 0.865696 3992.16 0.0443738 77883.9 7.44661e-19
 1300 1300 1300 1.00063 4391.23 0.0544683 80670.8 5.25559e-19
 1400 1400 1400 1.26828 4327.13 0.0599685 91514.7 3.80933e-19
 1500 1500 1500 1.3623 4954.86 0.0826977 81622.6 2.8281e-19
 1600 1600 1600 2.14419 3820.56 0.0940622 87091.3 2.11718e-19
 1700 1700 1700 2.98106 3296.14 0.104828 93734.3 1.61252e-19
 1800 1800 1800 4.10679 2840.17 0.125856 92677.2 1.25247e-19
 1900 1900 1900 7.25737 1890.22 0.137977 99422.2 9.85647e-20
 2000 2000 2000 9.0378 1770.34 0.195959 81649.8 7.86877e-20
 2100 2100 2100 7.43091 2492.56 0.205205 90261 6.31814e-20
 2200 2200 2200 8.01552 2656.84 0.229878 92640.5 5.11206e-20
 2300 2300 2300 11.3209 2149.47 0.242479 100355 4.19281e-20
 2400 2400 2400 11.7655 2349.91 0.267819 103234 3.4696e-20
 2500 2500 2500 14.75 2118.65 0.318302 98177.1 2.89065e-20
 2600 2600 2600 16.1598 2175.27 0.349963 100445 2.42432e-20
 2700 2700 2700 19.6465 2003.72 0.384713 102326 2.04284e-20
 2800 2800 2800 18.5487 2366.95 0.422473 103922 1.73051e-20
 2900 2900 2900 18.4844 2638.87 0.431616 113012 1.48037e-20
 3000 3000 3000 18.3601 2941.16 0.487947 110668 1.27205e-20
 3100 3100 3100 20.1449 2957.67 0.555138 107328 1.09745e-20
 3200 3200 3200 22.2403 2946.72 0.597566 109672 9.49034e-21
 3300 3300 3300 24.3526 2951.39 0.635492 113100 8.25459e-21
 3400 3400 3400 26.5834 2957.04 0.693353 113374 7.22134e-21
 3500 3500 3500 28.9996 2956.93 0.753307 113831 6.33808e-21
 3600 3600 3600 31.4492 2967.07 0.793409 117609 5.58761e-21
 3700 3700 3700 34.9533 2898.33 0.959263 105608 4.93129e-21
 3800 3800 3800 38.2463 2869.4 1.01686 107924 4.36735e-21
 3900 3900 3900 42.3957 2798.35 1.08582 109262 3.88282e-21
 4000 4000 4000 44.7076 2863.05 1.22383 104590 3.469e-21
[lehn_at_node042 session4]$ export OMP_NUM_THREADS=8; ./a.out
# m n k uBLAS: t1 MFLOPS Blocked: t2 MFLOPS Diff nrm1
  100 100 100 0.00120762 1656.15 0.001279 1563.72 3.8463e-14
  200 200 200 0.0036143 4426.86 0.000631185 25349.1 1.48858e-15
  300 300 300 0.0108139 4993.56 0.00204664 26384.7 2.20015e-16
  400 400 400 0.0251417 5091.13 0.00316074 40496.9 5.58204e-17
  500 500 500 0.0482996 5176.03 0.00479854 52099.2 1.9429e-17
  600 600 600 0.0830052 5204.49 0.0074349 58104.3 1.16567e-17
  700 700 700 0.13281 5165.28 0.0134778 50898.6 6.82167e-18
  800 800 800 0.19639 5214.12 0.0143988 71117.2 4.08235e-18
  900 900 900 0.279542 5215.68 0.0186552 78155 2.54218e-18
 1000 1000 1000 0.381906 5236.89 0.020541 97366.2 1.63963e-18
 1100 1100 1100 0.509376 5226 0.0338259 78697.1 1.08399e-18
 1200 1200 1200 0.760565 4543.99 0.0317094 108990 7.44215e-19
 1300 1300 1300 1.04442 4207.14 0.0419104 104843 5.25101e-19
 1400 1400 1400 1.47537 3719.75 0.0450985 121689 3.81236e-19
 1500 1500 1500 1.90994 3534.15 0.0514728 131137 2.82394e-19
 1600 1600 1600 1.56705 5227.67 0.0599189 136718 2.11847e-19
 1700 1700 1700 2.62892 3737.66 0.0756787 129838 1.61316e-19
 1800 1800 1800 3.29831 3536.35 0.0827417 140969 1.25087e-19
 1900 1900 1900 4.03473 3399.98 0.0915113 149905 9.85857e-20
 2000 2000 2000 4.87315 3283.3 0.105251 152017 7.86417e-20
 2100 2100 2100 5.87975 3150.13 0.123634 149813 6.31281e-20
 2200 2200 2200 7.06021 3016.34 0.134536 158293 5.11845e-20
 2300 2300 2300 10.6045 2294.69 0.162671 149590 4.19035e-20
 2400 2400 2400 9.31785 2967.21 0.160164 172623 3.46453e-20
 2500 2500 2500 10.4852 2980.38 0.181067 172588 2.89024e-20
 2600 2600 2600 11.8263 2972.35 0.208792 168359 2.42313e-20
 2700 2700 2700 13.2755 2965.32 0.226646 173690 2.04063e-20
 2800 2800 2800 14.8042 2965.65 0.24966 175855 1.73142e-20
 2900 2900 2900 16.9983 2869.58 0.287892 169432 1.47875e-20
 3000 3000 3000 19.7129 2739.32 0.330801 163240 1.27204e-20
 3100 3100 3100 21.4476 2778.02 0.382773 155659 1.09704e-20
 3200 3200 3200 22.7482 2880.93 0.440904 148640 9.49823e-21
 3300 3300 3300 25.1449 2858.39 0.416183 172698 8.25712e-21
 3400 3400 3400 27.4412 2864.6 0.497616 157969 7.2164e-21
 3500 3500 3500 30.5976 2802.51 0.49974 171589 6.33781e-21
 3600 3600 3600 32.5102 2870.24 0.558002 167225 5.59021e-21
 3700 3700 3700 35.3566 2865.26 0.571003 177418 4.93007e-21
 3800 3800 3800 37.321 2940.55 0.578064 189847 4.36563e-21
 3900 3900 3900 40.1645 2953.8 0.623894 190157 3.8876e-21
 4000 4000 4000 43.1753 2964.66 0.709328 180452 3.46575e-21
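
To read the tables above: the MFLOPS columns are consistent with counting 2*m*n*k floating point operations per product and dividing by the measured time. That count is inferred from the printed figures, since the benchmark source is not part of this mail. The sketch below shows that calculation together with the usual speedup and parallel efficiency ratios; all three function names are made up for the illustration.

```cpp
#include <cassert>

// MFLOPS for an (m x k) * (k x n) product that took `seconds`,
// assuming 2*m*n*k operations (one multiply and one add per term).
double mflops(double m, double n, double k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds / 1.0e6;
}

// Speedup of a parallel run over the single-threaded run of the same size.
double speedup(double t_single, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_single / t_parallel;
}

// Parallel efficiency: fraction of the ideal (linear) speedup achieved.
double efficiency(double t_single, double t_parallel, int threads)
{
    assert(threads > 0);
    return speedup(t_single, t_parallel) / threads;
}
```

With the blocked timings for N=4000 from the 1-thread and 8-thread runs (3.88046 s and 0.709328 s), speedup() gives about 5.5 and efficiency() about 68%, which is the "ok but can be done better" scaling described at the top.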

On 28 Jan 2016, at 18:41, Michael Lehn <michael.lehn_at_[hidden]> wrote:

> Also, the parallelisation with OpenMP is done pretty cheaply and simply at the moment. So you might
> also want to check how it scales with
>
> export OMP_NUM_THREADS=2; ./a.out
> export OMP_NUM_THREADS=4; ./a.out
> export OMP_NUM_THREADS=6; ./a.out
> ...