Forget OpenMP for the time being.
For this gemm implementation to scale beyond a socket boundary, one needs
OpenMP version 4.0 or later, and some way to determine the processor
topology. I have no idea how to determine the processor topology in a
portable way (or in any way outside Linux). If somebody is willing to help
with this, then we can pull it off. Otherwise I cannot do it.
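For the Linux-only case, here is a rough sketch of how the socket layout could be read. It assumes the usual sysfs files /sys/devices/system/cpu/cpuN/topology/physical_package_id are present; the function name cpus_by_socket and the upper bound of 1024 CPUs are made up for illustration. It is not portable and is not part of the gemm code.

```cpp
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Map from physical package (socket) id to the logical CPU numbers on it.
// Assumption: Linux sysfs layout. Where the files are missing (other
// operating systems, restricted containers) the result is simply empty.
std::map<int, std::vector<int>> cpus_by_socket(
    const std::string& sysfs_root = "/sys/devices/system/cpu")
{
    std::map<int, std::vector<int>> sockets;
    // CPU numbering can have gaps (offline CPUs), so probe a fixed range
    // instead of stopping at the first missing entry.
    const int max_cpus = 1024; // arbitrary bound chosen for this sketch
    for (int cpu = 0; cpu < max_cpus; ++cpu) {
        std::ifstream in(sysfs_root + "/cpu" + std::to_string(cpu) +
                         "/topology/physical_package_id");
        int package = 0;
        if (in >> package) {
            sockets[package].push_back(cpu);
        }
    }
    return sockets;
}
```

The number of keys in the returned map is the socket count, and each value lists the logical CPUs on that socket. With OpenMP 4.0 the placement itself could then be requested through OMP_PLACES=sockets and OMP_PROC_BIND=spread, but whether that is enough for this gemm is untested here.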
# m   uBLAS: t1 [s]  MFLOPS   Blocked: t2 [s]  MFLOPS  Diff nrm1   (third version): t3 [s]  MFLOPS  Diff nrm1
100 0.0013259 1508.4 0.000470025 4255.09 0 0.0004592 4355.4 0
200 0.000638351 25064.6 0.000245947 65054.7 0 0.000244369 65474.8 0
300 0.000895587 60295.6 0.000792129 68170.7 0 0.000760058 71047.2 0
400 0.00231167 55371.3 0.00189499 67546.4 0 0.00190057 67348.1 0
500 0.003532 70781.4 0.00312475 80006.4 0 0.00313642 79708.8 0
600 0.00603096 71630.4 0.00529135 81642.7 0 0.00535545 80665.5 0
700 0.00961982 71311.1 0.00780231 87922.7 0 0.00788078 87047.2 0
800 0.0143081 71567.8 0.0119397 85764.6 0 0.0123294 83053.5 0
900 0.0195524 74568.7 0.0175708 82978.6 0 0.0178417 81718.6 0
1000 0.0245773 81376 0.023048 86775.3 0 0.0233121 85792.3 0
1100 0.0321172 82883.9 0.0305309 87190.5 0 0.0307586 86544.8 0
1200 0.0388261 89012.2 0.0372826 92697.5 0 0.0376794 91721.1 0
1300 0.0479156 91702.8 0.046001 95519.7 0 0.0464177 94662.1 0
1400 0.0590871 92879.8 0.0571302 96061.3 0 0.0597637 91828.3 0
1500 0.0717213 94114.3 0.0697929 96714.7 0 0.0699795 96456.8 0
1600 0.0844559 96997.4 0.084161 97337.3 0 0.0837095 97862.3 0
1700 0.0980906 100173 0.0979744 100291 0 0.0979657 100300 0
1800 0.116091 100473 0.115783 100740 0 0.115882 100654 0
1900 0.135347 101354 0.134988 101624 0 0.136053 100828 0
2000 0.162126 98688.5 0.161714 98940.2 0 0.162389 98529.1 0
16 threads on 2 sockets:
# m   uBLAS: t1 [s]  MFLOPS   Blocked: t2 [s]  MFLOPS  Diff nrm1   (third version): t3 [s]  MFLOPS  Diff nrm1
100 0.00402575 496.801 0.00264263 756.823 0 0.00271382 736.969 0
200 0.00266775 5997.56 0.00222018 7206.64 0 0.00172627 9268.53 0
300 0.0037985 14216.1 0.00345001 15652.1 0 0.00337945 15978.9 0
400 0.00516891 24763.4 0.00502001 25498 0 0.00446124 28691.6 0
500 0.00761118 32846.4 0.00615818 40596.4 0 0.00594556 42048.2 0
600 0.0116084 37214.5 0.00914148 47257.1 0 0.0091653 47134.3 0
700 0.012626 54332.5 0.00967706 70889.3 0 0.0105016 65323.2 0
800 0.0161321 63476.1 0.01418 72214.5 0 0.0138772 73790.1 0
900 0.0169242 86148.9 0.0155385 93831.5 0 0.0152458 95632.8 0
1000 0.0208066 96123.5 0.0171705 116479 0 0.0178461 112069 0
1100 0.0399168 66688.7 0.0388316 68552.4 0 0.0383423 69427.2 0
1200 0.0448104 77124.9 0.0422857 81729.8 0 0.0425542 81214.1 0
1300 0.0509146 86301.4 0.0469193 93650.2 0 0.047686 92144.5 0
1400 0.0558425 98276.3 0.0579064 94773.7 0 0.0516464 106261 0
1500 0.0634977 106303 0.0638868 105656 0 0.0598486 112785 0
1600 0.0703613 116428 0.0725247 112955 0 0.0683293 119890 0
1700 0.0803688 122261 0.0790288 124334 0 0.07428 132283 0
1800 0.0860819 135499 0.089678 130065 0 0.0819196 142383 0
1900 0.098444 139348 0.0951359 144194 0 0.090089 152272 0
2000 0.107054 149457 0.120315 132985 0 0.102137 156652 0
All three versions are gemm-based.
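For reading the tables: the MFLOPS columns appear consistent with counting 2*m^3 floating point operations for an m x m product and dividing by the time in seconds. A small sketch of that conversion follows; the function name gemm_mflops is hypothetical and the 2*m^3 convention is inferred from the numbers, not taken from the benchmark source.

```cpp
#include <cstddef>

// MFLOPS for a product of two m x m matrices that took `seconds` seconds.
// Assumes 2*m^3 operations: one multiply and one add per inner-loop step.
double gemm_mflops(std::size_t m, double seconds)
{
    const double n = static_cast<double>(m);
    const double flops = 2.0 * n * n * n;
    return flops / seconds / 1.0e6;
}
```

For example, gemm_mflops(100, 0.0013259) comes out at roughly 1508.4, which agrees with the uBLAS entry in the first row of the first table.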