Subject: Re: [ublas] [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks
From: palik imre (imre_palik_at_[hidden])
Date: 2016-03-10 15:14:24
Forget OpenMP for the time being.
For this gemm implementation to scale beyond a socket boundary, one needs OpenMP 4.0 or later, and some way to determine the processor topology.
I have no idea how to determine the processor topology in a portable way (or any way outside Linux). If somebody is willing to help with this, then we can pull it off. Otherwise I cannot do it.
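As a rough illustration of the Linux-only route mentioned above, the sketch below counts sockets by reading the sysfs `physical_package_id` entries. The function name `count_sockets_linux` and the cap of 1024 CPUs are assumptions made for this example; it is not part of the patch and it does not solve the portability problem.

```cpp
#include <cstddef>
#include <fstream>
#include <set>
#include <string>

// Hypothetical helper: counts distinct physical packages (sockets) via Linux sysfs.
// Returns 0 when the information is unavailable (e.g. on non-Linux systems).
std::size_t count_sockets_linux(std::size_t max_cpus = 1024) {
    std::set<int> packages;
    for (std::size_t cpu = 0; cpu < max_cpus; ++cpu) {
        std::ifstream in("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                         "/topology/physical_package_id");
        int id = 0;
        if (in >> id) packages.insert(id);  // missing or offline CPUs are skipped
    }
    return packages.size();
}
```

On a machine like the 2*8 core box below one would expect this to report two packages, but that depends on the kernel exposing the topology files.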
BTW, some measurements of the basic idea on a 2*8 core Ivy Bridge:
Single threaded:
#  m uBLAS:  t1     MFLOPS Blocked:  t2    MFLOPS      Diff nrm1
 100 0.000231629   8634.5 0.000179467  11144.1           0 0.000139046  14383.7           0
 200  0.00146996  10884.6 0.000969961  16495.5           0 0.000941657  16991.3           0
 300  0.00366689  14726.4  0.00328796  16423.6           0 0.00327107  16508.4           0
 400  0.00909807  14068.9  0.00871394  14689.1           0  0.0087016  14709.9           0
 500    0.017248  14494.4   0.0167666  14910.6           0  0.0167691  14908.3           0
 600   0.0289742  14909.8   0.0282574    15288           0  0.0282521  15290.9           0
 700   0.0462171    14843   0.0445143  15410.8           0   0.044418  15444.2           0
 800   0.0678457  15093.1   0.0645271  15869.3           0   0.065332  15673.8           0
 900   0.0998233  14605.8   0.0974452  14962.3           0  0.0977101  14921.7           0
 1000    0.133766  14951.4    0.131982  15153.6           0   0.132004  15151.1           0
 1100    0.177154  15026.5    0.175536    15165           0   0.175679  15152.7           0
 1200    0.226595  15251.9    0.223401  15469.9           0   0.223272  15478.9           0
 1300    0.281128  15629.9    0.279018  15748.1           0   0.279577  15716.6           0
 1400    0.358121  15324.4    0.358763    15297           0   0.356956  15374.5           0
 1500    0.441216  15298.6     0.44028  15331.2           0   0.440745    15315           0
 1600    0.539017    15198    0.540438  15158.1           0   0.538543  15211.4           0
 1700    0.631606  15557.2     0.63106  15570.6           0   0.632818  15527.4           0
 1800    0.740764  15745.9    0.740261  15756.6           0   0.740176  15758.4           0
 1900    0.885414  15493.3    0.885684  15488.6           0   0.885026  15500.1           0
 2000     1.04361  15331.3     1.04518  15308.4           0    1.04733    15277           0
8 threads on one socket:
#  m uBLAS:  t1     MFLOPS Blocked:  t2    MFLOPS      Diff nrm1
 100   0.0013259   1508.4 0.000470025  4255.09           0  0.0004592   4355.4           0
 200 0.000638351  25064.6 0.000245947  65054.7           0 0.000244369  65474.8           0
 300 0.000895587  60295.6 0.000792129  68170.7           0 0.000760058  71047.2           0
 400  0.00231167  55371.3  0.00189499  67546.4           0 0.00190057  67348.1           0
 500    0.003532  70781.4  0.00312475  80006.4           0 0.00313642  79708.8           0
 600  0.00603096  71630.4  0.00529135  81642.7           0 0.00535545  80665.5           0
 700  0.00961982  71311.1  0.00780231  87922.7           0 0.00788078  87047.2           0
 800   0.0143081  71567.8   0.0119397  85764.6           0  0.0123294  83053.5           0
 900   0.0195524  74568.7   0.0175708  82978.6           0  0.0178417  81718.6           0
 1000   0.0245773    81376    0.023048  86775.3           0  0.0233121  85792.3           0
 1100   0.0321172  82883.9   0.0305309  87190.5           0  0.0307586  86544.8           0
 1200   0.0388261  89012.2   0.0372826  92697.5           0  0.0376794  91721.1           0
 1300   0.0479156  91702.8    0.046001  95519.7           0  0.0464177  94662.1           0
 1400   0.0590871  92879.8   0.0571302  96061.3           0  0.0597637  91828.3           0
 1500   0.0717213  94114.3   0.0697929  96714.7           0  0.0699795  96456.8           0
 1600   0.0844559  96997.4    0.084161  97337.3           0  0.0837095  97862.3           0
 1700   0.0980906   100173   0.0979744   100291           0  0.0979657   100300           0
 1800    0.116091   100473    0.115783   100740           0   0.115882   100654           0
 1900    0.135347   101354    0.134988   101624           0   0.136053   100828           0
 2000    0.162126  98688.5    0.161714  98940.2           0   0.162389  98529.1           0
16 threads on 2 sockets:
#  m uBLAS:  t1     MFLOPS Blocked:  t2    MFLOPS      Diff nrm1
 100  0.00402575  496.801  0.00264263  756.823           0 0.00271382  736.969           0
 200  0.00266775  5997.56  0.00222018  7206.64           0 0.00172627  9268.53           0
 300   0.0037985  14216.1  0.00345001  15652.1           0 0.00337945  15978.9           0
 400  0.00516891  24763.4  0.00502001    25498           0 0.00446124  28691.6           0
 500  0.00761118  32846.4  0.00615818  40596.4           0 0.00594556  42048.2           0
 600   0.0116084  37214.5  0.00914148  47257.1           0  0.0091653  47134.3           0
 700    0.012626  54332.5  0.00967706  70889.3           0  0.0105016  65323.2           0
 800   0.0161321  63476.1     0.01418  72214.5           0  0.0138772  73790.1           0
 900   0.0169242  86148.9   0.0155385  93831.5           0  0.0152458  95632.8           0
 1000   0.0208066  96123.5   0.0171705   116479           0  0.0178461   112069           0
 1100   0.0399168  66688.7   0.0388316  68552.4           0  0.0383423  69427.2           0
 1200   0.0448104  77124.9   0.0422857  81729.8           0  0.0425542  81214.1           0
 1300   0.0509146  86301.4   0.0469193  93650.2           0   0.047686  92144.5           0
 1400   0.0558425  98276.3   0.0579064  94773.7           0  0.0516464   106261           0
 1500   0.0634977   106303   0.0638868   105656           0  0.0598486   112785           0
 1600   0.0703613   116428   0.0725247   112955           0  0.0683293   119890           0
 1700   0.0803688   122261   0.0790288   124334           0    0.07428   132283           0
 1800   0.0860819   135499    0.089678   130065           0  0.0819196   142383           0
 1900    0.098444   139348   0.0951359   144194           0   0.090089   152272           0
 2000    0.107054   149457    0.120315   132985           0   0.102137   156652           0
All three versions are gemm-based.
On Wednesday, 9 March 2016, 14:43, Nasos Iliopoulos <nasos_i_at_[hidden]> wrote:
I think the matrix abstraction the way it is now is ok. It would be confusing to have a large matrix class.
My current thinking is that we should have a two-tier switch: one tier that detects that OpenMP is enabled, and one that enables parallelization based on user preference:
#if defined(_OPENMP) && defined(BOOST_UBLAS_PARALLEL)
// parallel code (with runtime switch if needed) goes here
#else
#ifdef BOOST_UBLAS_PARALLEL
#warning "OPENMP not present. boost::ublas parallel mode not enabled."
#endif
// serial code goes here
#endif
to enable parallel mode:
g++ myfile.cpp -o myexe -fopenmp -DBOOST_UBLAS_PARALLEL
the following does not enable ublas parallel mode but lets the user's openmp code run:
g++ myfile.cpp -o myexe -fopenmp
this will not enable parallelization at all:
g++ myfile.cpp -o myexe
essentially _OPENMP is defined when you pass the -fopenmp argument to gcc and I suppose in all other compilers that support the standard.
* One downside of this approach is that temporarily disabling ublas parallel mode would need some hocus-pocus.
* I think that this approach is better than nothing, and if you can think of a clearer and/or more efficient way, please voice it.
* I would favor the std::thread approach, but thinking about it again I believe we would need to introduce state so that we have a facility to define the number of threads. We could use getenv (http://en.cppreference.com/w/cpp/utility/program/getenv), but this wouldn't allow for changes once the program is running. On the other hand, openmp has state and the user can use it deliberately.
-Nasos
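The state discussed in the last point could look roughly like the sketch below: a process-wide thread count that is seeded from the environment once and can still be changed afterwards. Everything here is an assumption made for illustration, including the `sketch` namespace, the function names, and the environment variable `UBLAS_NUM_THREADS`; none of it exists in uBLAS.

```cpp
#include <atomic>
#include <cstdlib>
#include <thread>

namespace sketch {

// Initial value: the environment variable if set and positive,
// otherwise the hardware hint, otherwise 1.
inline unsigned initial_threads() {
    if (const char* env = std::getenv("UBLAS_NUM_THREADS")) {  // hypothetical name
        const long n = std::strtol(env, nullptr, 10);
        if (n > 0) return static_cast<unsigned>(n);
    }
    const unsigned hw = std::thread::hardware_concurrency();  // may return 0
    return hw != 0 ? hw : 1u;
}

// Process-wide state, initialized on first use.
inline std::atomic<unsigned>& thread_count() {
    static std::atomic<unsigned> count{initial_threads()};
    return count;
}

inline unsigned get_num_threads() { return thread_count().load(); }

// Allows changes while the program is running; 0 is clamped to 1.
inline void set_num_threads(unsigned n) { thread_count().store(n != 0 ? n : 1u); }

}  // namespace sketch
```

A caller could write `sketch::set_num_threads(1)` around a region that should stay serial and restore the previous value afterwards, which is the kind of runtime control that getenv alone does not provide.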