Subject: Re: [ublas] [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks
From: palik imre (imre_palik_at_[hidden])
Date: 2016-03-10 15:14:24


Forget OpenMP for the time being.
For this gemm implementation to scale beyond a socket boundary, one needs OpenMP version 4.0 or later, and some way to determine the processor topology.
I have no idea how to determine the processor topology in a portable way (or in any way at all outside Linux). If somebody is willing to help with this, then we can pull it off. Otherwise I cannot do it.
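For the Linux-only case, a minimal sketch of what "determining the topology" could look like is below. It groups logical CPUs by socket using the sysfs file `physical_package_id`. The function name `cpus_by_socket` is made up for illustration, and the sketch assumes that CPU ids run contiguously from 0 to `hardware_concurrency() - 1`, which need not hold if some CPUs are offline. It is not portable and is not part of the patch.

```cpp
#include <fstream>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper (not part of uBLAS): map each socket id to the
// logical CPUs that belong to it.
// Linux only: reads /sys/devices/system/cpu/cpuN/topology/physical_package_id.
// Returns an empty map when the sysfs files are not available.
std::map<int, std::vector<int>> cpus_by_socket() {
    std::map<int, std::vector<int>> sockets;
    // May be 0 if the implementation cannot tell; the loop is then skipped.
    const unsigned n = std::thread::hardware_concurrency();
    for (unsigned cpu = 0; cpu < n; ++cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                        "/topology/physical_package_id");
        int pkg = 0;
        // Skip CPUs whose topology file is missing or unreadable.
        if (f >> pkg) sockets[pkg].push_back(static_cast<int>(cpu));
    }
    return sockets;
}
```

On a two-socket machine the returned map would be expected to have two keys. On other operating systems it simply comes back empty, which is exactly the portability gap described above.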
BTW, some measurements of the basic idea on a 2*8 core Ivy Bridge:
Single threaded:
#   m uBLAS:   t1      MFLOPS  Blocked:   t2     MFLOPS       Diff nrm1
  100  0.000231629    8634.5  0.000179467   11144.1            0 0.000139046   14383.7            0
  200   0.00146996   10884.6  0.000969961   16495.5            0 0.000941657   16991.3            0
  300   0.00366689   14726.4   0.00328796   16423.6            0  0.00327107   16508.4            0
  400   0.00909807   14068.9   0.00871394   14689.1            0   0.0087016   14709.9            0
  500     0.017248   14494.4    0.0167666   14910.6            0   0.0167691   14908.3            0
  600    0.0289742   14909.8    0.0282574     15288            0   0.0282521   15290.9            0
  700    0.0462171     14843    0.0445143   15410.8            0    0.044418   15444.2            0
  800    0.0678457   15093.1    0.0645271   15869.3            0    0.065332   15673.8            0
  900    0.0998233   14605.8    0.0974452   14962.3            0   0.0977101   14921.7            0
 1000     0.133766   14951.4     0.131982   15153.6            0    0.132004   15151.1            0
 1100     0.177154   15026.5     0.175536     15165            0    0.175679   15152.7            0
 1200     0.226595   15251.9     0.223401   15469.9            0    0.223272   15478.9            0
 1300     0.281128   15629.9     0.279018   15748.1            0    0.279577   15716.6            0
 1400     0.358121   15324.4     0.358763     15297            0    0.356956   15374.5            0
 1500     0.441216   15298.6      0.44028   15331.2            0    0.440745     15315            0
 1600     0.539017     15198     0.540438   15158.1            0    0.538543   15211.4            0
 1700     0.631606   15557.2      0.63106   15570.6            0    0.632818   15527.4            0
 1800     0.740764   15745.9     0.740261   15756.6            0    0.740176   15758.4            0
 1900     0.885414   15493.3     0.885684   15488.6            0    0.885026   15500.1            0
 2000      1.04361   15331.3      1.04518   15308.4            0     1.04733     15277            0

8 threads on one socket:
#   m uBLAS:   t1      MFLOPS  Blocked:   t2     MFLOPS       Diff nrm1
  100    0.0013259    1508.4  0.000470025   4255.09            0   0.0004592    4355.4            0
  200  0.000638351   25064.6  0.000245947   65054.7            0 0.000244369   65474.8            0
  300  0.000895587   60295.6  0.000792129   68170.7            0 0.000760058   71047.2            0
  400   0.00231167   55371.3   0.00189499   67546.4            0  0.00190057   67348.1            0
  500     0.003532   70781.4   0.00312475   80006.4            0  0.00313642   79708.8            0
  600   0.00603096   71630.4   0.00529135   81642.7            0  0.00535545   80665.5            0
  700   0.00961982   71311.1   0.00780231   87922.7            0  0.00788078   87047.2            0
  800    0.0143081   71567.8    0.0119397   85764.6            0   0.0123294   83053.5            0
  900    0.0195524   74568.7    0.0175708   82978.6            0   0.0178417   81718.6            0
 1000    0.0245773     81376     0.023048   86775.3            0   0.0233121   85792.3            0
 1100    0.0321172   82883.9    0.0305309   87190.5            0   0.0307586   86544.8            0
 1200    0.0388261   89012.2    0.0372826   92697.5            0   0.0376794   91721.1            0
 1300    0.0479156   91702.8     0.046001   95519.7            0   0.0464177   94662.1            0
 1400    0.0590871   92879.8    0.0571302   96061.3            0   0.0597637   91828.3            0
 1500    0.0717213   94114.3    0.0697929   96714.7            0   0.0699795   96456.8            0
 1600    0.0844559   96997.4     0.084161   97337.3            0   0.0837095   97862.3            0
 1700    0.0980906    100173    0.0979744    100291            0   0.0979657    100300            0
 1800     0.116091    100473     0.115783    100740            0    0.115882    100654            0
 1900     0.135347    101354     0.134988    101624            0    0.136053    100828            0
 2000     0.162126   98688.5     0.161714   98940.2            0    0.162389   98529.1            0

16 threads on 2 sockets:
#   m uBLAS:   t1      MFLOPS  Blocked:   t2     MFLOPS       Diff nrm1
  100   0.00402575   496.801   0.00264263   756.823            0  0.00271382   736.969            0
  200   0.00266775   5997.56   0.00222018   7206.64            0  0.00172627   9268.53            0
  300    0.0037985   14216.1   0.00345001   15652.1            0  0.00337945   15978.9            0
  400   0.00516891   24763.4   0.00502001     25498            0  0.00446124   28691.6            0
  500   0.00761118   32846.4   0.00615818   40596.4            0  0.00594556   42048.2            0
  600    0.0116084   37214.5   0.00914148   47257.1            0   0.0091653   47134.3            0
  700     0.012626   54332.5   0.00967706   70889.3            0   0.0105016   65323.2            0
  800    0.0161321   63476.1      0.01418   72214.5            0   0.0138772   73790.1            0
  900    0.0169242   86148.9    0.0155385   93831.5            0   0.0152458   95632.8            0
 1000    0.0208066   96123.5    0.0171705    116479            0   0.0178461    112069            0
 1100    0.0399168   66688.7    0.0388316   68552.4            0   0.0383423   69427.2            0
 1200    0.0448104   77124.9    0.0422857   81729.8            0   0.0425542   81214.1            0
 1300    0.0509146   86301.4    0.0469193   93650.2            0    0.047686   92144.5            0
 1400    0.0558425   98276.3    0.0579064   94773.7            0   0.0516464    106261            0
 1500    0.0634977    106303    0.0638868    105656            0   0.0598486    112785            0
 1600    0.0703613    116428    0.0725247    112955            0   0.0683293    119890            0
 1700    0.0803688    122261    0.0790288    124334            0     0.07428    132283            0
 1800    0.0860819    135499     0.089678    130065            0   0.0819196    142383            0
 1900     0.098444    139348    0.0951359    144194            0    0.090089    152272            0
 2000     0.107054    149457     0.120315    132985            0    0.102137    156652            0

All three versions are gemm-based.

    On Wednesday, 9 March 2016, 14:43, Nasos Iliopoulos <nasos_i_at_[hidden]> wrote:
 

  I think the matrix abstraction the way it is now is ok. It would be confusing to have a large matrix class.
 
 My current thinking is that we should have a two tier switch. One that detects that openmp is enabled and one that enables parallelization based on user preference:
 
 #if defined(_OPENMP) && defined(BOOST_UBLAS_PARALLEL)
 // parallel code (with runtime switch if needed) goes here
 #else
 #ifdef BOOST_UBLAS_PARALLEL
 #warning "OPENMP not present. boost::ublas parallel mode not enabled."
 #endif
 // serial code goes here
 #endif
 
 to enable parallel mode:
 gcc myfile.cpp -o myexe -fopenmp -DBOOST_UBLAS_PARALLEL
 
 the following does not enable ublas parallel mode but lets the user's openmp code run:
 gcc myfile.cpp -o myexe -fopenmp
 
 this will not enable parallelization at all:
 gcc myfile.cpp -o myexe
 
 
 Essentially, _OPENMP is defined when you pass the -fopenmp argument to gcc, and I suppose the same holds for all other compilers that support the standard.
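As an illustration of the two-tier switch combined with the optional runtime switch mentioned in the code comment, here is a small sketch on a plain axpy loop. The names `g_parallel_enabled`, `set_parallel` and `axpy` are made up for this example and are not uBLAS API. The only OpenMP feature relied on is the `if` clause of `parallel for`, which makes the region run serially when its condition is false.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical runtime switch; not part of uBLAS.
static bool g_parallel_enabled = true;
void set_parallel(bool on) { g_parallel_enabled = on; }

// y += a * x. Assumes y has at least as many elements as x.
// The loop is parallelised only when both compile-time tiers are on
// (_OPENMP and BOOST_UBLAS_PARALLEL) and the runtime flag is true.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    // Signed index, so that older OpenMP implementations accept the loop.
    const long n = static_cast<long>(x.size());
#if defined(_OPENMP) && defined(BOOST_UBLAS_PARALLEL)
    // With the flag false, the `if` clause makes this region run serially.
    #pragma omp parallel for if(g_parallel_enabled)
#endif
    for (long i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Compiled without -fopenmp or without -DBOOST_UBLAS_PARALLEL, the pragma disappears and this is an ordinary serial loop. With both, calling `set_parallel(false)` turns parallel execution off at run time without recompiling.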
 
 * One downside of this approach is that temporarily disabling ublas parallel mode would need some hocus-pocus.
 
 * I think that this approach is better than nothing, and if you can think of a clearer and/or more efficient way, please voice it.
 
 * I would favor the std::thread approach, but thinking about it again I believe we will need to introduce state so that we have a facility to define the number of threads. We could use getenv (http://en.cppreference.com/w/cpp/utility/program/getenv), but this wouldn't allow for changes after the program has started. On the other hand, openmp has state and the user can use it deliberately.
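To make the "state" idea in the last point concrete, here is a rough sketch of a thread-count facility for a std::thread based implementation. The default is read once from an environment variable, and the value can still be changed at run time. Everything here is hypothetical: the namespace `ublas_sketch`, the functions, and the variable name `UBLAS_NUM_THREADS` are invented for illustration and do not exist in uBLAS.

```cpp
#include <atomic>
#include <cstdlib>
#include <thread>

// Hypothetical names throughout; nothing here exists in uBLAS.
namespace ublas_sketch {

// Initial value: the environment variable if it holds a positive integer,
// otherwise the hardware concurrency, otherwise 1.
inline unsigned default_num_threads() {
    if (const char* env = std::getenv("UBLAS_NUM_THREADS")) {
        char* end = nullptr;
        const long v = std::strtol(env, &end, 10);
        if (end != env && *end == '\0' && v > 0)
            return static_cast<unsigned>(v);
    }
    const unsigned hw = std::thread::hardware_concurrency();  // 0 if unknown
    return hw ? hw : 1u;
}

// The single piece of global state, initialised on first use.
inline std::atomic<unsigned>& thread_count() {
    static std::atomic<unsigned> n{default_num_threads()};
    return n;
}

inline unsigned get_num_threads() { return thread_count().load(); }

// Values below 1 are clamped to 1 (serial execution).
inline void set_num_threads(unsigned n) { thread_count().store(n ? n : 1u); }

}  // namespace ublas_sketch
```

The environment variable only provides the starting value, so a later `set_num_threads(1)` would give a way to switch parallel execution off temporarily, which plain getenv alone cannot do.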
 
 -Nasos