Subject: Re: [ublas] [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks
From: Nasos Iliopoulos (nasos_i_at_[hidden])
Date: 2016-03-14 14:23:47
Yes, we can add a define:
BOOST_UBLAS_LEGACY_PRODUCT, to enable the old implementation.
-Nasos
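
A minimal sketch of how such a define could select the implementation at compile time. Only BOOST_UBLAS_LEGACY_PRODUCT comes from this thread; the matrix alias and the functions legacy_product, blocked_product and product are hypothetical stand-ins, not uBLAS code, and the block size of 32 is an arbitrary placeholder:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense matrix: row-major n x n storage.
using matrix = std::vector<double>;

// Plain triple loop, playing the role of the old implementation.
inline void legacy_product(const matrix& a, const matrix& b, matrix& c,
                           std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// Tiled loop, playing the role of the gemm-based implementation.
inline void blocked_product(const matrix& a, const matrix& b, matrix& c,
                            std::size_t n, std::size_t bs = 32) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0);
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t kk = 0; kk < n; kk += bs)
            for (std::size_t jj = 0; jj < n; jj += bs)
                for (std::size_t i = ii; i < std::min(ii + bs, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + bs, n); ++k)
                        for (std::size_t j = jj; j < std::min(jj + bs, n); ++j)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// The define picks the code path at compile time, so the default build
// pays no runtime cost for the choice.
inline void product(const matrix& a, const matrix& b, matrix& c,
                    std::size_t n) {
#ifdef BOOST_UBLAS_LEGACY_PRODUCT
    legacy_product(a, b, c, n);
#else
    blocked_product(a, b, c, n);
#endif
}
```

Compiling with -DBOOST_UBLAS_LEGACY_PRODUCT would route every call to the plain loop; without it the tiled version is used.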
On 03/12/2016 03:30 AM, Riccardo Rossi wrote:
> Dear Nasos,
> regarding your OPENMP_SWITCH + CONTROL statement, it looks OK
> to me. However, as you observed in your last email (+1 for the
> testing), for 2*2 matrices the "if" that has to be added to choose
> the blocked implementation will matter.
>
> Can't we have a DEFINE for that too? One option would definitely be
> for users to implement their own SmallDynamicMatrix class (compatible
> with ublas) and use it within their own code. Unfortunately this is
> not an easy feat due to the arcane complexity of the ublas template
> system (my view of it).
>
> cheers
> Riccardo
>
> On Fri, Mar 11, 2016 at 2:20 PM, Nasos Iliopoulos <nasos_i_at_[hidden]
> <mailto:nasos_i_at_[hidden]>> wrote:
>
> Regardless, these are great figures.
>
> Can you please run them comparing the simple uBlas implementation
> for matrices from 2 to 100 with the gemm based one with a single
> thread? I wonder when the control statement starts to play a role.
>
> What do you think should be the plan to switch from multi-core to
> single-threaded so as to not take the communication hit for
> smaller matrices?
>
>
> - Nasos
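
One possible shape for such a switch, sketched under assumptions: OpenMP's "if" clause on the parallel construct makes the region run serially when its condition is false, so small products skip thread start-up. The function name product_with_threshold and the cut-off of 200 are hypothetical; the thread does not settle on a value. The measurements quoted further down (at m = 100 the 8-thread run is slower than the single-threaded one) suggest the cut-off matters, but the right number would have to be measured per machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cut-off; a placeholder, not a measured value.
constexpr long parallel_threshold = 200;

// Row-major n x n product. A signed loop index is used because older
// OpenMP versions require it for parallel loops.
inline void product_with_threshold(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::vector<double>& c, long n) {
    assert(n >= 0);
    const std::size_t size =
        static_cast<std::size_t>(n) * static_cast<std::size_t>(n);
    assert(a.size() == size && b.size() == size && c.size() == size);

    // Without -fopenmp the pragma is compiled out and the loop is serial.
    // With it, the if clause keeps the loop serial below the threshold.
#ifdef _OPENMP
    #pragma omp parallel for if(n >= parallel_threshold)
#endif
    for (long i = 0; i < n; ++i) {
        for (long j = 0; j < n; ++j) {
            double sum = 0.0;
            for (long k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The test against the threshold is a single integer comparison per call, which is the cost Riccardo's 2*2 case would still pay.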
>
>
> On 03/10/2016 03:14 PM, palik imre wrote:
>> Forget OpenMP for the time being.
>>
>> For this gemm implementation to scale beyond a socket boundary,
>> one needs OpenMP version 4.0 or better, and some way to determine
>> the processor topology.
>>
>> I have no idea how to determine the processor topology in a
>> portable way (or any way outside Linux). If somebody is willing
>> to help with this, then we can pull it off. Otherwise I cannot
>> do it.
>>
>> BTW, some measurements of the basic idea on a 2*8 core Ivy Bridge:
>>
>> Single threaded:
>> # m    uBLAS: t1    MFLOPS   Blocked: t2  MFLOPS   Diff nrm1
>> 100    0.000231629  8634.5   0.000179467  11144.1  0          0.000139046  14383.7  0
>> 200    0.00146996   10884.6  0.000969961  16495.5  0          0.000941657  16991.3  0
>> 300    0.00366689   14726.4  0.00328796   16423.6  0          0.00327107   16508.4  0
>> 400    0.00909807   14068.9  0.00871394   14689.1  0          0.0087016    14709.9  0
>> 500    0.017248     14494.4  0.0167666    14910.6  0          0.0167691    14908.3  0
>> 600    0.0289742    14909.8  0.0282574    15288    0          0.0282521    15290.9  0
>> 700    0.0462171    14843    0.0445143    15410.8  0          0.044418     15444.2  0
>> 800    0.0678457    15093.1  0.0645271    15869.3  0          0.065332     15673.8  0
>> 900    0.0998233    14605.8  0.0974452    14962.3  0          0.0977101    14921.7  0
>> 1000   0.133766     14951.4  0.131982     15153.6  0          0.132004     15151.1  0
>> 1100   0.177154     15026.5  0.175536     15165    0          0.175679     15152.7  0
>> 1200   0.226595     15251.9  0.223401     15469.9  0          0.223272     15478.9  0
>> 1300   0.281128     15629.9  0.279018     15748.1  0          0.279577     15716.6  0
>> 1400   0.358121     15324.4  0.358763     15297    0          0.356956     15374.5  0
>> 1500   0.441216     15298.6  0.44028      15331.2  0          0.440745     15315    0
>> 1600   0.539017     15198    0.540438     15158.1  0          0.538543     15211.4  0
>> 1700   0.631606     15557.2  0.63106      15570.6  0          0.632818     15527.4  0
>> 1800   0.740764     15745.9  0.740261     15756.6  0          0.740176     15758.4  0
>> 1900   0.885414     15493.3  0.885684     15488.6  0          0.885026     15500.1  0
>> 2000   1.04361      15331.3  1.04518      15308.4  0          1.04733      15277    0
>>
>> 8 threads on one socket:
>> # m    uBLAS: t1    MFLOPS   Blocked: t2  MFLOPS   Diff nrm1
>> 100    0.0013259    1508.4   0.000470025  4255.09  0          0.0004592    4355.4   0
>> 200    0.000638351  25064.6  0.000245947  65054.7  0          0.000244369  65474.8  0
>> 300    0.000895587  60295.6  0.000792129  68170.7  0          0.000760058  71047.2  0
>> 400    0.00231167   55371.3  0.00189499   67546.4  0          0.00190057   67348.1  0
>> 500    0.003532     70781.4  0.00312475   80006.4  0          0.00313642   79708.8  0
>> 600    0.00603096   71630.4  0.00529135   81642.7  0          0.00535545   80665.5  0
>> 700    0.00961982   71311.1  0.00780231   87922.7  0          0.00788078   87047.2  0
>> 800    0.0143081    71567.8  0.0119397    85764.6  0          0.0123294    83053.5  0
>> 900    0.0195524    74568.7  0.0175708    82978.6  0          0.0178417    81718.6  0
>> 1000   0.0245773    81376    0.023048     86775.3  0          0.0233121    85792.3  0
>> 1100   0.0321172    82883.9  0.0305309    87190.5  0          0.0307586    86544.8  0
>> 1200   0.0388261    89012.2  0.0372826    92697.5  0          0.0376794    91721.1  0
>> 1300   0.0479156    91702.8  0.046001     95519.7  0          0.0464177    94662.1  0
>> 1400   0.0590871    92879.8  0.0571302    96061.3  0          0.0597637    91828.3  0
>> 1500   0.0717213    94114.3  0.0697929    96714.7  0          0.0699795    96456.8  0
>> 1600   0.0844559    96997.4  0.084161     97337.3  0          0.0837095    97862.3  0
>> 1700   0.0980906    100173   0.0979744    100291   0          0.0979657    100300   0
>> 1800   0.116091     100473   0.115783     100740   0          0.115882     100654   0
>> 1900   0.135347     101354   0.134988     101624   0          0.136053     100828   0
>> 2000   0.162126     98688.5  0.161714     98940.2  0          0.162389     98529.1  0
>>
>> 16 threads on 2 sockets:
>> # m    uBLAS: t1    MFLOPS   Blocked: t2  MFLOPS   Diff nrm1
>> 100    0.00402575   496.801  0.00264263   756.823  0          0.00271382   736.969  0
>> 200    0.00266775   5997.56  0.00222018   7206.64  0          0.00172627   9268.53  0
>> 300    0.0037985    14216.1  0.00345001   15652.1  0          0.00337945   15978.9  0
>> 400    0.00516891   24763.4  0.00502001   25498    0          0.00446124   28691.6  0
>> 500    0.00761118   32846.4  0.00615818   40596.4  0          0.00594556   42048.2  0
>> 600    0.0116084    37214.5  0.00914148   47257.1  0          0.0091653    47134.3  0
>> 700    0.012626     54332.5  0.00967706   70889.3  0          0.0105016    65323.2  0
>> 800    0.0161321    63476.1  0.01418      72214.5  0          0.0138772    73790.1  0
>> 900    0.0169242    86148.9  0.0155385    93831.5  0          0.0152458    95632.8  0
>> 1000   0.0208066    96123.5  0.0171705    116479   0          0.0178461    112069   0
>> 1100   0.0399168    66688.7  0.0388316    68552.4  0          0.0383423    69427.2  0
>> 1200   0.0448104    77124.9  0.0422857    81729.8  0          0.0425542    81214.1  0
>> 1300   0.0509146    86301.4  0.0469193    93650.2  0          0.047686     92144.5  0
>> 1400   0.0558425    98276.3  0.0579064    94773.7  0          0.0516464    106261   0
>> 1500   0.0634977    106303   0.0638868    105656   0          0.0598486    112785   0
>> 1600   0.0703613    116428   0.0725247    112955   0          0.0683293    119890   0
>> 1700   0.0803688    122261   0.0790288    124334   0          0.07428      132283   0
>> 1800   0.0860819    135499   0.089678     130065   0          0.0819196    142383   0
>> 1900   0.098444     139348   0.0951359    144194   0          0.090089     152272   0
>> 2000   0.107054     149457   0.120315     132985   0          0.102137     156652   0
>>
>> all three versions are gemm-based.
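
For reading the tables: the MFLOPS columns appear consistent with counting 2*m^3 floating-point operations per m x m product and dividing by the measured time. That operation count is an inference from the numbers, not something stated in the thread, and the helper name mflops is hypothetical. A small sketch of the arithmetic:

```cpp
#include <cassert>
#include <cstddef>

// Assumption: an m x m by m x m product costs 2*m^3 floating-point
// operations (m^3 multiplies plus m^3 adds).
inline double mflops(std::size_t m, double seconds) {
    assert(seconds > 0.0);
    const double flops = 2.0 * static_cast<double>(m) *
                         static_cast<double>(m) * static_cast<double>(m);
    return flops / seconds / 1e6;
}
```

For the first single-threaded row, mflops(100, 0.000231629) gives about 8634.5, which matches the reported figure to the printed precision.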
>>
>> On Wednesday, 9 March 2016, 14:43, Nasos Iliopoulos
>> <nasos_i_at_[hidden]> <mailto:nasos_i_at_[hidden]> wrote:
>>
>>
>> I think the matrix abstraction the way it is now is ok. It would
>> be confusing to have a large matrix class.
>>
>> My current thinking is that we should have a two tier switch. One
>> that detects that openmp is enabled and one that enables
>> parallelization based on user preference:
>>
>> #if defined(_OPENMP) && defined(BOOST_UBLAS_PARALLEL)
>> // parallel code (with runtime switch if needed) goes here
>> #else
>> #ifdef BOOST_UBLAS_PARALLEL
>> #warning "OPENMP not present. boost::ublas parallel mode not enabled."
>> #endif
>> // serial code goes here
>> #endif
>>
>> to enable parallel mode:
>> gcc myfile.cpp -o myexe -fopenmp -DBOOST_UBLAS_PARALLEL
>>
>> the following does not enable ublas parallel mode but lets the
>> user's openmp code run:
>> gcc myfile.cpp -o myexe -fopenmp
>>
>> this will not enable parallelization at all:
>> gcc myfile.cpp -o myexe
>>
>>
>> essentially _OPENMP is defined when you pass the -fopenmp
>> argument to gcc and I suppose in all other compilers that support
>> the standard.
>>
>> * One downside of this approach is that temporarily disabling
>> ublas parallel mode would need some hocus-pocus.
>>
>> * I think that this approach is better than nothing, and if you
>> can think of a clearer and/or more efficient way please voice it.
>>
>> * I would favor the std::thread approach, but thinking about it
>> again I believe we will need to introduce state so that we have a
>> facility to define the number of threads. We could use getenv
>> (http://en.cppreference.com/w/cpp/utility/program/getenv) but
>> this wouldn't allow for changes after execution has started. On
>> the other hand openmp has state and the user can use it deliberately.
>>
>> -Nasos
>>
>>
>>
>>
>> _______________________________________________ ublas mailing
>> list ublas_at_[hidden] <mailto:ublas_at_[hidden]>
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to:nasos_i_at_[hidden] <mailto:nasos_i_at_[hidden]>
>
>
>
> --
>
> Riccardo Rossi
>
> PhD, Civil Engineer
>
>
> member of the Kratos Team: www.cimne.com/kratos
> <http://www.cimne.com/kratos>
>
> lecturer at Universitat Politècnica de Catalunya, BarcelonaTech (UPC)
>
> Research fellow at International Center for Numerical Methods in
> Engineering (CIMNE)
>
>
> C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
>
> 08034 Barcelona Spain www.cimne.com <http://www.cimne.com> -
>
> T.(+34) 93 401 56 96 skype: *rougered4*
>