Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-24 02:50:30

Next message: Oswin Krause: "Re: [ublas] Matrix multiplication performance"
Previous message: palik imre: "[ublas] Matrix multiplication performance"
Maybe in reply to: palik imre: "[ublas] Matrix multiplication performance"
Next in thread: Riccardo Rossi: "Re: [ublas] Matrix multiplication performance"
Reply: Riccardo Rossi: "Re: [ublas] Matrix multiplication performance"

Hi Michael,

I had a look at your AVX micro-kernel on my old AMD box. Congratulations, it is twice as fast as what I managed to get out of my optimised C kernel.

I wonder if I can catch up with that using newer gcc + SIMD arrays or intrinsics.

Cheers,

Imre

--------------------------------------------
On Fri, 22/1/16, Michael Lehn <michael.lehn_at_[hidden]> wrote:

Subject: Re: [ublas] Matrix multiplication performance
To: "palik imre" <imre_palik_at_[hidden]>
Cc: "ublas mailing list" <ublas_at_[hidden]>
Date: Friday, 22 January, 2016, 21:27

Wow Imre!

Ok, that is actually a significant difference :-)

I have just added a new version to my site. Unfortunately the computer
I used for creating the page does not have posix_memalign(). So I had
to hack my own function for that. But I think for a proof of concept it
will do.
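
For readers without posix_memalign(), here is a minimal sketch of what such a hand-rolled replacement could look like. It is not the function from the posted code: the names aligned_malloc and aligned_free are hypothetical, and the approach (over-allocate with malloc and remember the original pointer just in front of the aligned block) is only one common way to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper, not taken from the posted code.
// Returns `size` bytes aligned to `alignment`. Like posix_memalign(),
// `alignment` must be a power of two and a multiple of sizeof(void *).
void *aligned_malloc(std::size_t alignment, std::size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    assert(alignment % sizeof(void *) == 0);

    // Reserve room for the alignment slack plus one pointer that holds
    // the address malloc() actually returned.
    void *raw = std::malloc(size + alignment + sizeof(void *));
    if (!raw) {
        return nullptr;
    }
    std::uintptr_t start =
        reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
    std::uintptr_t aligned =
        (start + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);

    // Store the original pointer directly in front of the aligned block.
    reinterpret_cast<void **>(aligned)[-1] = raw;
    return reinterpret_cast<void *>(aligned);
}

// Releases a block obtained from aligned_malloc(); nullptr is ignored.
void aligned_free(void *p)
{
    if (p) {
        std::free(reinterpret_cast<void **>(p)[-1]);
    }
}
```

A packing buffer would then be requested with something like aligned_malloc(64, n * sizeof(double)), where 64 bytes is an assumed cache line size, and released with aligned_free().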

But most of all I also added micro-kernels that are (fairly) optimised
for AVX and FMA. The micro-kernel for AVX requires MR=4, NR=8. For FMA
it requires MR=4, NR=16. Otherwise the reference implementation gets
selected. For all parameters the default values can now be overridden
when compiling, e.g.

    g++ -O3 -Wall -std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc

The optimised micro-kernels are only included when compiled with
-DHAVE_AVX or -DHAVE_FMA.
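
The override and selection rules described above can be sketched roughly as follows. This is an illustration only, not the contents of gemm.hpp: the default value 256, the macro names BS_D_MR and BS_D_NR (formed by analogy with BS_D_MC) and the function select_kernel are assumptions.

```cpp
// Defaults that a -D flag on the command line can replace.
// The values are assumed; the real defaults are not shown in this thread.
#ifndef BS_D_MC
#define BS_D_MC 256
#endif
#ifndef BS_D_MR
#define BS_D_MR 4
#endif
#ifndef BS_D_NR
#define BS_D_NR 8
#endif

enum class Kernel { Reference, Avx, Fma };

// Hypothetical selection rule following the description in the mail:
// the AVX kernel needs MR=4, NR=8, the FMA kernel needs MR=4, NR=16,
// anything else falls back to the reference implementation.
constexpr Kernel select_kernel(bool have_avx, bool have_fma, int mr, int nr)
{
    return (have_fma && mr == 4 && nr == 16) ? Kernel::Fma
         : (have_avx && mr == 4 && nr == 8)  ? Kernel::Avx
                                             : Kernel::Reference;
}

// MC has to be divisible by MR (see the block size discussion in the thread).
static_assert(BS_D_MC % BS_D_MR == 0, "MC must be a multiple of MR");
```

With this pattern, compiling with -DBS_D_MC=512 simply replaces the default, and a mismatched MR/NR pair silently selects the reference kernel.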

I put all this stuff here

    http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html

also the tar-ball

    http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz

contains all required files:

    session2/avx.hpp
    session2/fma.hpp
    session2/gemm.hpp
    session2/matprod.cc

Of course I would be interested how all this performs on other
platforms.

Cheers,

Michael

On 22 Jan 2016, at 16:20, palik imre <imre_palik_at_[hidden]> wrote:

> Hi All,
>
> In the meantime I enabled avx2 ...
>
> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6 GFLOPS
> So it is at ~25%
>
> Compiler is gcc 4.8.3
>
> vanilla run:
> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
> $ ./matprod
> #   m     n     k    uBLAS: t1   MFLOPS  Blocked: t2   MFLOPS    Diff nrm1
>   100   100   100  0.000350402  5707.73   0.00116104  1722.59            0
>   200   200   200   0.00445094  3594.74   0.00819996  1951.23            0
>   300   300   300    0.0138515  3898.49    0.0266515  2026.15  1.06987e-06
>   400   400   400    0.0266447  4803.96    0.0613506  2086.37  4.01475e-06
>   500   500   500    0.0424372  5891.06     0.119345  2094.77  8.99605e-06
>   600   600   600    0.0724648  5961.51     0.203187  2126.12  1.63618e-05
>   700   700   700     0.115464  5941.25     0.325834  2105.36  2.69547e-05
>   800   800   800     0.173003  5918.98     0.480655  2130.43  4.09449e-05
>   900   900   900     0.248077   5877.2     0.689972  2113.13  5.87376e-05
>  1000  1000  1000      0.33781  5920.49     0.930591  2149.17  8.16264e-05
>  1100  1100  1100       0.5149  5169.93      1.25507     2121  0.000108883
>  1200  1200  1200     0.670628  5153.38      1.62732  2123.74  0.000141876
>  1300  1300  1300     0.852692  5153.09      2.06708  2125.71  0.000180926
>  1400  1400  1400      1.06695  5143.65      2.56183  2142.22  0.000225975
>  1500  1500  1500       1.3874   4865.2      3.16532  2132.49  0.000278553
>  1600  1600  1600      1.77623  4612.03       3.8137  2148.05  0.000338106
>  1700  1700  1700       2.3773  4133.26      4.56665  2151.69  0.000404458
>  1800  1800  1800      3.06381  3807.03      5.40317  2158.73   0.00048119
>  1900  1900  1900       3.9039  3513.92      6.37295  2152.53  0.000564692
>  2000  2000  2000      4.79166  3339.13      7.43399  2152.28  0.000659714
>  2100  2100  2100      6.04946  3061.76      8.62429  2147.65  0.000762223
>  2200  2200  2200      7.39085   2881.4      9.86237  2159.32  0.000875624
>  2300  2300  2300      9.00453  2702.42      11.2513  2162.78   0.00100184
>  2400  2400  2400      10.3952  2659.68      12.7491  2168.62   0.00113563
>  2500  2500  2500      12.2283  2555.55      14.4615  2160.92   0.00128336
>  2600  2600  2600      13.8912  2530.51      16.1965  2170.34   0.00144304
>  2700  2700  2700       15.391  2557.72      18.1998  2162.99   0.00161411
>  2800  2800  2800      17.5673  2499.19      20.2171  2171.63   0.00180035
>  2900  2900  2900      19.4621  2506.31      22.5482  2163.28   0.00199765
>  3000  3000  3000      21.4506  2517.42      24.9477  2164.53   0.00221028
>  3100  3100  3100        23.71  2512.95      27.5144  2165.48   0.00243877
>  3200  3200  3200      25.9051  2529.85      30.2816  2164.22   0.00267766
>  3300  3300  3300      28.1949  2549.18       33.176  2166.45   0.00293379
>  3400  3400  3400      30.7235  2558.56      36.0156  2182.61    0.0032087
>  3500  3500  3500      34.0419  2518.95      39.3929  2176.79   0.00349827
>  3600  3600  3600      37.0562  2518.12      42.7524  2182.62   0.00380447
>  3700  3700  3700      39.7885  2546.11      46.4748  2179.81   0.00412621
>  3800  3800  3800      43.6607  2513.56      50.2119  2185.62    0.0044694
>  3900  3900  3900      46.5104  2550.78      54.4822  2177.56   0.00482355
>  4000  4000  4000      50.6098  2529.15      58.7686  2178.03   0.00520289
>
> tuned run:
>
> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
> $ ./matprod2
> #   m     n     k    uBLAS: t1   MFLOPS  Blocked: t2   MFLOPS    Diff nrm1
>   100   100   100  0.000351671  5687.13  0.000316612  6316.88            0
>   200   200   200   0.00419531  3813.78   0.00159044  10060.1            0
>   300   300   300    0.0141153  3825.62   0.00421113  12823.2  1.07645e-06
>   400   400   400    0.0291599  4389.59   0.00858138    14916  4.00614e-06
>   500   500   500    0.0483492  5170.72    0.0166519  15013.3  8.96808e-06
>   600   600   600    0.0725783  5952.19    0.0279634  15448.7  1.63386e-05
>   700   700   700     0.113891  6023.29     0.043077    15925  2.69191e-05
>   800   800   800     0.171416  5973.79    0.0627796    16311  4.09782e-05
>   900   900   900     0.243677  5983.32    0.0922766  15800.3  5.88092e-05
>  1000  1000  1000     0.335158  5967.33     0.123339  16215.5  8.15988e-05
>  1100  1100  1100     0.515776  5161.15      0.16578  16057.5  0.000108991
>  1200  1200  1200     0.662706  5214.98     0.205989  16777.6  0.000141824
>  1300  1300  1300     0.845952  5194.15      0.27637    15899   0.00018111
>  1400  1400  1400      1.06712  5142.82     0.332118  16524.2  0.000225958
>  1500  1500  1500      1.38147  4886.11     0.409224  16494.6  0.000278265
>  1600  1600  1600      1.72238  4756.21     0.492314  16639.8  0.000338095
>  1700  1700  1700      2.38508  4119.77     0.603566  16279.9  0.000404362
>  1800  1800  1800      3.12034  3738.05     0.717409  16258.5  0.000481575
>  1900  1900  1900      3.93668  3484.66     0.824933  16629.2  0.000564727
>  2000  2000  2000      4.76038  3361.07     0.941643  16991.6   0.00065862
>  2100  2100  2100      5.90627  3135.99      1.12226  16504.2  0.000762307
>  2200  2200  2200      7.26419  2931.64      1.28213  16609.9  0.000876699
>  2300  2300  2300      8.88171  2739.79      1.45247  16753.5   0.00100222
>  2400  2400  2400      10.4956  2634.26      1.62705  16992.7   0.00113566
>  2500  2500  2500       11.913  2623.18      1.87499  16666.7   0.00128371
>  2600  2600  2600      13.7057  2564.77       2.1156  16615.6   0.00144259
>  2700  2700  2700      15.5959  2524.13      2.33957  16826.1   0.00161501
>  2800  2800  2800      17.1121  2565.67      2.57445  17053.8   0.00179901
>  2900  2900  2900      19.4167  2512.16      2.92445  16679.4   0.00199764
>  3000  3000  3000      21.3239  2532.37      3.18891  16933.7   0.00220999
>  3100  3100  3100      23.5049  2534.88       3.5305  16876.4   0.00243845
>  3200  3200  3200      25.7362  2546.45      3.81708  17169.1   0.00267581
>  3300  3300  3300      28.4467  2526.62      4.25869    16877   0.00293513
>  3400  3400  3400      30.4607  2580.63      4.67999  16796.6   0.00320688
>  3500  3500  3500      33.7737  2538.96      5.04289  17004.1   0.00349667
>  3600  3600  3600      36.9633  2524.45        5.414  17235.3   0.00380237
>  3700  3700  3700      39.5153  2563.71      6.04875  16748.2   0.00412583
>  3800  3800  3800      42.9412  2555.68      6.48985  16910.1   0.00446785
>  3900  3900  3900      46.5282  2549.81      7.05844    16808   0.00482701
>  4000  4000  4000      50.2218  2548.69      7.42442  17240.4   0.00520272
>
> As the generated machine code is completely different, I guess gcc
> notices the aligned alloc, and uses the alignment information for
> optimisation.
>
> Cheers,
>
> Imre
>
>
> On Friday, 22 January 2016, 15:09, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>
>
> Hi Imre,
>
> thanks for running the benchmarks. Of course you are right that using
> aligned memory for the buffers improves performance. I also did not
> really put any effort into optimising the parameters MC, NC, KC, MR
> and NR. I will compare different variants and report them on the
> website
>
> Â Â Â http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>
> I modified my benchmark program such that it also computes the FLOPS as
>
>     FLOPS = 2*m*n*k/time_elapsed
>
> See
>
> Â Â Â http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
>
> Could you re-run your benchmarks and post the different MFLOPS you
> get? That is important for actually tuning things. On my machine my
> code only reaches 20% of the peak performance (about 5 GFLOPS instead
> of 25.6 GFLOPS). So a speedup of 2.5 would be impressive but still far
> from peak performance.
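
The two figures used in this exchange, MFLOPS from a timing and the fraction of peak performance, can be computed with a few lines. The sketch below assumes the formula FLOPS = 2*m*n*k/time quoted above; the function names mflops and peak_fraction are made up for illustration and are not taken from matprod.cc.

```cpp
#include <cassert>
#include <cstddef>

// MFLOPS for an m x n x k matrix product that took `seconds`.
// The product needs 2*m*n*k floating point operations
// (one multiplication and one addition per inner-product term).
double mflops(std::size_t m, std::size_t n, std::size_t k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * static_cast<double>(m) * static_cast<double>(n)
               * static_cast<double>(k) / seconds / 1e6;
}

// Fraction of the theoretical peak that was reached.
// Both arguments must use the same unit (e.g. both in GFLOPS).
double peak_fraction(double achieved, double peak)
{
    assert(peak > 0.0);
    return achieved / peak;
}
```

Applied to numbers from this thread: a 100 x 100 x 100 product in 0.000350402 s gives about 5707.73 MFLOPS, which matches the uBLAS column of the benchmark output; 5 GFLOPS against a 25.6 GFLOPS peak is about 20%; and 17240.4 MFLOPS against 69.6 GFLOPS is about 25%.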
>
> Cheers,
>
> Michael
>
>
> On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:
>
>> Sorry for posting more or less the same thing twice. I got confused
>> with javascript interfaces.
>>
>> It seems I also forgot to enable avx for my last measurements. With
>> that + my blocking and alignment changes, performance according to my
>> tests is something like 250% higher than running Michael's original
>> code (with avx).
>>
>> Cheers,
>>
>> Imre
>>
>>
>> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
>>
>> Hi Michael,
>>
>> your block sizes are far from optimal. MR & NR should be multiples of
>> the L1 cache line size (i.e. 16 for double on Intel). Also, the blocks
>> should be allocated aligned to L1 cache lines (e.g., via
>> posix_memalign()).
>>
>> This alone brought something like 50% speedup for my square matrix
>> test.
>>
>> I will have a look at the other parameters + the whole thing via perf
>> during the weekend.
>>
>> Cheers,
>>
>> Imre
>>
>>
>>
>> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]" <ublas-request_at_[hidden]> wrote:
>>
>> Subject: Re: [ublas] Matrix multiplication performance
>> Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B_at_[hidden]>
>> Content-Type: text/plain; charset="windows-1252"
>>
>> Hi Nasos,
>>
>> first of all I don't want to take undue credit and want to point out
>> that this is not my algorithm. It is based on
>>
>>     http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>
>>     https://github.com/flame/blis
>>
>> For a few cores (4-8) it can easily be made multithreaded. For
>> many-cores like Intel Xeon Phi this is a bit more sophisticated but
>> still not too hard. The demo I posted does not use micro-kernels that
>> exploit SSE, AVX or FMA instructions. With such micro-kernels the
>> matrix product is on par with Intel MKL. Just like BLIS. For my
>> platforms I wrote my own micro-kernels but the interface of function
>> ugemm is compatible with BLIS.
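
To make the term "micro-kernel" concrete, here is a rough sketch of a plain reference micro-kernel in the BLIS style: it updates an MR x NR block of C from a packed MR x kc panel of A and a packed kc x NR panel of B. The name ugemm_ref, the parameter list and the packing layout are assumptions for illustration; this is not the ugemm from the posted code and not the BLIS signature.

```cpp
#include <cstddef>

// Hypothetical reference micro-kernel: C <- beta*C + alpha*A*B.
// A: packed MR x kc panel, stored as kc consecutive columns of length MR.
// B: packed kc x NR panel, stored as kc consecutive rows of length NR.
// C: MR x NR block addressed as C[i*incRowC + j*incColC].
template <std::size_t MR, std::size_t NR>
void ugemm_ref(std::size_t kc, double alpha,
               const double *A, const double *B,
               double beta,
               double *C, std::size_t incRowC, std::size_t incColC)
{
    double AB[MR * NR] = {};  // local accumulator for the MR x NR result

    // kc rank-1 updates: AB += A(:,l) * B(l,:)
    for (std::size_t l = 0; l < kc; ++l) {
        for (std::size_t j = 0; j < NR; ++j) {
            for (std::size_t i = 0; i < MR; ++i) {
                AB[i + j * MR] += A[i] * B[j];
            }
        }
        A += MR;
        B += NR;
    }

    // Scale and write back; beta == 0 overwrites C without reading it.
    for (std::size_t j = 0; j < NR; ++j) {
        for (std::size_t i = 0; i < MR; ++i) {
            double &c = C[i * incRowC + j * incColC];
            c = (beta == 0.0 ? 0.0 : beta * c) + alpha * AB[i + j * MR];
        }
    }
}
```

An optimised kernel keeps the same contract but holds the MR x NR accumulator in vector registers, which is why the SIMD variants are tied to fixed MR and NR values.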
>>
>> Maybe you could help me to integrate your code in the benchmark
>> example I posted above.
>>
>> About Blaze: Do they have their own implementation of a matrix-matrix
>> product? It seems to require a tuned BLAS implementation ("Otherwise
>> you get only poor performance") for the matrix-matrix product. IMHO
>> they have only tuned the "easy" stuff like BLAS Level 1 and Level 2.
>> In that case it makes more sense to compare the performance with the
>> actual underlying GEMM implementation. But if I am wrong, let me know.
>>
>> About the block size: In my experience you get better performance if
>> you choose them dynamically at runtime depending on the problem size.
>> Depending on the architecture you can just specify ranges like
>> 256 - 384 for blocking factor MC. In my code it then also needs to
>> satisfy the restriction that it can be divided by factor MR. I know
>> that doing things at compile time is a C++ fetish. But the runtime
>> overhead is negligible and having blocks of similar sizes easily pays
>> off.
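
One way to pick such a blocking factor at runtime is sketched below. The mail does not spell out the rule, so this is an assumption: use the fewest blocks that respect an upper bound, split m evenly among them, then round up to a multiple of MR. The name choose_mc is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns a blocking factor MC <= mc_max that is a
// multiple of mr and splits m rows into blocks of similar size.
// mc_max itself must be a multiple of mr.
std::size_t choose_mc(std::size_t m, std::size_t mc_max, std::size_t mr)
{
    assert(mr > 0 && mc_max >= mr && mc_max % mr == 0);
    if (m == 0) {
        return mr;
    }
    std::size_t blocks = (m + mc_max - 1) / mc_max;  // fewest blocks that fit
    std::size_t mc = (m + blocks - 1) / blocks;      // even split, rounded up
    mc = (mc + mr - 1) / mr * mr;                    // next multiple of mr
    return mc;  // cannot exceed mc_max because mc_max is a multiple of mr
}
```

For m = 1000, an upper bound of 384 and MR = 4 this gives MC = 336, i.e. blocks of 336, 336 and 328 rows instead of 384, 384 and 232 with a fixed MC = 384. The lower end of a range such as 256 - 384 is not enforced here; for small m the block simply shrinks to fit.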
>>
>> Cheers,
>>
>> Michael
>>
