Boost logo

Ublas :

Subject: Re: [ublas] considering the speed of axpy_prod
From: Ungermann, Jörn (j.ungermann_at_[hidden])
Date: 2012-01-02 10:39:54

Dear Oswin,

the matrix-matrix multiplication is not really optimized.
Please refer to my mail from 2010 for details

The performance of the product kernels really depends on the majority of all three involved matrices and becomes *really* complicated, once you take into account all flavours of sparse matrices.
It is ridicoulously easy to program a matrix-matrix-multiplication routine that is fast for any given, specific combination of involved matrices, but really, really, ahrd to be performant for a wide range of types and combination with a restricted set of kernels.

We went forward and implemented cache-optimal, SSE using routines for our common matrix-vector / matrix-matrix product types (about 2000 LoC, quite fun to do). But this stuff wouldn't fit into uBLAS.

Joern Ungermann

> -----Original Message-----
> From: ublas-bounces_at_[hidden] [mailto:ublas-
> bounces_at_[hidden]] On Behalf Of Oswin Krause
> Sent: Montag, 2. Januar 2012 15:14
> To: ublas_at_[hidden]
> Subject: [ublas] considering the speed of axpy_prod
> Hi list,
> I think my first mail to this mailing list was, that axpy_prod had a
> bad performance in a matrix-vector setup. I thought at first, that this
> was only subject to the matrix-vector multiplication. But this seems to
> be an inherent problem of the implementation of the axpy_prod.
> At the moment I'm implementing a matrix-multiplication like operation,
> so it was natural to benchmark it against ublas in the matrix-
> multiplication setup. The core operation is prod(A,trans(B)), but
> axpy_prod acted weird on that, so i took the time to Benchmark it:
> const std::size_t matRows=512;
> const std::size_t matColumns = 512;
> const std::size_t numVectors = 400;
> const std::size_t iterations= 125;
> blas::matrix<double,blas::row_major> testResult(numVectors ,matRows);
> blas::matrix<double,blas::row_major> argument(numVectors ,matColumns);
> blas::matrix<double,blas::row_major> matrix(matRows,matColumns);
> //initialize these matrices somehow...
> //...
> blas::matrix<double,blas::row_major> matrix2=trans(matrix);
> //Benchmark for versions 1-5
> double start=Timer::now();//some timer, replace with whatever you have
> for(std::size_t i = 0; i != iterations; ++i){
> //axpy_prod(argument,matrix2,testResult,false);//1
> //axpy_prod(argument,trans(matrix),testResult,false);//2
> //noalias(testResult)+=prod(argument,matrix2);//3
> //noalias(testResult)+=prod(argument,trans(matrix));//4
> //5 are two lines, I multiply with i to prevent optimization of
> the product
> blas::matrix<double,blas::column_major> matrix3=matrix;
> noalias(testResult)+=prod(argument,matrix3);
> }
> double end=Timer::now();
> std::cout<<end-start<<std::endl;
> //place here some code to prevent the compiler to optimize everything
> out
> I'm benchmarking on an older Intel Core2 E8400_at_3GHZ with 64KB L1 and
> 6MB
> L2 Cache. The numbers where chosen such that the matrices never
> exceeded L2. So I'm not targetting really big matrices, which results
> in an easier problem for matrix products.
> The compiler was GCC 4.6.2 with options -O3 -DBOOST_UBLAS_NDEBUG -
> Here are the results(from left to right: numVectors=200,400,800):
> matRows=matColumns=512
> (1) 8s 17s 34s
> (2) 25s 50s 100s
> (3) 24s 48s 96s
> (4) 7s 14s 28s
> (5) 7s 14s 28s (+ ~0.2s compared to 4)
> matRows=matColumns=256
> (1) 2.3s 4.5s 9s
> (2) 3.4s 6.8s 13.7s
> (3) 2.8s 5.7s 11.5s
> (4) 1.8 3.5 7s
> (5) 1.8 3.6 7s
> As written above, 1+2 are axpy_prod 3-5 are prod. There are two cases
> for 1-4, 1+4 are fast and 2+3 are slow. It turns out, that these cases
> are orthogonal. axpy_prod is fast, when both arguments are row major(1)
> and prod is fast, when the first argument is row major, the second
> column major(4).
> The not so fun thing is, that the best case of prod is faster than the
> best case of axpy_prod. This is why I added case (5). it turns out,
> even in the not so bad cases for axpy_prod copying the whole matrix
> into column major format and than calculating prod is actually faster
> than the call to axpy_prod(maybe there is a small skew in the benchmark
> when the compiler was smarter than expected...).
> I don't think, that this is working as intended. I can understand, why
> prod() must be slow in ublas when the arguments are both row major.
> What I can not understand is, that axpy_prod is that slow. Is it maybe
> optimized for different types of matrices?
> I would be happy if someone would find a flaw in my benchmark. I'm
> still new to this whole benchmark thing :)
> Greetings,
> Oswin
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> Sent to: j.ungermann_at_[hidden]

Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt

Kennen Sie schon unsere app?