Boost logo

Ublas :

Subject: Re: [ublas] considering the speed of axpy_prod
From: Ungermann, Jörn (j.ungermann_at_[hidden])
Date: 2012-01-02 10:39:54


Dear Oswin,

the matrix-matrix multiplication is not really optimized.
Please refer to my mail from 2010 for details
http://lists.boost.org/MailArchives/ublas/2010/03/4091.php

The performance of the product kernels really depends on the majority of all three involved matrices and becomes *really* complicated, once you take into account all flavours of sparse matrices.
It is ridicoulously easy to program a matrix-matrix-multiplication routine that is fast for any given, specific combination of involved matrices, but really, really, ahrd to be performant for a wide range of types and combination with a restricted set of kernels.

We went forward and implemented cache-optimal, SSE using routines for our common matrix-vector / matrix-matrix product types (about 2000 LoC, quite fun to do). But this stuff wouldn't fit into uBLAS.

Regards,
Joern Ungermann

> -----Original Message-----
> From: ublas-bounces_at_[hidden] [mailto:ublas-
> bounces_at_[hidden]] On Behalf Of Oswin Krause
> Sent: Montag, 2. Januar 2012 15:14
> To: ublas_at_[hidden]
> Subject: [ublas] considering the speed of axpy_prod
>
> Hi list,
>
> I think my first mail to this mailing list was, that axpy_prod had a
> bad performance in a matrix-vector setup. I thought at first, that this
> was only subject to the matrix-vector multiplication. But this seems to
> be an inherent problem of the implementation of the axpy_prod.
>
> At the moment I'm implementing a matrix-multiplication like operation,
> so it was natural to benchmark it against ublas in the matrix-
> multiplication setup. The core operation is prod(A,trans(B)), but
> axpy_prod acted weird on that, so i took the time to Benchmark it:
>
>
> const std::size_t matRows=512;
> const std::size_t matColumns = 512;
> const std::size_t numVectors = 400;
> const std::size_t iterations= 125;
>
> blas::matrix<double,blas::row_major> testResult(numVectors ,matRows);
> blas::matrix<double,blas::row_major> argument(numVectors ,matColumns);
> blas::matrix<double,blas::row_major> matrix(matRows,matColumns);
>
> //initialize these matrices somehow...
> //...
> blas::matrix<double,blas::row_major> matrix2=trans(matrix);
>
> //Benchmark for versions 1-5
> double start=Timer::now();//some timer, replace with whatever you have
> for(std::size_t i = 0; i != iterations; ++i){
> //axpy_prod(argument,matrix2,testResult,false);//1
> //axpy_prod(argument,trans(matrix),testResult,false);//2
> //noalias(testResult)+=prod(argument,matrix2);//3
> //noalias(testResult)+=prod(argument,trans(matrix));//4
> //5 are two lines, I multiply with i to prevent optimization of
> the product
> blas::matrix<double,blas::column_major> matrix3=matrix;
> noalias(testResult)+=prod(argument,matrix3);
> }
> double end=Timer::now();
>
> std::cout<<end-start<<std::endl;
> //place here some code to prevent the compiler to optimize everything
> out
>
> I'm benchmarking on an older Intel Core2 E8400_at_3GHZ with 64KB L1 and
> 6MB
> L2 Cache. The numbers where chosen such that the matrices never
> exceeded L2. So I'm not targetting really big matrices, which results
> in an easier problem for matrix products.
>
> The compiler was GCC 4.6.2 with options -O3 -DBOOST_UBLAS_NDEBUG -
> DNDEBUG
>
> Here are the results(from left to right: numVectors=200,400,800):
>
> matRows=matColumns=512
> (1) 8s 17s 34s
> (2) 25s 50s 100s
> (3) 24s 48s 96s
> (4) 7s 14s 28s
> (5) 7s 14s 28s (+ ~0.2s compared to 4)
>
> matRows=matColumns=256
> (1) 2.3s 4.5s 9s
> (2) 3.4s 6.8s 13.7s
> (3) 2.8s 5.7s 11.5s
> (4) 1.8 3.5 7s
> (5) 1.8 3.6 7s
>
> As written above, 1+2 are axpy_prod 3-5 are prod. There are two cases
> for 1-4, 1+4 are fast and 2+3 are slow. It turns out, that these cases
> are orthogonal. axpy_prod is fast, when both arguments are row major(1)
> and prod is fast, when the first argument is row major, the second
> column major(4).
>
> The not so fun thing is, that the best case of prod is faster than the
> best case of axpy_prod. This is why I added case (5). it turns out,
> even in the not so bad cases for axpy_prod copying the whole matrix
> into column major format and than calculating prod is actually faster
> than the call to axpy_prod(maybe there is a small skew in the benchmark
> when the compiler was smarter than expected...).
>
> I don't think, that this is working as intended. I can understand, why
> prod() must be slow in ublas when the arguments are both row major.
> What I can not understand is, that axpy_prod is that slow. Is it maybe
> optimized for different types of matrices?
>
> I would be happy if someone would find a flaw in my benchmark. I'm
> still new to this whole benchmark thing :)
>
> Greetings,
> Oswin
>
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: j.ungermann_at_[hidden]

------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Kennen Sie schon unsere app? http://www.fz-juelich.de/app