Boost logo

Ublas :

Subject: Re: [ublas] considering the speed of axpy_prod
From: Oswin Krause (Oswin.Krause_at_[hidden])
Date: 2012-01-02 12:07:57

Hello Jörn

I know, that implementing fast matrix-matrix products is hard. But note,
that my point was not "uBLAS does not achieve MKL performance" but
"axpy_prod is not as fast as prod even though it suggests, it is". Let me
elaborate a bit on that:

axpy_prod is documented as "computes <expression> in an optimized
fashion." And later: "Up to now there are some specialisation for
compressed matrices that give a large speed up compared to prod." So the
least thing i can assume is, that whatever i do, I'm at least as fast as
prod. This might still be terribly slow, but that's okay. Achieving this
is not hard, using the following simple implementation:

template<class MatA,class MatB, class MatC>
axpy_prod(MatA const& A, MatB const& B, MatC const& C){

No 2000 lines of code, just this. Of course, there are a lot of cases
where this default implementation does not fit well. So the second
simplest implementation is something like:

    class MatA,class MatB, class MatC,
    class TagA, class TagB, class TagC
axpy_prod_impl(MatA const& A, MatB const& B, MatC const& C, TagA, TagB,TagC){

template<class MatA,class MatB, class MatC>
axpy_prod_impl(MatA const& A, MatB const& B, MatC const& C,row_major,
    //my special implementation
axpy_prod(MatA const& A, MatB const& B, MatC const& C){
    axpy_prod_impl(A,B,C,MatA::orientation(), MatB::orientation(),
I agree, that there are a lot of special cases for sparse matrices of all
kinds, etc. But i don't think that they are as important as the usual
dense case, which just should work.
There are 8 combinations of row/column major for 3 dense arguments. Taking
symmetries into account one can reduce the whole to max 3 implementations.
Two of these implementations can be lead back to the matrix-vector
axpy_prod. In fact you would only need to implement the version for all 3
matrices being row major and than 4 wrappers which are not more than 5
lines of code each.

The thing which annoys me most about this is, that the code compiles
silently, doesn't produce warnings or exceptions and is just slow. There
is no hint in the documentation of axpy_prod that it might be a bad idea
to use it, and there are indeed a lot of code snippets out there which
suggest to forget prod() and just use axpy_prod instead.

Again: for me it doesn't matter when axpy_prod is slow, as long as it is
the fastest one can achieve with the means of uBLAS.

Oswin Krause

> Dear Oswin,
> the matrix-matrix multiplication is not really optimized.
> Please refer to my mail from 2010 for details
> The performance of the product kernels really depends on the majority of
> all three involved matrices and becomes *really* complicated, once you
> take into account all flavours of sparse matrices.
> It is ridicoulously easy to program a matrix-matrix-multiplication routine
> that is fast for any given, specific combination of involved matrices, but
> really, really, ahrd to be performant for a wide range of types and
> combination with a restricted set of kernels.
> We went forward and implemented cache-optimal, SSE using routines for our
> common matrix-vector / matrix-matrix product types (about 2000 LoC, quite
> fun to do). But this stuff wouldn't fit into uBLAS.
> Regards,
> Joern Ungermann
>> -----Original Message-----
>> From: ublas-bounces_at_[hidden] [mailto:ublas-
>> bounces_at_[hidden]] On Behalf Of Oswin Krause
>> Sent: Montag, 2. Januar 2012 15:14
>> To: ublas_at_[hidden]
>> Subject: [ublas] considering the speed of axpy_prod
>> Hi list,
>> I think my first mail to this mailing list was, that axpy_prod had a
>> bad performance in a matrix-vector setup. I thought at first, that this
>> was only subject to the matrix-vector multiplication. But this seems to
>> be an inherent problem of the implementation of the axpy_prod.
>> At the moment I'm implementing a matrix-multiplication like operation,
>> so it was natural to benchmark it against ublas in the matrix-
>> multiplication setup. The core operation is prod(A,trans(B)), but
>> axpy_prod acted weird on that, so i took the time to Benchmark it:
>> const std::size_t matRows=512;
>> const std::size_t matColumns = 512;
>> const std::size_t numVectors = 400;
>> const std::size_t iterations= 125;
>> blas::matrix<double,blas::row_major> testResult(numVectors ,matRows);
>> blas::matrix<double,blas::row_major> argument(numVectors ,matColumns);
>> blas::matrix<double,blas::row_major> matrix(matRows,matColumns);
>> //initialize these matrices somehow...
>> //...
>> blas::matrix<double,blas::row_major> matrix2=trans(matrix);
>> //Benchmark for versions 1-5
>> double start=Timer::now();//some timer, replace with whatever you have
>> for(std::size_t i = 0; i != iterations; ++i){
>> //axpy_prod(argument,matrix2,testResult,false);//1
>> //axpy_prod(argument,trans(matrix),testResult,false);//2
>> //noalias(testResult)+=prod(argument,matrix2);//3
>> //noalias(testResult)+=prod(argument,trans(matrix));//4
>> //5 are two lines, I multiply with i to prevent optimization of
>> the product
>> blas::matrix<double,blas::column_major> matrix3=matrix;
>> noalias(testResult)+=prod(argument,matrix3);
>> }
>> double end=Timer::now();
>> std::cout<<end-start<<std::endl;
>> //place here some code to prevent the compiler to optimize everything
>> out
>> I'm benchmarking on an older Intel Core2 E8400_at_3GHZ with 64KB L1 and
>> 6MB
>> L2 Cache. The numbers where chosen such that the matrices never
>> exceeded L2. So I'm not targetting really big matrices, which results
>> in an easier problem for matrix products.
>> The compiler was GCC 4.6.2 with options -O3 -DBOOST_UBLAS_NDEBUG -
>> Here are the results(from left to right: numVectors=200,400,800):
>> matRows=matColumns=512
>> (1) 8s 17s 34s
>> (2) 25s 50s 100s
>> (3) 24s 48s 96s
>> (4) 7s 14s 28s
>> (5) 7s 14s 28s (+ ~0.2s compared to 4)
>> matRows=matColumns=256
>> (1) 2.3s 4.5s 9s
>> (2) 3.4s 6.8s 13.7s
>> (3) 2.8s 5.7s 11.5s
>> (4) 1.8 3.5 7s
>> (5) 1.8 3.6 7s
>> As written above, 1+2 are axpy_prod 3-5 are prod. There are two cases
>> for 1-4, 1+4 are fast and 2+3 are slow. It turns out, that these cases
>> are orthogonal. axpy_prod is fast, when both arguments are row major(1)
>> and prod is fast, when the first argument is row major, the second
>> column major(4).
>> The not so fun thing is, that the best case of prod is faster than the
>> best case of axpy_prod. This is why I added case (5). it turns out,
>> even in the not so bad cases for axpy_prod copying the whole matrix
>> into column major format and than calculating prod is actually faster
>> than the call to axpy_prod(maybe there is a small skew in the benchmark
>> when the compiler was smarter than expected...).
>> I don't think, that this is working as intended. I can understand, why
>> prod() must be slow in ublas when the arguments are both row major.
>> What I can not understand is, that axpy_prod is that slow. Is it maybe
>> optimized for different types of matrices?
>> I would be happy if someone would find a flaw in my benchmark. I'm
>> still new to this whole benchmark thing :)
>> Greetings,
>> Oswin
>> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden]
>> Sent to: j.ungermann_at_[hidden]
> ------------------------------------------------------------------------------------------------
> ------------------------------------------------------------------------------------------------
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
> Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
> ------------------------------------------------------------------------------------------------
> ------------------------------------------------------------------------------------------------
> Kennen Sie schon unsere app?
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> Sent to: Oswin.Krause_at_[hidden]