
Ublas : 
Subject: [ublas] considering the speed of axpy_prod
From: Oswin Krause (Oswin.Krause_at_[hidden])
Date: 20120102 09:13:53
Hi list,
I think my first mail to this mailing list was, that axpy_prod had a bad
performance in a matrixvector setup. I thought at first, that this was
only subject to the matrixvector multiplication. But this seems to be an
inherent problem of the implementation of the axpy_prod.
At the moment I'm implementing a matrixmultiplication like operation, so
it was natural to benchmark it against ublas in the matrixmultiplication
setup. The core operation is prod(A,trans(B)), but axpy_prod acted weird
on that, so i took the time to Benchmark it:
const std::size_t matRows=512;
const std::size_t matColumns = 512;
const std::size_t numVectors = 400;
const std::size_t iterations= 125;
blas::matrix<double,blas::row_major> testResult(numVectors ,matRows);
blas::matrix<double,blas::row_major> argument(numVectors ,matColumns);
blas::matrix<double,blas::row_major> matrix(matRows,matColumns);
//initialize these matrices somehow...
//...
blas::matrix<double,blas::row_major> matrix2=trans(matrix);
//Benchmark for versions 15
double start=Timer::now();//some timer, replace with whatever you have
for(std::size_t i = 0; i != iterations; ++i){
//axpy_prod(argument,matrix2,testResult,false);//1
//axpy_prod(argument,trans(matrix),testResult,false);//2
//noalias(testResult)+=prod(argument,matrix2);//3
//noalias(testResult)+=prod(argument,trans(matrix));//4
//5 are two lines, I multiply with i to prevent optimization of
the product
blas::matrix<double,blas::column_major> matrix3=matrix;
noalias(testResult)+=prod(argument,matrix3);
}
double end=Timer::now();
std::cout<<endstart<<std::endl;
//place here some code to prevent the compiler to optimize everything out
I'm benchmarking on an older Intel Core2 E8400_at_3GHZ with 64KB L1 and 6MB
L2 Cache. The numbers where chosen such that the matrices never exceeded
L2. So I'm not targetting really big matrices, which results in an easier
problem for matrix products.
The compiler was GCC 4.6.2 with options O3 DBOOST_UBLAS_NDEBUG DNDEBUG
Here are the results(from left to right: numVectors=200,400,800):
matRows=matColumns=512
(1) 8s 17s 34s
(2) 25s 50s 100s
(3) 24s 48s 96s
(4) 7s 14s 28s
(5) 7s 14s 28s (+ ~0.2s compared to 4)
matRows=matColumns=256
(1) 2.3s 4.5s 9s
(2) 3.4s 6.8s 13.7s
(3) 2.8s 5.7s 11.5s
(4) 1.8 3.5 7s
(5) 1.8 3.6 7s
As written above, 1+2 are axpy_prod 35 are prod. There are two cases for
14, 1+4 are fast and 2+3 are slow. It turns out, that these cases are
orthogonal. axpy_prod is fast, when both arguments are row major(1) and
prod is fast, when the first argument is row major, the second column
major(4).
The not so fun thing is, that the best case of prod is faster than the
best case of axpy_prod. This is why I added case (5). it turns out, even
in the not so bad cases for axpy_prod copying the whole matrix into column
major format and than calculating prod is actually faster than the call to
axpy_prod(maybe there is a small skew in the benchmark when the compiler
was smarter than expected...).
I don't think, that this is working as intended. I can understand, why
prod() must be slow in ublas when the arguments are both row major. What I
can not understand is, that axpy_prod is that slow. Is it maybe optimized
for different types of matrices?
I would be happy if someone would find a flaw in my benchmark. I'm still
new to this whole benchmark thing :)
Greetings,
Oswin