Boost logo

Ublas :

Subject: [ublas] considering the speed of axpy_prod
From: Oswin Krause (Oswin.Krause_at_[hidden])
Date: 2012-01-02 09:13:53

Hi list,

I think my first mail to this mailing list was, that axpy_prod had a bad
performance in a matrix-vector setup. I thought at first, that this was
only subject to the matrix-vector multiplication. But this seems to be an
inherent problem of the implementation of the axpy_prod.

At the moment I'm implementing a matrix-multiplication like operation, so
it was natural to benchmark it against ublas in the matrix-multiplication
setup. The core operation is prod(A,trans(B)), but axpy_prod acted weird
on that, so i took the time to Benchmark it:

const std::size_t matRows=512;
const std::size_t matColumns = 512;
const std::size_t numVectors = 400;
const std::size_t iterations= 125;

blas::matrix<double,blas::row_major> testResult(numVectors ,matRows);
blas::matrix<double,blas::row_major> argument(numVectors ,matColumns);
blas::matrix<double,blas::row_major> matrix(matRows,matColumns);

//initialize these matrices somehow...
blas::matrix<double,blas::row_major> matrix2=trans(matrix);

//Benchmark for versions 1-5
double start=Timer::now();//some timer, replace with whatever you have
for(std::size_t i = 0; i != iterations; ++i){
        //5 are two lines, I multiply with i to prevent optimization of
the product
        blas::matrix<double,blas::column_major> matrix3=matrix;
double end=Timer::now();

//place here some code to prevent the compiler to optimize everything out

I'm benchmarking on an older Intel Core2 E8400_at_3GHZ with 64KB L1 and 6MB
L2 Cache. The numbers where chosen such that the matrices never exceeded
L2. So I'm not targetting really big matrices, which results in an easier
problem for matrix products.

The compiler was GCC 4.6.2 with options -O3 -DBOOST_UBLAS_NDEBUG -DNDEBUG

Here are the results(from left to right: numVectors=200,400,800):

(1) 8s 17s 34s
(2) 25s 50s 100s
(3) 24s 48s 96s
(4) 7s 14s 28s
(5) 7s 14s 28s (+ ~0.2s compared to 4)

(1) 2.3s 4.5s 9s
(2) 3.4s 6.8s 13.7s
(3) 2.8s 5.7s 11.5s
(4) 1.8 3.5 7s
(5) 1.8 3.6 7s

As written above, 1+2 are axpy_prod 3-5 are prod. There are two cases for
1-4, 1+4 are fast and 2+3 are slow. It turns out, that these cases are
orthogonal. axpy_prod is fast, when both arguments are row major(1) and
prod is fast, when the first argument is row major, the second column

The not so fun thing is, that the best case of prod is faster than the
best case of axpy_prod. This is why I added case (5). it turns out, even
in the not so bad cases for axpy_prod copying the whole matrix into column
major format and than calculating prod is actually faster than the call to
axpy_prod(maybe there is a small skew in the benchmark when the compiler
was smarter than expected...).

I don't think, that this is working as intended. I can understand, why
prod() must be slow in ublas when the arguments are both row major. What I
can not understand is, that axpy_prod is that slow. Is it maybe optimized
for different types of matrices?

I would be happy if someone would find a flaw in my benchmark. I'm still
new to this whole benchmark thing :)