Boost logo

Ublas :

From: Ian McCulloch (ianmcc_at_[hidden])
Date: 2005-09-21 17:26:56

Frank Astier wrote:

> I'm still trying to multiply a sparse matrix by a vector efficiently,
> comparing ublas to just looping over the whole matrix.
> For a square matrix of size 256x256 of floats with % non-zeros
> between 10 and 90%, ublas does not seem to be efficient compared to
> hand-crafted code that just carries out the naive matrix
> multiplication. axpy_prod performs worse than prod. sparse_matrix in
> 1_32 performs 3 times worse than compressed_matrix. I was hoping
> ublas would be efficient even with matrices of that size, and not
> that far off from "naive" code speed.

Firstly, your matrix is ALWAYS FULL, it is not sparse. Your initialization
code is:

    for (size_t i = 0; i < nrows; ++i)
      for (size_t j = 0; j < ncols; ++j)
        m(i, j) = vm[i][j] = float(rand()%10)/10.0 < pctNZ ? 1.0 : 0.0;;

Setting an element of a sparse matrix to zero really does insert an element
that is zero, it is not a no-operation (like in Matlab, for example).

You should change the last line to

         if (float(rand()%10)/10.0 < pctNZ) m(i,j) = 1.0;

It might be worth re-trying axpy_prod with this in mind. However, the usual
prod() in ublas is still slow for these fillings:

Linux 2.6.12-gentoo-r4 x86_64 AMD Athlon(tm) 64 Processor 3200+

% prod dense
0.1 2.63 0.16
0.2 4.27 0.16
0.3 5.46 0.16
0.4 6.26 0.16
0.5 6.67 0.16
0.6 7.05 0.17
0.7 7.18 0.16
0.8 7.19 0.16
0.9 7.15 0.16

(boost 1.32)

However, for matrix size 256x256 is too small nowdays to get good
performance from sparse algorithms - or rather, modern processors are very
fast at in-cache sequential accesses and penalize anything else very
heavily. Any time you have indirect adressing you will lose badly, and it
is very easy to stall the pipeline (eg, if the compiler cannot guarantee
there is no aliasing on the left hand side, it might pessimistially
sequence the operations so that it updates the left hand side before
loading the next right hand side element).

Also, a filling of 0.1 is not very sparse. This is probably the cost of an
indirect access (ie, my *very rough* guess would be that an optimized
sparse multiply with 10% non-zeros would be about the same speed as an
optimized dense multiply). For a sparse matrix implemented as a map or
list structure, you will not get any speed improvement for matrix-vector
multiply until you have maybe 0.001 non-zero elements.

Using one of my sparse matrix classes (implemented as a vector of std::map,
same as uBLAS I think (?) ), I get

fil prod dense
0.1 0.79 0.16
0.2 1.82 0.16
0.3 2.62 0.16
0.4 3.21 0.17
0.5 3.55 0.16
0.6 3.73 0.16
0.7 3.83 0.16
0.8 3.91 0.16
0.9 3.89 0.16

Still very bad compared with the dense case. To optimize matrix-vector
multiplies, you need a much better sparse format, SparseBLAS CSR or
something like that. In this case, inserting and removing elements are
expensive, matrix-vector multiplies are cheap (but still involve indirect
addressing). I think there is not such a storage format in uBLAS? But it
would be interesting to compare SparseBlas performance in this case.

Even moving to 2560x2560 matrices still has poor performance:

fil prod dense
0.1 292 22
0.2 440 21
0.3 562 20
0.4 632 21
0.5 678 21
0.6 714 21
0.7 731 21
0.8 734 21
0.9 725 21

The uBLAS result is at least a factor 2 slower than it could be, but even so
the dense performance blows it away. Interestingly, the naive
vector-of-vectors dense matrix has almost exactly the same performance as
sgemv from BLAS (acml-2.5.0) - for this matrix size it is completely
memory-bandwidth limited and further optimizing the dense-matrix vector
multiply is pointless. For matrices that fit in the L1 cache, sgemv is
twice as fast (or more) than the naive version.

In summary: if you want good sparse matrix-vector multiply performance, you
need to use a data structure optimized for access, rather than
insertion/removal. It would be very interesting to compare with sparse
BLAS performance .