
uBLAS:
From: Ian McCulloch (ianmcc_at_[hidden])
Date: 2005-09-21 17:26:56
Frank Astier wrote:
> I'm still trying to multiply a sparse matrix by a vector efficiently,
> comparing ublas to just looping over the whole matrix.
>
> For a square matrix of size 256x256 of floats with % nonzeros
> between 10 and 90%, ublas does not seem to be efficient compared to
> handcrafted code that just carries out the naive matrix
> multiplication. axpy_prod performs worse than prod. sparse_matrix in
> 1_32 performs 3 times worse than compressed_matrix. I was hoping
> ublas would be efficient even with matrices of that size, and not
> that far off from "naive" code speed.
Firstly, your matrix is ALWAYS FULL, it is not sparse. Your initialization
code is:

    for (size_t i = 0; i < nrows; ++i)
        for (size_t j = 0; j < ncols; ++j)
            m(i, j) = vm[i][j] = float(rand()%10)/10.0 < pctNZ ? 1.0 : 0.0;

Setting an element of a sparse matrix to zero really does insert an element
that is zero; it is not a no-operation (as it is in Matlab, for example).
You should change the last line to

    if (float(rand()%10)/10.0 < pctNZ) m(i,j) = 1.0;
It might be worth retrying axpy_prod with this in mind. However, the usual
prod() in ublas is still slow for these fillings:
Linux 2.6.12-gentoo-r4 x86_64, AMD Athlon(tm) 64 Processor 3200+
fil prod dense
0.1 2.63 0.16
0.2 4.27 0.16
0.3 5.46 0.16
0.4 6.26 0.16
0.5 6.67 0.16
0.6 7.05 0.17
0.7 7.18 0.16
0.8 7.19 0.16
0.9 7.15 0.16
(boost 1.32)
However, a matrix size of 256x256 is too small nowadays to get good
performance from sparse algorithms -- or rather, modern processors are very
fast at in-cache sequential accesses and penalize anything else very
heavily. Any time you have indirect addressing you will lose badly, and it
is very easy to stall the pipeline (e.g., if the compiler cannot guarantee
there is no aliasing on the left-hand side, it might pessimistically
sequence the operations so that it updates the left-hand side before
loading the next right-hand side element).
Also, a filling of 0.1 is not very sparse. This is probably the cost of an
indirect access (i.e., my *very rough* guess would be that an optimized
sparse multiply with 10% nonzeros would be about the same speed as an
optimized dense multiply). For a sparse matrix implemented as a map or
list structure, you will not get any speed improvement for matrix-vector
multiply until the filling drops to maybe 0.001.
Using one of my sparse matrix classes (implemented as a vector of
std::map, same as uBLAS, I think?), I get
fil prod dense
0.1 0.79 0.16
0.2 1.82 0.16
0.3 2.62 0.16
0.4 3.21 0.17
0.5 3.55 0.16
0.6 3.73 0.16
0.7 3.83 0.16
0.8 3.91 0.16
0.9 3.89 0.16
Still very bad compared with the dense case. To optimize matrix-vector
multiplies, you need a much better sparse format, such as Sparse BLAS CSR.
With that format, inserting and removing elements is expensive, but
matrix-vector multiplies are cheap (though they still involve indirect
addressing). I don't think uBLAS has such a storage format? But it
would be interesting to compare Sparse BLAS performance in this case.
Even moving to 2560x2560 matrices, performance is still poor:
fil prod dense
0.1 292 22
0.2 440 21
0.3 562 20
0.4 632 21
0.5 678 21
0.6 714 21
0.7 731 21
0.8 734 21
0.9 725 21
The uBLAS result is at least a factor of 2 slower than it could be, but
even so the dense performance blows it away. Interestingly, the naive
vector-of-vectors dense matrix has almost exactly the same performance as
sgemv from BLAS (ACML 2.5.0) -- for this matrix size it is completely
memory-bandwidth limited, and further optimizing the dense matrix-vector
multiply is pointless. For matrices that fit in the L1 cache, sgemv is
at least twice as fast as the naive version.
In summary: if you want good sparse matrix-vector multiply performance,
you need to use a data structure optimized for access rather than for
insertion/removal. It would be very interesting to compare with Sparse
BLAS performance: http://math.nist.gov/spblas/ .
Cheers,
Ian