Ublas :

From: Frank Astier (fastier_at_[hidden])
Date: 2005-09-21 18:53:31

Ian -

Thanks for the correction. It got rid of the bizarre effect where
ublas always seemed to do the same amount of work, regardless of
sparsity :-)

I implemented my own sparse matrix too, optimized for multiplication
with dense vectors (I didn't even write random access to elements - I
don't need it!). Below 10% non-zeros, I get at least 3 times better
than the naive dense multiplication, and up to 40 times better than
ublas on small matrix sizes (around 256x256). Unfortunately, the
matrices I handle don't have any structure I can take advantage of.

Frank

On Sep 21, 2005, at 15:26, Ian McCulloch wrote:

> Frank Astier wrote:
>
>
>> I'm still trying to multiply a sparse matrix by a vector efficiently,
>> comparing ublas to just looping over the whole matrix.
>>
>> For a square matrix of size 256x256 of floats with % non-zeros
>> between 10 and 90%, ublas does not seem to be efficient compared to
>> hand-crafted code that just carries out the naive matrix
>> multiplication. axpy_prod performs worse than prod. sparse_matrix in
>> 1_32 performs 3 times worse than compressed_matrix. I was hoping
>> ublas would be efficient even with matrices of that size, and not
>> that far off from "naive" code speed.
>>
>
> Firstly, your matrix is ALWAYS FULL, it is not sparse. Your
> initialization
> code is:
>
> for (size_t i = 0; i < nrows; ++i)
>   for (size_t j = 0; j < ncols; ++j)
>     m(i, j) = vm[i][j] = float(rand()%10)/10.0 < pctNZ ? 1.0 : 0.0;
>
> Setting an element of a sparse matrix to zero really does insert an
> element
> that is zero, it is not a no-operation (like in Matlab, for example).
>
> You should change the last line to
>
> if (float(rand()%10)/10.0 < pctNZ) m(i,j) = 1.0;
>
> It might be worth re-trying axpy_prod with this in mind. However,
> the usual
> prod() in ublas is still slow for these fillings:
>
> Linux 2.6.12-gentoo-r4 x86_64 AMD Athlon(tm) 64 Processor 3200+
>
> fil prod dense
> 0.1 2.63 0.16
> 0.2 4.27 0.16
> 0.3 5.46 0.16
> 0.4 6.26 0.16
> 0.5 6.67 0.16
> 0.6 7.05 0.17
> 0.7 7.18 0.16
> 0.8 7.19 0.16
> 0.9 7.15 0.16
>
> (boost 1.32)
>
> However, a matrix size of 256x256 is too small nowadays to get good
> performance from sparse algorithms - or rather, modern processors are
> very fast at in-cache sequential accesses and penalize anything else
> very heavily. Any time you have indirect addressing you will lose
> badly, and it is very easy to stall the pipeline (eg, if the compiler
> cannot guarantee there is no aliasing on the left hand side, it might
> pessimistically sequence the operations so that it updates the left
> hand side before reading the next source operand).
>
> Also, a filling of 0.1 is not very sparse. This is probably around
> the break-even point for the cost of an indirect access (ie, my *very
> rough* guess would be that an optimized sparse multiply with 10%
> non-zeros would be about the same speed as an optimized dense
> multiply). For a sparse matrix implemented as a map or list
> structure, you will not get any speed improvement for matrix-vector
> multiply until the non-zero fraction is down to maybe 0.001.
>
> Using one of my sparse matrix classes (implemented as a vector of
> std::map,
> same as uBLAS I think (?) ), I get
>
> fil prod dense
> 0.1 0.79 0.16
> 0.2 1.82 0.16
> 0.3 2.62 0.16
> 0.4 3.21 0.17
> 0.5 3.55 0.16
> 0.6 3.73 0.16
> 0.7 3.83 0.16
> 0.8 3.91 0.16
> 0.9 3.89 0.16
>
> Still very bad compared with the dense case. To optimize
> matrix-vector multiplies, you need a much better sparse format,
> SparseBLAS CSR or something like that. In such a format, inserting
> and removing elements are expensive, but matrix-vector multiplies are
> cheap (though they still involve indirect addressing). I don't think
> uBLAS has such a storage format? It would be interesting to compare
> SparseBLAS performance in this case.
>
> Even moving to 2560x2560 matrices still has poor performance:
>
> fil prod dense
> 0.1 292 22
> 0.2 440 21
> 0.3 562 20
> 0.4 632 21
> 0.5 678 21
> 0.6 714 21
> 0.7 731 21
> 0.8 734 21
> 0.9 725 21
>
> The uBLAS result is at least a factor of 2 slower than it could be,
> but even so the dense performance blows it away. Interestingly, the
> naive vector-of-vectors dense matrix has almost exactly the same
> performance as sgemv from BLAS (acml-2.5.0) - for this matrix size it
> is completely memory-bandwidth limited and further optimizing the
> dense matrix-vector multiply is pointless. For matrices that fit in
> the L1 cache, sgemv is twice as fast (or more) than the naive version.
>
> In summary: if you want good sparse matrix-vector multiply
> performance, you need to use a data structure optimized for access,
> rather than insertion/removal. It would be very interesting to
> compare with sparse BLAS performance: http://math.nist.gov/spblas/ .
>
> Cheers,
> Ian
>
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
