
Subject: Re: [ublas] Strange performance hits noticed - help!
From: Gunter Winkler (guwi17_at_[hidden])
Date: 2010-05-31 17:40:17


Hello Sunil,

answering performance questions is always difficult without a running
example. However, I'll try to give some ideas about where the problem
might arise. Generally, I use a profiler to track down such cases.
Could you run gprof (or similar) and send the output by direct mail?
Maybe I can identify the bottleneck.

On Saturday, 22 May 2010, Sunil Thomas wrote:
> Dear ublas-developers,
>
> I switched from compressed_matrix storage to mapped_matrix for a
> range of problems, and mostly I saw very good results (whether for
> fill-in or traversal) in terms of performance improvement (a factor
> of ~200). And I am talking here about a problem size of roughly
> 2.5 million unknowns (still small for my application.. but that is
> another matter).

the default type of the mapped matrix is (from fwd.hpp):

template<class T, class L = row_major,
         class A = map_std<std::size_t, T> >
class mapped_matrix;

Thus the efficiency mostly depends on the STL implementation used: each
index pair (i, j) is mapped to a single std::size_t key, which is
stored in a std::map together with the value.
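
For illustration, a minimal sketch of what one element access does
under the hood. The key formula (i * size2 + j for the default
row_major layout) is an implementation detail I am inferring, not a
documented guarantee:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <iostream>

int main() {
    namespace ublas = boost::numeric::ublas;
    ublas::mapped_matrix<double> A(4, 4); // backed by std::map<std::size_t, double>

    A(1, 2) = 3.0; // one std::map insertion

    // data() exposes the underlying map: each stored element is a
    // (linearized key, value) pair; here the key should be 1*4 + 2 = 6.
    typedef ublas::mapped_matrix<double>::array_type::const_iterator it_t;
    for (it_t it = A.data().begin(); it != A.data().end(); ++it)
        std::cout << "key " << it->first << " -> " << it->second << "\n";
}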

> *My fill-in code follows something like this (relevant part pasted):*
>
> for (unsigned int ic = 0; ic < faces.size(); ++ic) {
> [...]
> //Main diagonal elements
> matrix_A(uic1, uic1) += -coeff;
> matrix_A(uic2, uic2) += -coeff;
>
> //Off-diagonal elements
> matrix_A(uic1, uic2) += coeff;
> matrix_A(uic2, uic1) += coeff;
>
> } // end - loop on connections ic

This is the usual bottleneck for filling sparse matrices. For a mapped
matrix each access calls find(key) on the map, which is usually an
O(log(nnz)) operation. For a compressed_matrix the complexity is only
O(log(nnz per row)). If the element must be inserted, additional time
is needed - for compressed_matrix this is an O(nnz) copy operation in
the worst case.

For comparison: using coordinate_matrix::append_element(i,j,value) has a
constant cost and is equivalent to A(i,j) += value. The price is more
memory consumption and a final sort-and-compress step.
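
A sketch of that assembly pattern, applied to your fill-in loop (the
sizes and coefficients below are made up for the example):

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <boost/numeric/ublas/io.hpp>
#include <iostream>

int main() {
    namespace ublas = boost::numeric::ublas;
    const std::size_t n = 5;

    // Reserve roughly the number of appends you expect up front.
    ublas::coordinate_matrix<double> C(n, n, 4 * n);

    const double coeff = 0.5;                 // hypothetical face coefficient
    for (std::size_t ic = 0; ic + 1 < n; ++ic) {
        std::size_t uic1 = ic, uic2 = ic + 1; // hypothetical cell indices
        C.append_element(uic1, uic1, -coeff); // duplicates are summed on
        C.append_element(uic2, uic2, -coeff); // sort, so this acts like
        C.append_element(uic1, uic2,  coeff); // A(i,j) += value
        C.append_element(uic2, uic1,  coeff);
    }
    C.sort();                                 // the one-time sort-and-compress

    ublas::compressed_matrix<double> A(C);    // convert for fast traversal
    std::cout << A << std::endl;
}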

> *My traversal code follows something like this (relevant part
> pasted):*
>
> Allocate cols, vals;
>
> for(itm1 i1 = mat_A.begin1(); i1 != mat_A.end1(); ++i1) {

Hopefully mat_A is const here; otherwise mat_A.begin1() will most
likely return the mutable iterator, which is much, much slower than the
const iterator.
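
If mat_A cannot be declared const where it is traversed, binding a
const reference (or passing it by const reference) is enough to select
the fast iterators. A minimal sketch - the function name and element
type are my own, not from your code:

#include <boost/numeric/ublas/matrix_sparse.hpp>

namespace ublas = boost::numeric::ublas;

double sum_nonzeros(ublas::mapped_matrix<double> const& mat_A) {
    // Because mat_A is const, begin1()/begin() return const_iterator1/
    // const_iterator2 instead of the slow mutable iterators.
    typedef ublas::mapped_matrix<double>::const_iterator1 itm1;
    typedef ublas::mapped_matrix<double>::const_iterator2 itm2;
    double sum = 0.0;
    for (itm1 i1 = mat_A.begin1(); i1 != mat_A.end1(); ++i1)
        for (itm2 i2 = i1.begin(); i2 != i1.end(); ++i2)
            sum += *i2; // touches only stored nonzeros
    return sum;
}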

> int nnz = 0;
> itm2 i2 = i1.begin();

This looks like the problem: mapped_matrix::const_iterator1::begin()
calls mapped_matrix::find2(), which has complexity O(log(nnz)) for
mapped_matrix but O(1) for compressed_matrix. (The map must have
ordered keys, because the mapped_matrix implementation has to do a
lower_bound search for the first element belonging to the current row.)
A faster alternative is sketched below.

> //Loop over each row's non-zero elements
> for(; i2 != i1.end(); ++i2) {
> cols[nnz] = (int) i2.index2();
> vals[nnz] = *i2;
> ++nnz;
> }

A much more efficient way to iterate over all nonzeros is to use the
data() member function, which returns the internal storage array - for
mapped_matrix, the underlying std::map itself.
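
A sketch of such an extraction loop. Note that decoding the map key as
(i, j) = (key / size2, key % size2) assumes the default row_major
layout; that is an implementation detail, so please verify it against
your uBLAS version:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <vector>

namespace ublas = boost::numeric::ublas;

void extract(ublas::mapped_matrix<double> const& A,
             std::vector<int>& cols, std::vector<double>& vals) {
    typedef ublas::mapped_matrix<double>::array_type::const_iterator it_t;
    const std::size_t size2 = A.size2();
    // The map is ordered by key, so with row_major layout the entries
    // come out row by row; it->first / size2 recovers the row index if
    // you need to split cols/vals into per-row slices for the solver.
    for (it_t it = A.data().begin(); it != A.data().end(); ++it) {
        cols.push_back(static_cast<int>(it->first % size2)); // column
        vals.push_back(it->second);                          // value
    }
}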

>
> //Pass along cols, vals to "solver-package"... irrelevant here..
> (in fact commented out when noting performance times).
>
> }
>
> Am I doing something wrong? It was recommended that I try
> generalized_vector_of_vector - will that choice resolve such issues?

I still vote for GVOV (generalized_vector_of_vector) as the best class
for matrix assembly. Could you please provide a good and a bad example
demonstrating the timings?
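
For reference, a minimal GVOV assembly sketch. The choice of
compressed_vector rows is mine; mapped_vector or coordinate_vector rows
work as well:

#include <boost/numeric/ublas/vector_of_vector.hpp>
#include <boost/numeric/ublas/matrix_sparse.hpp>

namespace ublas = boost::numeric::ublas;

// One sparse vector per row: an insertion only ever touches the row
// being filled, never the whole matrix.
typedef ublas::generalized_vector_of_vector<
    double, ublas::row_major,
    ublas::vector<ublas::compressed_vector<double> > > gvov_t;

int main() {
    const std::size_t n = 5;
    gvov_t G(n, n);

    const double coeff = 0.5;                  // hypothetical coefficient
    for (std::size_t ic = 0; ic + 1 < n; ++ic) {
        G(ic, ic)         += -coeff;
        G(ic + 1, ic + 1) += -coeff;
        G(ic, ic + 1)     +=  coeff;
        G(ic + 1, ic)     +=  coeff;
    }

    // Convert once for fast row-wise traversal and the solver interface.
    ublas::compressed_matrix<double> A(G);
    return 0;
}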

Best regards,
Gunter