Subject: Re: [ublas] Strange performance hits noticed - help!
From: Gunter Winkler (guwi17_at_[hidden])
Date: 2010-05-31 17:40:17
Hello Sunil,
answering performance questions is always difficult without a running
example. However, I'll try to give some ideas about where the problem
might arise. Generally, I try to use a profiler to track down such cases.
Could you try to run gprof (or similar) and send the output by direct
mail? Maybe I can identify the bottleneck.
On Saturday, 22 May 2010, Sunil Thomas wrote:
> Dear ublas developers,
>
> I switched from compressed_matrix storage to mapped_matrix
> for a range of problems; mostly I saw
> very good results (whether it was to fill in or traverse) in terms of
> performance improvement (a factor of ~200
> improvement). And I am talking here about a problem size of
> ~2.5 million unknowns approx (still small
> for my application.. but that is another matter).
the default type of the mapped matrix is (from fwd.hpp):

template<class T, class L = row_major, class A = map_std<std::size_t, T> >
class mapped_matrix;

Thus the efficiency mostly depends on the STL implementation used. Each
index pair is linearized into a single std::size_t key which is then
looked up in a std::map.
> *My fillin code follows something like this (relevant part pasted):*
>
> for (unsigned int ic = 0; ic < faces.size(); ++ic) {
> [...]
> //Main diagonal elements
> matrix_A(uic1, uic1) += coeff;
> matrix_A(uic2, uic2) += coeff;
>
> //Off-diagonal elements
> matrix_A(uic1, uic2) += coeff;
> matrix_A(uic2, uic1) += coeff;
>
> } // end loop on connections ic
This is the usual bottleneck for filling sparse matrices. For a mapped
matrix this calls find(key) on the map, which is usually an O(log(nnz))
operation. For a compressed_matrix the complexity is only O(log(nnz per
row)). If the element must be inserted then additional time is needed;
for compressed_matrix this is an O(nnz) copy operation in the worst case.
For comparison: using coordinate_matrix::append_element(i,j,value) has
constant cost and is equivalent to A(i,j) += value. The price is more
memory consumption and a final sort-and-compress step.
> *My traversal code follows something like this (relevant part
> pasted):*
>
> Allocate cols, vals;
>
> for(itm1 i1 = mat_A.begin1(); i1 != mat_A.end1(); ++i1) {
hopefully mat_A is a const type here, otherwise mat_A.begin1() will most
likely return the mutable iterator, which is much, much slower than the
const iterator.
> int nnz = 0;
> itm2 i2 = i1.begin();
This looks like the problem, because
mapped_matrix::const_iterator1::begin() calls mapped_matrix::find2(),
which has complexity O(log(nnz)) for mapped_matrix and O(1) for
compressed_matrix. (Assumption: the map has ordered keys, because the
mapped matrix implementation must do a lower_bound search for the first
element pointing to the current row.)
> //Loop over each row's nonzero elements
> for(; i2 != i1.end(); ++i2) {
> cols[nnz] = (int) i2.index2();
> vals[nnz] = *i2;
> ++nnz;
> }
A much more efficient way to iterate over all nonzeroes is to use the
data() member function, which returns the internal storage array.
>
> //Pass along cols, vals to "solver package"... irrelevant here..
> (in fact commented out when noting performance times).
>
> }
>
> Is there anything I am doing wrong? It was recommended that I try
> generalized_vector_of_vector - will that choice resolve such issues?
I still vote for GVOV as the best class for matrix assembly. Could you
please provide a good and a bad example to demonstrate the timings?
mfg
Gunter