
Subject: Re: [ublas] Proposal for sparse interface change/was Matrix decompositions.
From: oswin krause (oswin.krause_at_[hidden])
Date: 2013-05-29 02:41:33


Hi,

This is very good news, indeed!

Could I ask whether there are plans to rewrite the iterator interface of
ublas? Since only the parallelization-oriented projects were accepted for
ublas, are there still plans to unify the vector and matrix interfaces?

Otherwise I would like to propose a set of changes. I have implemented most
of them in my local fork (which is not yet fully operational, because this
turned out to be a lot of work...), and they should resolve most of the
basic performance issues regarding sparse operations.

1. Replace the current matrix iterators with iterators that traverse ONLY a
single row or column. Remove dual iterators.
The interface could look like:

typedef ... row_iterator;
typedef ... const_row_iterator;
row_iterator row_begin(size_type row_index);
row_iterator row_end(size_type row_index);

(same for columns).

Reasoning:
1.1 Dual iterators are costly to implement, as the iterators always need
two sets of state. This is no problem for dense containers, where the state
is more or less the current index (i,j), but for sparse vectors there is
considerable overhead (a factor of 3 in run time, not counting find(),
which is also really slow).
1.2 The matrix iterators are essentially the same as the vector iterators.
For example, in matrix addition the exact same iteration code can be used,
only running over row iterators instead of vector iterators (see the sketch
at the end of this point). This saves a lot of code.
1.3 The same holds for row and column iterators! For example, dense
row/column iterators and dense vector iterators can be implemented with one
class (using the current position in the storage array and a stride). Very
often this also holds for const and non-const iterators.
1.4 Most proxies become easy to implement. For example, matrix_row and
matrix_column can essentially just forward to the matrix's row/column
iterators. Also, a matrix_transpose proxy class (replacing matrix_unary2,
which can then be implemented as matrix_transpose + matrix_unary1) is
incredibly simple.
1.5 One could interpret this change as a unification of vectors and
matrices, by regarding a matrix as a set of vectors (its rows/columns).

There is a small disadvantage: the proposed interface can't skip empty
rows/columns in sparse matrices. However, assuming nnz > rows, this should
not make a notable difference in most cases.

In my local fork I started from a subset of ublas of roughly 25,000 lines
of code; after applying these changes it is now well below 9,000 lines.
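
To illustrate point 1.2, here is a minimal sketch under the proposed
interface (row_begin/row_end and the generic kernel are assumptions of this
proposal, not current ublas): the same element loop serves a vector and
each row of a matrix, so one kernel covers both.

#include <cstddef>

// generic element kernel: works for any element iterator range, dense or
// sparse (a sparse iterator simply visits only the stored elements)
template<class Iterator>
double sum_range(Iterator begin, Iterator end) {
    double s = 0;
    for (; begin != end; ++begin)
        s += *begin;
    return s;
}

// the matrix version just applies the vector kernel row by row,
// using the proposed row_begin(i)/row_end(i) interface
template<class Matrix>
double sum_elements(const Matrix& m) {
    double s = 0;
    for (std::size_t i = 0; i != m.size1(); ++i)
        s += sum_range(m.row_begin(i), m.row_end(i));
    return s;
}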

2. Replace make_conformant in assignment by something new.
The problem for sparse vectors/matrices is that aliasing has a different
meaning than for dense ones. For example, two rows of a dense matrix only
alias if they point to the same memory. This does not hold for sparse
matrices, where changes in one row might invalidate iterators in the other.

The way ublas solves the problem is to first check which non-zero elements
are created by the assigned expression, then create the missing entries in
the target proxy, and afterwards assign the expression (this could use some
documentation in matrix_assign; it took me a long time to figure out what
make_conformant does).

This evaluates the expression twice, because we first have to know which
elements are non-zero before we can assign them. I propose to merge both
into one operation.

Right now make_conformant creates a std::vector<std::size_t> (in the
vector-assignment case) storing the new indices, which are then inserted
afterwards.

What about:

template<class value_type>
struct Element{
     std::size_t index;
     value_type value;
};

std::vector<Element<Vector::value_type> > new_elements?

This way we would at least not have to evaluate the expression twice.
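
As a rough sketch (only an illustration of the idea, not the actual
matrix_assign code; find_element/insert_element stand for whatever lookup
and insertion the target container offers), a single pass over the
expression could assign existing entries directly and queue the missing
ones together with their values:

// assumes <vector> and <cstddef> are included and Element is the struct above
template<class SparseVector, class ExprIterator>
void assign_sparse(SparseVector& v, ExprIterator it, ExprIterator end) {
    typedef typename SparseVector::value_type value_type;
    std::vector<Element<value_type> > new_elements;
    // single pass over the expression; it.index() / *it are the usual
    // sparse iterator accessors
    for (; it != end; ++it) {
        if (value_type* p = v.find_element(it.index())) {
            *p = *it;                      // entry already exists: assign now
        } else {
            Element<value_type> e;         // entry missing: remember index AND value
            e.index = it.index();
            e.value = *it;
            new_elements.push_back(e);
        }
    }
    // only the queued (index, value) pairs still have to be inserted;
    // the expression is never evaluated a second time
    for (std::size_t k = 0; k != new_elements.size(); ++k)
        v.insert_element(new_elements[k].index, new_elements[k].value);
}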

3. Extend the sparse interfaces.
Even with the proposed change, make_conformant is slow, because inserting
into a compressed_vector using operator() is O(log(n)).
So I propose a new method for vectors (and a similar one for matrices):
iterator set_element(iterator position,size_type index, value_type value);

which inserts the element (index, value) at position, returning an iterator
to the newly inserted element. This way, insertion of k elements into a
vector with n elements can be implemented in O(n+k) (see the sketch below).
This could also replace the current push_back method, since push_back is
just the special case vec.set_element(vec.end(), index, value);

Similarly, we should have operations like reserve/reserve_row etc. in the
proxies to make the make_conformant case fast.
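
Here is a sketch of the intended usage (set_element is of course only the
method proposed above, so this illustrates the proposed semantics and
reuses the Element struct from point 2): merging a sorted batch of new
entries into an existing vector, which is exactly the make_conformant case
that should become O(n+k).

// new_elems is sorted by index and contains only entries not yet present in v
// (assumes <vector> and <cstddef> are included)
template<class SparseVector, class T>
void merge_sorted(SparseVector& v, const std::vector<Element<T> >& new_elems) {
    typename SparseVector::iterator pos = v.begin();
    for (std::size_t k = 0; k != new_elems.size(); ++k) {
        // advance past existing entries with smaller index
        // (O(n) in total over the whole loop)
        while (pos != v.end() && pos.index() < new_elems[k].index)
            ++pos;
        // insert at the hinted position; with the proposed semantics this
        // avoids the per-element search that operator() would do
        pos = v.set_element(pos, new_elems[k].index, new_elems[k].value);
        ++pos; // continue after the freshly inserted element
    }
}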

Also, swapping might need some love, as it is ridiculous right now (I need
it sometimes, and right now just copying the reordered matrices outright is
a lot faster).

4. Only one sparse matrix for all.
Optimizing code for sparse matrices is hard. Optimizing it for multiple
formats is impossibly hard. I propose to remove all variants of the sparse
matrix except compressed_matrix. It should be reimplemented such that it is
not densely packed inside its arrays; instead there should be a bit of
space between the matrix rows (or columns, depending on orientation). This
way, inserting elements in the middle (as with operator+=) is not as
expensive.
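
Purely as an illustration of the kind of layout I mean (the names and
members are made up for this sketch, nothing here is existing ublas code):

#include <cstddef>
#include <vector>

// each row owns a slice [row_start[i], row_start[i] + row_capacity[i]) of
// the index/value arrays, of which only the first row_size[i] slots are
// used; the remaining slack lets operator+= insert into a row without
// shifting all following rows
template<class T>
struct padded_compressed_storage {
    std::vector<std::size_t> row_start;     // offset of each row in the arrays
    std::vector<std::size_t> row_size;      // stored non-zeros per row
    std::vector<std::size_t> row_capacity;  // allocated slots per row (>= row_size)
    std::vector<std::size_t> column_index;  // column indices, packed row by row
    std::vector<T>           value;         // values, parallel to column_index
};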

Ohh, this was rather long. But I want to get this discussion rolling, as
I regard sparseness as an important feature of ublas. Please ask if you
have questions! I am happy to elaborate on unclear points.

Greetings,
Oswin

On 28.05.2013 17:41, Nasos Iliopoulos wrote:
> Yes,
> additionally with the boost transition to github we are planning to
> make contributing to uBlas much easier. We would like though to stress
> quality and find a way to define performance requirements
> specifications. When the github repo (a fork of boostorg/ublas) is
> ready (I hope I can have it ready this weekend) we will include
> instructions on how to work with pull requests (to submit an algorithm
> or a small bugfix for example), or in cases where larger involvement
> is required maybe give access to the ublas development repo to ppl
> that want to contribute.
>
> -Nasos
>
>
> On 05/28/2013 10:59 AM, Salman Javaid wrote:
>> Are there any plans to integrate Karl's QR Decomposition
>> implementation into Boost.uBLAS? I will be grateful for a response.
>>
>>
>>
>> Best Regards,
>> Salman Javaid
>>
>>
>> On Thu, May 16, 2013 at 7:06 AM, Karl Rupp <rupp_at_[hidden]> wrote:
>>
>> Hi guys,
>>
>> I have had a working implementation of QR for uBLAS in ViennaCL for
>> about a year already:
>>
>> https://github.com/viennacl/viennacl-dev/blob/master/viennacl/linalg/qr.hpp
>>
>> Feel free to copy&paste and relicense as needed; I agree to whatever
>> is necessary to integrate it into uBLAS if there is interest.
>>
>> There is a bit of duplication for the ViennaCL types (requiring
>> host<->GPU transfers) and the uBLAS types (operating in main
>> RAM), yet it gives an idea how things can be implemented. It can
>> certainly be improved here and there (more details on request),
>> yet it addresses most of the points raised by Oswin. And it's
>> faster than a standard LAPACK for sizes above ~1k times 1k.
>>
>> I recommend extracting the Householder reflections into a nice
>> separate interface, since this functionality will also be needed
>> for other algorithms like SVD or GMRES. As a nice side effect, it
>> makes the implementation for QR more compact.
>>
>> Generally, in order to get *any* reasonable performance, one
>> really needs to use matrix-matrix multiplications where possible
>> in order to avoid the memory bandwidth bottleneck.
>>
>> Best regards,
>> Karli
>>
>>
>>
>> On 05/16/2013 01:06 AM, oswin krause wrote:
>>
>> Hi,
>>
>> These are further good points!
>>
>> I also came up with a few new ones(and tips!):
>>
>> - QR needs pivoting. Its main usages are SVD and pseudo-inverses; in
>> both cases the input does not necessarily have full rank. Pivoting also
>> helps for matrices with high condition numbers.
>>
>> - For the same reasons H was not formed explicitly, Q should not be
>> formed either. Instead there should be a version of the algorithm which
>> only returns the reflection vectors forming Q.
>>
>> - For dense matrices at least, it is possible to do the QR in place by
>> storing the R part in the upper triangle and the Householder
>> transformation vectors in the lower triangle. (That is very similar to
>> how LU is implemented.)
>>
>> - The best sources for algorithmic information are the LAPACK working
>> notes: http://www.netlib.org/lapack/lawns/
>>
>> In your case lawn114 seems to be the most relevant, even though it
>> assumes a fast BLAS3.
>>
>> Greetings,
>> Oswin
>>
>>
>> On 16.05.2013 05:32, Nasos Iliopoulos wrote:
>>
>> That's not a bad start.
>>
>> I think Oswin covered a whole lot of items here, but a more complete
>> algorithm needs to satisfy some or all of the following:
>>
>> - The algorithm should have a dispatch mechanism so that optimized
>> versions for various matrix types (sparse, banded, etc.) can be
>> provided. You don't necessarily need to provide them all to start with.
>> - The algorithm should operate on matrix expressions rather than
>> matrices (so it can be applied, for example, to subranges). Use static
>> dispatch or overloads if for some reason this seems to reduce
>> performance.
>> - Const correctness is important. Try using const references on
>> immutable types.
>> - Instead of 0.00025, provide a value based on a user choice. If it is
>> hard coded by the user, the compiler will probably convert it into a
>> const value.
>> - Don't use ints for indexing; use either std::size_t or
>> container::size_type. If you need a signed type (e.g. to count
>> differences of unsigned types) use ptrdiff_t. uBlas containers provide
>> a difference_type typedef for that purpose (e.g.
>> matrix<double>::difference_type).
>> - Use noalias(..) = in every assignment where the lhs is not part of
>> the rhs, or when the algebraic operation is mapped 1-1 (e.g. A=2*A+B
>> can be written as noalias(A)=2*A+B, but A=prod(B,A)+D cannot atm).
>> This provides more benefits than just avoiding temporaries.
>>
>>
>> The QR decomposition of a 100x100 matrix should take no more than a
>> few milliseconds (or even less than a millisecond) to run.
>> A 1000x1000 one should take around 1/3 to 1/10 of a second.
>>
>> Compile with:
>> g++ -O3 -DNDEBUG Main.cpp -o qrtest
>>
>> Then you'll see that your code runs pretty fast, but it doesn't scale
>> well, as Oswin noted.
>>
>> Best regards,
>> -Nasos
>>
>>
>>
>>
>> On 05/15/2013 10:12 PM, Salman Javaid wrote:
>>
>> Thank you, Oswin, for the detailed response. I am going to update the
>> code.
>>
>> David, Nasos, any advice on coding conventions? Or anything else that
>> you can possibly suggest? I will stand grateful.
>>
>>
>>
>>
>>
>> Best Regards,
>> Salman Javaid
>>
>>
>> On Tue, May 14, 2013 at 10:53 PM, oswin krause
>> <oswin.krause_at_[hidden]> wrote:
>>
>> Hi,
>>
>> in the order I stumbled over things:
>>
>> main.cpp
>> line 44-54: you don't need a copy; instead you should use a
>> combination of row/subrange.
>> line 58-60: you should take a look at inner_prod.
>> line 63: 0.00025 is too big.
>> line 66: you should never create H explicitly.
>> line 67: because you formed H, this step is O(n^3), which makes the
>> whole algorithm O(n^4). It can be done in O(n^2).
>> line 73-79: the same applies here.
>>
>> Greetings,
>> Oswin
>>
>> On 14.05.2013 22:12, Salman Javaid wrote:
>>
>> Hello uBLAS Contributors:
>>
>> I have applied to GSoC 2013 and pitched an implementation of SVD
>> factorization for uBLAS. In order to better prepare myself and to get
>> my hands dirty with uBLAS, I ended up implementing QR factorization
>> employing Householder reflections using uBLAS. This is only a first
>> draft and will need significant improvement; e.g. computing the QR
>> decomposition of a 100 x 100 matrix takes around 30 seconds. But I
>> guess just to get familiar with the code base it was a good exercise.
>> Over the next week or two I will be trying to optimize the code.
>>
>> I will be absolutely grateful if contributors could have a quick
>> glance at the code and point me to any improvements they can suggest,
>> particularly in speeding up matrix multiplication.
>>
>> I used Visual Studio 2010 to compile the code. I will try to get the
>> code running on my Ubuntu machine in a couple of days, hopefully.
>>
>> Here is the header file:
>> https://github.com/salmanjavaid/QR_Decomposition/blob/master/QR_Header.hpp
>>
>> The main file:
>> https://github.com/salmanjavaid/QR_Decomposition/blob/master/Main.cpp
>>
>>
>> Best Regards,
>> Salman Javaid