Yes, I forgot and that is criminal on my part. His code is an excellent starter to better understanding the implementation intricacies. I actually went through the code but hadn't got time to really study it. But no excuses, hugely grateful to Karl. I am really looking forward to working with him and all of you guys through the summer.

Best Regards,
Salman Javaid

On Mon, May 20, 2013 at 6:06 AM, Nasos Iliopoulos <> wrote:
You welcome, but I think Karl's contribution was much more useful, right? :)


On 05/19/2013 02:09 AM, Salman Javaid wrote:
Thank you, Oswin and Nasos for the detailed responses. I am hugely grateful. This is enormous help. I am now trying to dig into optimized algorithms for different matrices.

Oswin, thank you for the link to the paper. I am going through it.

Best Regards,
Salman Javaid

On Thu, May 16, 2013 at 7:06 AM, Karl Rupp <> wrote:
Hi guys,

I have a working implementation of QR for uBLAS in ViennaCL for about a year already:

Feel free to copy&paste and relicense as needed, I agree to whatever is necessary to integrate into uBLAS if of interest.

There is a bit of duplication for the ViennaCL types (requiring host<->GPU transfers) and the uBLAS types (operating in main RAM), yet it gives an idea how things can be implemented. It can certainly be improved here and there (more details on request), yet it addresses most of the points raised by Oswin. And it's faster than a standard LAPACK for sizes above ~1k times 1k.

I recommend extracting the Householder reflections into a nice separate interface, since this functionality will also be needed for other algorithms like SVD or GMRES. As a nice side effect, it makes the implementation for QR more compact.

Generally, in order to get *any* reasonable performance, one really needs to use matrix-matrix multiplications where possible in order to avoid the memory bandwidth bottleneck.

Best regards,

On 05/16/2013 01:06 AM, oswin krause wrote:

These are further good points!

I also came up with a few new ones(and tips!):

- QR needs pivoting. It's main usages are SVD and pseudo-inverses. In
both cases the input does not necessary have full rank. Also pivoting
helps for matrices with high condition numbers.

- For the same reasons H was not formed explicitly, Q should not be
formed. Instead there should be a version of the algorithm which does
only return the reflection vectors forming Q.

- For dense matrices at least, it is possible to do the QR in-place by
storing the R part as lower triangular and the householder
transformation vectors in the upper triangular. (That is very similar to
how LU is implemented).

- The best sources for algorithmic information are the LAPACK working notes.

In your case lawn114 sems to be the most relevant, even though it
assumes a fast BLAS3.


On 16.05.2013 05:32, Nasos Iliopoulos wrote:
That's not a bad start.

I think Oswin covered a whole lot of items here, but a more complete
algorithm needs to satisfy some or all of  the following:

- The algortihm should have a dispatch mechanism so that optimized
versions for various matrix types can be provided. (sparse, banded,
etc.). You don't necessarily need to provide them all to start with.
- The algorithm should operate on matrix expressions rather than
matrices (so it can be applied to for example subranges). Static
dispatch or overload if for some reason this seems to reduce performance.
- Const correctness is important. Try  using const reference on
immutable types.
- Instead of 0.00025 provide a value based on a user choice.If it is
hard coded by the user, the compiler will  probably convert it into a
const value.
- Don't use ints for indexing, use either std::size_t, or
container::size_type. If you need a signed type (i.e. to count for
differences on unsigned types) use ptrdiff_t. uBlas containers provide
a difference_type typedef for that purpose (i.e.
- use noalias(..) =  in every assignment that the lhs is not a part of
rhs, or when the algebraic operation is mapped 1-1. (i.e. A=2*A+B can
be written as noalias(A)=2*A+B, but A=prod(B,A)+D cannot atm). This
provides more benefits than just avoiding temporaries.

The QR decomposition of a 100x100 matrix should take no more than a
few miliseconds (or even less than a milisecond) to run.
A 1000x1000 should take around 1/3 to 1/10 of a sec.

Compile with:
g++ -O3 -DNDEBUG Main.cpp -o qrtest

Then you'll see that your code runs pretty fast, but it doesn't scale
well as Oswin noted.

Best regards,

On 05/15/2013 10:12 PM, Salman Javaid wrote:
Thank you, Oswin for the detailed response. I am going to update the

David, Nasos, any advise on coding conventions? Or anything else that
you can possible suggest? I will stand grateful.

Best Regards,
Salman Javaid

On Tue, May 14, 2013 at 10:53 PM, oswin krause
<>> wrote:


    in the order I stumbled over the things:

    line 44-54: you don't need a copy, instead you should use a
    combination of row/subrange.
    line 58-60: you should take a look at inner_prod
    line 63: 0.00025 is too big.
    line 66: You should never create H explicitly.
    line 67: because you formed H, this step is O(n^3) which makes
    the whole algorithm O(n^4). This can be done in O(n^2)
    line 73-79: same applies here.


    On 14.05.2013 22:12, Salman Javaid wrote:
    Hello uBLAS Contributors:

                                         I have applied to GSoC 2013
    and pitched implementation of SVD factorization for uBLAS. In
    order to better prepare myself and to get my hands dirty at
    uBLAS, I ended up implementing QR Factorization employing
    Householder Reflections using uBLAS. This is only the first
    draft and will be needing significant improvement, e.g.,
    computation of QR decomposition of 100 * 100 matrix takes around
    30 seconds. But I guess just to get familiar with code base, it
    was a good exercise. Over the next week or two I will be trying
    to optimize the code.

                                         I will be absolutely
    grateful if contributors can have a quick glance at the code,
    and point me to any improvements they can suggest. Particularly
    in speeding up matrix multiplication.

    I used Visual Studio 2010 to compile the code. I will try to get
    the code running on my Ubuntu machine in a couple of days hopefully.

    Here the header file:

    The main file:

    Best Regards,
    Salman Javaid

    ublas mailing list  <>
    Sent  <>

    ublas mailing list <>
    Sent to: <>

ublas mailing list

ublas mailing list

ublas mailing list
Sent to:

ublas mailing list
Sent to:

ublas mailing list
Sent to:

ublas mailing list
Sent to: