From: Joerg Walter (jhr.walter_at_[hidden])
Date: 2002-06-27 17:24:17
----- Original Message -----
From: "Martin Weiser" <weiser_at_[hidden]>
Sent: Thursday, June 27, 2002 5:47 PM
Subject: Re: [boost] uBLAS formal review
> Dear all,
> without giving a full review report (I haven't yet used uBLAS sufficiently
> to post a meaningful statement), I'd like to draw the attention to two
> points I haven't read so far in the discussion.
> [ If you're familiar with cache blocking techniques for linear algebra,
> you may wish to skip right to the end of the posting. ]
> Although the abstraction penalty is mostly insignificant compared to
> equivalent handwritten code, there remains a huge performance gap to
> optimized numerical kernels. I've done a preliminary comparison of dgemm
> performance with different matrix sizes N (see
> http://www.zib.de/weiser/ublas_review.gif), measuring the flops/s
> multiplying two square matrices of size N (N=1,...,1000). The competitors
> are a vendor optimized BLAS implementation (Sun Performance Library) and
> a straightforward naive "C" implementation. The environtment is GCC 3.1
> (options -O3 -funroll-loops) on UltraSparc 10/300.
> The usual four diffrent regimes are clearly visible:
> 1. Very small matrices (N=1,...,5). Here the naive implementation wins due
> to its simplicity, which allows for probably perfect inlining. uBLAS and
> Sun BLAS are on par. Probably the function call for not inlined functions
This coincides with my experience.
> 2. Medium size matrices (N=7,..,30). All operands fit into the 1st level
> cache, and uBLAS and the naive implementation are on par. Sun BLAS wins
> by a factor of 2 or 3, which I think is to be attributed to the better
> optimization of the tuned vendor BLAS.
> 3. Large matrices (N=40,..,250). The operands don't fit any longer into
> the 1st level cache. As expected, the lower memory bandwidth of the 2nd
> level cache decreases the performance of both uBLAS and the naive
> implementation. Due to blocked algorithms the Sun BLAS maintains its high
> performance and wins by a factor of about 7. The great surprise is, that
> the naive implementation suffers less than uBLAS and wins by a factor of
> about 1.7. Currently I've no explanation for that.
We're switching from indexing to iterating access at N = 32. You could play
with NUMERICS_ITERATOR_THRESHOLD to see the impact on your platform.
> 4. Very large matrices (N>300). The operands don't fit any longer into the
> 2nd level cache, and main memory bandwidth limits the performance. uBLAS
> and the naive implementation are almost on par (factor 1.2), but the Sun
> BLAS maintains its performance and wins by a factor between 15 and 25.
> Since dense matrix matrix multiply is (or can be) a very important and
> central building block for numerical linear algebra algorithms, this
> performance behaviour is quite unsatisfactory. Don't get me wrong, I
> don't intend to criticise the great work by Jörg et al., but I think I
> should back up the suggestions below.
> [ If you skipped the description, read on from here. ]
> (i) Given the performance drop for large matrices, there should be a
> possibility to supply high performance, specialized algorithms for core
> algorithms such as _gemm. I tried to follow the instantiation and
> execution path for C.assign(prod(A,B)) and got the impression that it
> would be a hard job intercepting the expression template chain in order
> to provide a specialized algorithm (correct me if I'm wrong).
We'd have to specialize vector_assign<> and matrix_assign<>.
> The changes deep inside the ET machinery necessary to allow such
> augmentation probably do not require an interface change and hence seem
> to be well suited for the time after the review. However, I'd like the
> implementors to think about this topic and maybe write a paragraph in the
> documentation concerning the possibility, status, and future of such
> (ii) For very small vectors and matrices (say, for handling geometric
> quantities in R^2 or R^3), fixing the size at compile time may enable the
> compiler to perform perfect loop unrolling and hence speed up the
I'm very unsure, if this is feasible. (It's like unifying compile time and
run time, so we'd need a virtual machine ;-).
> I'd like to see a paragraph in the documentation whether it's possible to
> implement and mix compile time fixed size vectors/matrices with existing
> uBLAS components, and perhaps even a hint how to implement/use such a
> Besides this, I'd like to whisper a mostly unqualified "yes" for inclusion
> of uBLAS into Boost.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk