From: Martin Weiser (weiser_at_[hidden])
Date: 2002-06-27 10:47:13
without giving a full review report (I haven't yet used uBLAS sufficiently
to post a meaningful statement), I'd like to draw the attention to two
points I haven't read so far in the discussion.
[ If you're familiar with cache blocking techniques for linear algebra,
you may wish to skip right to the end of the posting. ]
Although the abstraction penalty is mostly insignificant compared to
equivalent handwritten code, there remains a huge performance gap to
optimized numerical kernels. I've done a preliminary comparison of dgemm
performance with different matrix sizes N (see
http://www.zib.de/weiser/ublas_review.gif), measuring the flops/s
multiplying two square matrices of size N (N=1,...,1000). The competitors
are a vendor optimized BLAS implementation (Sun Performance Library) and
a straightforward naive "C" implementation. The environtment is GCC 3.1
(options -O3 -funroll-loops) on UltraSparc 10/300.
The usual four diffrent regimes are clearly visible:
1. Very small matrices (N=1,...,5). Here the naive implementation wins due
to its simplicity, which allows for probably perfect inlining. uBLAS and
Sun BLAS are on par. Probably the function call for not inlined functions
2. Medium size matrices (N=7,..,30). All operands fit into the 1st level
cache, and uBLAS and the naive implementation are on par. Sun BLAS wins
by a factor of 2 or 3, which I think is to be attributed to the better
optimization of the tuned vendor BLAS.
3. Large matrices (N=40,..,250). The operands don't fit any longer into
the 1st level cache. As expected, the lower memory bandwidth of the 2nd
level cache decreases the performance of both uBLAS and the naive
implementation. Due to blocked algorithms the Sun BLAS maintains its high
performance and wins by a factor of about 7. The great surprise is, that
the naive implementation suffers less than uBLAS and wins by a factor of
about 1.7. Currently I've no explanation for that.
4. Very large matrices (N>300). The operands don't fit any longer into the
2nd level cache, and main memory bandwidth limits the performance. uBLAS
and the naive implementation are almost on par (factor 1.2), but the Sun
BLAS maintains its performance and wins by a factor between 15 and 25.
Since dense matrix matrix multiply is (or can be) a very important and
central building block for numerical linear algebra algorithms, this
performance behaviour is quite unsatisfactory. Don't get me wrong, I
don't intend to criticise the great work by Jörg et al., but I think I
should back up the suggestions below.
[ If you skipped the description, read on from here. ]
(i) Given the performance drop for large matrices, there should be a
possibility to supply high performance, specialized algorithms for core
algorithms such as _gemm. I tried to follow the instantiation and
execution path for C.assign(prod(A,B)) and got the impression that it
would be a hard job intercepting the expression template chain in order
to provide a specialized algorithm (correct me if I'm wrong).
The changes deep inside the ET machinery necessary to allow such
augmentation probably do not require an interface change and hence seem
to be well suited for the time after the review. However, I'd like the
implementors to think about this topic and maybe write a paragraph in the
documentation concerning the possibility, status, and future of such
(ii) For very small vectors and matrices (say, for handling geometric
quantities in R^2 or R^3), fixing the size at compile time may enable the
compiler to perform perfect loop unrolling and hence speed up the
I'd like to see a paragraph in the documentation whether it's possible to
implement and mix compile time fixed size vectors/matrices with existing
uBLAS components, and perhaps even a hint how to implement/use such a
Besides this, I'd like to whisper a mostly unqualified "yes" for inclusion
of uBLAS into Boost.
-- Dr. Martin Weiser Zuse Institute Berlin weiser_at_[hidden] Scientific Computing http://www.zib.de/weiser Numerical Analysis and Modelling
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk