From: Toon Knapen (toon.knapen_at_[hidden])
Date: 2002-06-27 19:36:35
On Thursday 27 June 2002 11:47, Martin Weiser wrote:
> Dear all,
> without giving a full review report (I haven't yet used uBLAS sufficiently
> to post a meaningful statement), I'd like to draw the attention to two
> points I haven't read so far in the discussion.
> [ If you're familiar with cache blocking techniques for linear algebra,
> you may wish to skip right to the end of the posting. ]
> Although the abstraction penalty is mostly insignificant compared to
> equivalent handwritten code, there remains a huge performance gap to
> optimized numerical kernels. I've done a preliminary comparison of dgemm
> performance with different matrix sizes N (see
> http://www.zib.de/weiser/ublas_review.gif), measuring the flops/s
> multiplying two square matrices of size N (N=1,...,1000). The competitors
> are a vendor optimized BLAS implementation (Sun Performance Library) and
> a straightforward naive "C" implementation. The environment is GCC 3.1
> (options -O3 -funroll-loops) on UltraSparc 10/300.
This type of graph would be useful as uBLAS documentation!
I also noticed this performance drop, and it was to be expected as long as
uBLAS does not contain blocking strategies. About 8 months ago, also involving
Andrew Lumsdaine, two different approaches were discussed that can serve to
obtain this maximal performance:
1) Bindings to BLAS
Most high-performance computer vendors put a lot of effort into optimising
BLAS on their machines, writing the kernels directly in assembler. I don't
think any compiler will ever be able to match this performance (first of all,
a compiler does not know about the different cache sizes that can be
exploited). This is at least a short-term solution.
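To make the cache-size remark concrete, here is a schematic sketch of the
cache blocking such optimised kernels use. This is purely illustrative, not a
vendor kernel; the block size BS is hypothetical and would be tuned per
machine in practice:

    const int BS = 64; /* hypothetical block size, machine-dependent */

    /* C += A * B for n x n row-major matrices; assumes n is a multiple
       of BS to keep the sketch short. The loops are tiled so that
       BS x BS sub-blocks of A, B and C stay resident in cache. */
    void blocked_gemm(int n, const double* A, const double* B, double* C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii + BS; ++i)
                        for (int k = kk; k < kk + BS; ++k)
                            for (int j = jj; j < jj + BS; ++j)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }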
I've also discussed this with Joerg and, as you suggest if I understand you
correctly, it is very hard to detect within an expression the sub-expressions
for which a BLAS call can be made. So the user should state explicitly whether
he wants a blas-gemm or a ublas-prod.
I started providing such bindings yesterday, and in the attachment you can
find my preliminary bindings.
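The attachment is not reproduced here, but roughly speaking such a binding
forwards the matrix data to the Fortran routine. A minimal sketch (hypothetical
names, and assuming dense column_major matrices so the layout matches Fortran
conventions) could look like:

    #include <boost/numeric/ublas/matrix.hpp>

    namespace ublas = boost::numeric::ublas;

    // Fortran BLAS entry point (note the trailing underscore).
    extern "C" void dgemm_(const char* transa, const char* transb,
                           const int* m, const int* n, const int* k,
                           const double* alpha,
                           const double* a, const int* lda,
                           const double* b, const int* ldb,
                           const double* beta, double* c, const int* ldc);

    typedef ublas::matrix<double, ublas::column_major> dmatrix;

    // C = A * B through the vendor BLAS; the user calls this
    // explicitly instead of writing C = prod(A, B).
    void blas_gemm(const dmatrix& A, const dmatrix& B, dmatrix& C)
    {
        const int m = static_cast<int>(A.size1());
        const int k = static_cast<int>(A.size2());
        const int n = static_cast<int>(B.size2());
        const double alpha = 1.0, beta = 0.0;
        dgemm_("N", "N", &m, &n, &k,
               &alpha, &A.data()[0], &m,
               &B.data()[0], &k,
               &beta, &C.data()[0], &m);
    }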
2) Optimising the uBLAS implementation
The review is mostly about the interface. Optimising the C++ implementation
for the different compilers should IMHO be looked at in a following stage.
This is very tricky and very subtle (just look at Jeremy's thesis, which notes
the performance differences between using an '==' or a '<' condition in for
loops), and it needs to be revisited every time a new compiler is released.
This is a long-term solution.
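To illustrate the kind of detail meant (a schematic example only; which form
wins depends entirely on the compiler and release):

    // Semantically identical loops; some compilers generate
    // measurably different code for the two termination tests.
    void scale_lt(double* x, int n, double a)
    {
        for (int i = 0; i < n; ++i)   // '<' termination test
            x[i] *= a;
    }

    void scale_ne(double* x, int n, double a)
    {
        for (int i = 0; i != n; ++i)  // '!=' ('==') termination test
            x[i] *= a;
    }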
I do think it can really coexist with the BLAS calls, because for small
matrices it's easier, and, if the uBLAS implementation is heavily optimised,
long expressions could be quicker (due to loop fusion) than several calls to
BLAS (which require creating temporaries). For instance, when multiplying 3
matrices, you need 2 gemm calls and a temporary to store the result of the
first gemm, as in the sketch below. Here an optimised uBLAS should be able to
outpace BLAS (correct me if I'm wrong, please).
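As a sketch of that example (illustrative only, written with uBLAS as it
stands today, where the temporary is explicit):

    #include <boost/numeric/ublas/matrix.hpp>

    namespace ublas = boost::numeric::ublas;

    // R = A * B * C in BLAS style: two products, one explicit
    // temporary T holding the result of the first one. A fused,
    // heavily optimised evaluation could avoid materialising T.
    ublas::matrix<double> multiply3(const ublas::matrix<double>& A,
                                    const ublas::matrix<double>& B,
                                    const ublas::matrix<double>& C)
    {
        ublas::matrix<double> T = ublas::prod(B, C); // first "gemm" + temporary
        return ublas::prod(A, T);                    // second "gemm"
    }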