From: Martin Weiser (weiser_at_[hidden])
Date: 2002-06-28 05:01:41
On Freitag, 28. Juni 2002 02:36, Toon Knapen wrote:
> I also noticed this performance drop, and it was to be expected unless uBLAS
> contained blocking strategies. About 8 months ago, in discussions also
> involving Andrew Lumsdaine, two different approaches were identified that can
> serve to obtain maximal performance:
> 1) Bindings to BLAS
> Most high-performance computer vendors put a lot of effort into optimising
> BLAS on their machines, writing the kernels directly in assembler. I don't
> think any compiler is ever able to match this performance (for one thing, a
> compiler does not know about the different cache sizes that can be exploited).
> This is at least a short-term solution.
> I've also discussed this with Joerg and, as you suggest if I understand you
> correctly, it is very hard to detect within an expression the sub-expression
> for which a BLAS call can be made. So the user should state explicitly whether
> he wants a blas-gemm or a ublas-prod.
> I started providing such bindings yesterday; attached you can find my
> preliminary bindings.
Great. I'd suggest incorporating such bindings into the uBLAS distribution, at
least in an "unsupported contributions - adapt to your needs" way.
> 2) optimising the uBLAS implementation
> The review is mostly about the interface. Optimising the C++ implementation
> for the different compilers should be looked at in a following stage IMHO.
Sure. Sometimes, however, the interface more or less determines which
optimizations are possible: if you simply can't pass certain information
through the interface, you have to recompute it, and so on.
I didn't mean to request that uBLAS be optimized as a condition of acceptance
into Boost. Instead I intended to draw some attention to the question of
whether the interface or design prevents or permits certain optimizations.
Since I'm not familiar with the code, it's quite possible I simply overlooked
the suitable solution when I considered implementing a dgemm specialization
for prod().
> This is very tricky and very subtle (just look at Jeremy's thesis, which
> notes the performance differences between using an '==' or a '<' condition
> in for loops), and it needs to be updated every time a new compiler is
> released.
> This is a long-term solution.
> I do think it can really coexist with the BLAS calls, because for small
> matrices it's easier and, if the uBLAS implementation is heavily optimised,
> long expressions could be quicker (due to loop fusion) than several calls to
> BLAS (and creating temporaries). For instance, when multiplying 3 matrices,
> you need 2 gemm calls and a temporary to store the result of the first gemm.
> Here an optimised uBLAS should be able to outpace BLAS (correct me if I'm
> wrong please).
I guess you're completely right for small to mid-size matrices. For large
ones, of course, the n^3 operation count should simply dominate the n^2
temporaries (for n = 1000, two dgemm calls cost on the order of 4*10^9 flops,
while the extra temporary holds only 10^6 entries).
Given the amount of work required for 2) and the heterogeneous interface of 1),
the alternative 1.5) may give 90% of the benefit for 10% of the cost:
1.5) Specializing prod() for suitable operations to call xgemm under the hood.
--
Dr. Martin Weiser                Zuse Institute Berlin
weiser_at_[hidden]               Scientific Computing
http://www.zib.de/weiser         Numerical Analysis and Modelling
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk