From: Matthew Hurd (matt_at_[hidden])
Date: 2002-06-27 20:57:16
From: "Martin Weiser" <weiser_at_[hidden]>
> Although the abstraction penalty is mostly insignificant compared to
> equivalent handwritten code, there remains a huge performance gap to
> optimized numerical kernels. I've done a preliminary comparison of dgemm
> performance with different matrix sizes N (see
> http://www.zib.de/weiser/ublas_review.gif), measuring the flops/s
> multiplying two square matrices of size N (N=1,...,1000). The competitors
> are a vendor optimized BLAS implementation (Sun Performance Library) and
> a straightforward naive "C" implementation. The environment is GCC 3.1
> (options -O3 -funroll-loops) on UltraSparc 10/300.
This is perhaps a little off topic, but related to uBLAS optimisation and
measuring true performance.
A factor of ten is a factor of ten, however...
Everyone has always used these observed FLOPpy charts for comparison;
however, I think this is wrong.
If I did n x n matrix multiplication in n^4 operations (ridiculous, but hold
on) with efficient use of the FPU(s), I might get great flops but lousy
actual throughput.
Strassen's algorithm is O(n^(log2 7)) = O(n^2.81), but it has higher
overhead and is thus considered unsuitable for smaller n, according to the
text in front of me. Coppersmith and Winograd have an algorithm that runs in
O(n^2.376).
For n = 200 we are looking at roughly 27 and 9.8 times the number of
floating point operations for naive n^3 and Strassen, respectively, relative
to Coppersmith-Winograd. This ignores some possibly large constant factors.
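Those ratios fall straight out of the exponents, ignoring the constant
factors just mentioned. A quick check:

```python
from math import log2

n = 200
naive_exp = 3.0          # textbook triple-loop multiply
strassen_exp = log2(7)   # ~= 2.807
cw_exp = 2.376           # Coppersmith-Winograd

# ratio of operation counts: n^a / n^b = n^(a - b)
naive_ratio = n ** (naive_exp - cw_exp)
strassen_ratio = n ** (strassen_exp - cw_exp)

print(round(naive_ratio, 1))     # ~27.3
print(round(strassen_ratio, 1))  # ~9.8
```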
So I think benchmarking performance by measuring FLOPS is noble in trying to
get a clean throughput figure, but fraught with danger w.r.t. true matrix
multiplication throughput. For example, in the chart above, the uBLAS
implementation could be faster at doing the matrix calc than the Sun BLAS if
uBLAS were 27 times more frugal with floating point operations than the Sun
BLAS, despite the latter's 10x advantage in raw FLOP throughput.
Perhaps a better measure would be:
n x n matrices per second * n ^ 3 = Matt's Notional FLOPs.
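A minimal sketch of that metric, assuming a pure-Python naive multiply as a
stand-in for whatever implementation is being benchmarked (the function
names here are mine, not from uBLAS or any BLAS):

```python
import time

def naive_matmul(a, b, n):
    # textbook O(n^3) triple loop over n x n lists of lists
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

def notional_flops(mult, n, a, b):
    # Credit ANY multiply routine with the n^3 operations of the naive
    # algorithm, so a smarter algorithm scores higher, not lower.
    start = time.perf_counter()
    mult(a, b, n)
    elapsed = time.perf_counter() - start
    return n ** 3 / elapsed

n = 60
a = [[1.0] * n for _ in range(n)]
b = [[1.0] * n for _ in range(n)]
print(f"{notional_flops(naive_matmul, n, a, b):.3g} notional FLOP/s")
```

The point of dividing n^3 (rather than the operations actually executed) by
the elapsed time is that the metric then rewards doing the *job* faster, not
merely keeping the FPU busy.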
Then again, perhaps you are doing this already, as the easiest thing to
measure is just what I put above... If so, I'm just going on about nothing.
Just a nit in some respects, but it may lead to an implementation that
chooses different algorithms for various sizes or decompositions, just
as a good sort implementation tends to do.
Another $0.02 from someone rude enough not to compile and run the code (OK,
I will this weekend after I coach my kid's soccer team to another huge win).