Boost :
From: Joerg Walter (jhr.walter_at_[hidden])
Date: 2003-05-15 16:57:38
Hi Csaba,
you wrote:
> Once I played with comparing the Intel Math Kernel with some
> implementations I did. I found that loop unrolling may double the speed.
> Still, I could never quite get close to the speed of the Intel Math Kernel
> (for large matrices), presumably due to insufficient caching.
> The above applies to row-major matrices. For column-major matrices,
> loop unrolling achieved the same speed as the Intel Math Kernel.
I've been playing with loop unrolling in the past, too (see
BOOST_UBLAS_USE_DUFF_DEVICE), but I never found a satisfactory solution to
that performance problem. Compilers already seem sufficiently stressed by
the templated code.
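[For readers unfamiliar with the technique behind the BOOST_UBLAS_USE_DUFF_DEVICE switch: Duff's device interleaves a `switch` with an unrolled `do`-`while`, jumping into the middle of the unrolled body so the remainder of `n` modulo the unroll factor needs no separate cleanup loop. A minimal standalone sketch (not the uBLAS source itself):]

```cpp
#include <cassert>
#include <cstddef>

// Duff's device copy, unrolled by four. The switch jumps into the
// middle of the do-while body, so n % 4 leftover elements are copied
// on the first (partial) pass and every later pass copies four.
void duff_copy(double* to, const double* from, std::size_t n) {
    if (n == 0)
        return;
    std::size_t count = (n + 3) / 4;      // number of do-while passes
    switch (n % 4) {
    case 0: do { *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--count > 0);
    }
}
```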
> Below is the code if anyone wants to give it a try.
> Maybe ublas could make use of some performance optimizations..
For small matrices I'm still waiting for the first (or next? ;-) compiler
to vectorize inlined template code (ICC is the hottest candidate; I never
had a chance to check KAI). For larger matrices I've been playing with some
crude high-level optimizations, see
http://groups.yahoo.com/group/ublas-dev/message/461
I don't know if they're really useful.
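[One common high-level optimization for the large-matrix, insufficient-caching case mentioned above is cache blocking. The following is a generic sketch of the idea, not necessarily what the linked message proposes: computing C += A*B on small square tiles so each tile stays cache-resident while it is reused. The tiny block size is only for readability; real code would size it to the cache.]

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Cache-blocked C += A * B for row-major n x n matrices. The three
// outer loops walk bs x bs tiles; the three inner loops do the
// textbook product within a tile, reusing tile data while it is hot.
void gemm_blocked(const std::vector<double>& a, const std::vector<double>& b,
                  std::vector<double>& c, std::size_t n, std::size_t bs = 2) {
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t kk = 0; kk < n; kk += bs)
            for (std::size_t jj = 0; jj < n; jj += bs)
                for (std::size_t i = ii; i < std::min(ii + bs, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + bs, n); ++k)
                        for (std::size_t j = jj; j < std::min(jj + bs, n); ++j)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```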
> (better it should be connected to some optimized blas implementations..?)
Yep. Either low level (using explicit bindings) or high level (using
specialized evaluators). Both have been discussed in the past and remain
undecided.
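[The "specialized evaluator" route could be sketched roughly as follows — a hypothetical illustration, not an uBLAS design: tag dispatch selects an evaluator per matrix kind, so a product on a dense type could be routed to an optimized BLAS kernel while everything else keeps the generic loop. The "optimized" backend here is a stub standing in for a real dgemm call.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct generic_tag {};  // fall back to the generic evaluator
struct blas_tag {};     // route to an optimized backend

// Generic evaluator: textbook triple loop over row-major n x n data.
std::vector<double> prod_impl(const std::vector<double>& a,
                              const std::vector<double>& b,
                              std::size_t n, generic_tag) {
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Stub for the specialized evaluator; a real binding would call an
// optimized BLAS dgemm here instead of delegating to the generic loop.
std::vector<double> prod_impl(const std::vector<double>& a,
                              const std::vector<double>& b,
                              std::size_t n, blas_tag) {
    return prod_impl(a, b, n, generic_tag());
}

// User-facing entry point: the tag picks the evaluator at compile time.
template <class Tag>
std::vector<double> prod(const std::vector<double>& a,
                         const std::vector<double>& b, std::size_t n) {
    return prod_impl(a, b, n, Tag());
}
```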
Thanks,
Joerg
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk