Hi,
there is one more thing I want to comment on, and this is on the more
serious side:
On 23.03.2013 16:15, Nasos Iliopoulos wrote:
David,
Since mdsd:array is a generic multi-dimensional container it is
not bound to algebraic operations. I expect that with proper
aligned memory allocation and SSE algorithms (it is easy to add
a custom storage container that supports that) it will be as
fast as MKL, GotoBLAS, Eigen or Armadillo. I believe that within
that context, a GSOC project will need to include both the
matrix container and the SSE algorithms tasks, or even AVX. (
http://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
and also on the starting post from David himself:
On 23.03.2013 13:47, David Bellot wrote:
OK, the idea behind this is to have a clean
framework to enable optimization based on
SSE, Neon, multi-core, ... you name it.
Just to make this clear: in the current state of the library, SSE,
AVX, multi-core computation etc. won't make a difference as soon as
the arguments involved are bigger than ~32KB. In that case uBLAS
performance is memory bound, so all we would gain is waiting more
efficiently for the next block of memory. And even if it were not,
the way uBLAS is designed makes it impossible to use vectorization
aside from the c-style functions like axpy_prod, which can in 99% of
all relevant cases be mapped onto BLAS2/BLAS3 calls of the optimized
C libraries (which give you AVX/SSE and OpenMP for free). If you
expect that SSE will help you when computing your
A+=prod(B,C);
then you will be desperately disappointed with the current design.
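To make the difference concrete, here is a minimal sketch (matrix sizes
are arbitrary and chosen by me) contrasting the expression-template form
with the c-style axpy_prod call, which is the one that could actually be
forwarded to an optimized BLAS gemm:

    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/operation.hpp>  // axpy_prod

    using namespace boost::numeric::ublas;

    int main() {
        matrix<double> A(512, 512), B(512, 512), C(512, 512);
        A.clear(); B.clear(); C.clear();  // zero-initialize

        // Expression-template form: evaluated element by element,
        // nothing here can be dispatched to a vectorized kernel.
        A += prod(B, C);

        // C-style form: a dedicated kernel that a BLAS-backed build
        // could forward to dgemm.
        axpy_prod(B, C, A, false);  // init == false means A += B * C
    }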
Now maybe some of you are thinking: "But all fast linear algebra
libraries are using SSE, so you must be wrong". Simple answer: these
libraries are not memory bound, because they explicitly optimize
their memory access patterns. You can experience this yourself by
comparing the performance of copying a big matrix with that of
transposing it. Then try the transposition block-wise: allocate a
small buffer, say 16x16 elements, read 16x16 blocks from the matrix,
write them transposed into the buffer and then copy the buffer to
the correct spot in the target matrix. This gives a factor-of-7
speed-up on my machine; no SSE, no AVX.
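For the curious, a minimal sketch of that block-wise transposition
(buffer size, row-major layout and the function name are my choices,
this is not uBLAS code):

    #include <algorithm>
    #include <cstddef>

    // Transpose an n x n row-major matrix block-wise through a small
    // buffer. BS == 16 gives the 16x16 blocks from the example above.
    void transpose_blocked(const double* src, double* dst, std::size_t n) {
        const std::size_t BS = 16;
        double buf[BS * BS];
        for (std::size_t bi = 0; bi < n; bi += BS) {
            for (std::size_t bj = 0; bj < n; bj += BS) {
                const std::size_t h = std::min(BS, n - bi);  // edge blocks
                const std::size_t w = std::min(BS, n - bj);
                // read one block and transpose it into the buffer
                for (std::size_t i = 0; i < h; ++i)
                    for (std::size_t j = 0; j < w; ++j)
                        buf[j * BS + i] = src[(bi + i) * n + (bj + j)];
                // copy the buffer to its transposed spot in the target
                for (std::size_t j = 0; j < w; ++j)
                    for (std::size_t i = 0; i < h; ++i)
                        dst[(bj + j) * n + (bi + i)] = buf[j * BS + i];
            }
        }
    }

Each 16x16 block of the source and the target now stays in cache while
it is being processed, so every cache line is fully used before it is
evicted; that is where the speed-up comes from.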
Don't trust me, trust the author of the GotoBLAS library:
Goto, Kazushige, and Robert A. van de Geijn. "Anatomy of
high-performance matrix multiplication." ACM Transactions on
Mathematical Software (TOMS) 34.3 (2008): 12.
None of us has enough time to implement fast linear algebra
algorithms ourselves. Instead we should fall back to the numeric
bindings as often as possible and use the power of expression
templates to generate an optimal sequence of BLAS2/BLAS3 calls.
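As a rough illustration of what such a generated call could look like
(this uses the plain CBLAS interface rather than the actual numeric
bindings, and the function name add_prod is just mine), the
A += prod(B, C) example from above becomes a single dgemm:

    #include <boost/numeric/ublas/matrix.hpp>
    #include <cblas.h>

    using boost::numeric::ublas::matrix;

    // Evaluate A += B * C by handing the dense, row-major storage of
    // the default uBLAS matrix<double> to an optimized BLAS3 kernel.
    void add_prod(matrix<double>& A,
                  const matrix<double>& B, const matrix<double>& C) {
        const int m = B.size1(), k = B.size2(), n = C.size2();
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0, &B(0, 0), k,   // B is m x k, leading dimension k
                         &C(0, 0), n,   // C is k x n, leading dimension n
                    1.0, &A(0, 0), n);  // beta == 1.0: accumulate into A
    }

The expression-template machinery would be responsible for picking the
kernel, the transposition flags and the alpha/beta factors from the
expression instead of hard-coding them as done here.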
I would also like to take part in that effort if it happens.
Greetings,
Oswin