Hi,

there is one more thing I want to comment on, and this is on the more serious side:

On 23.03.2013 16:15, Nasos Iliopoulos wrote:
David,
Since mdsd:array is a generic multi-dimensional container, it is not bound to algebraic operations. I expect that with properly aligned memory allocation and SSE algorithms (it is easy to add a custom storage container that supports this) it will be as fast as MKL, GotoBLAS, Eigen or Armadillo. I believe that within that context a GSOC project will need to include both the matrix container and the SSE algorithm tasks, or even AVX (http://en.wikipedia.org/wiki/Advanced_Vector_Extensions).


and also the starting post from David himself:

On 23.03.2013 13:47, David Bellot wrote:
OK, the idea behind this is to have a clean framework to enable optimization based on SSE, Neon, multi-core, ... you name it.

Just to make this clear: in the current state of the library, SSE, AVX, multi-core computation etc. won't cut it as soon as the arguments involved are bigger than ~32KB. In that case uBLAS performance is memory bound, so all we would gain is waiting more efficiently for the next block of memory. And even if it were not, the way uBLAS is designed makes it impossible to use vectorization aside from the c-style functions like axpy_prod, which in 99% of all relevant cases can be mapped onto BLAS2/BLAS3 calls of the optimized C libraries (which give you AVX/SSE and OpenMP for free). If you expect SSE to help you when computing

A+=prod(B,C);

then you will be desperately disappointed in the current design.
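Just for illustration, here is a minimal sketch of the kind of BLAS3 call such an expression could be lowered to. This is only my own sketch, not existing uBLAS code: it assumes row-major double matrices stored contiguously, a CBLAS header being available, and the helper name add_prod is made up:

#include <vector>
#include <cblas.h>   // any CBLAS implementation (OpenBLAS, ATLAS, MKL, ...)

// Hypothetical helper: A += B * C for row-major double matrices,
// delegated in one call to an optimized BLAS3 kernel (dgemm).
void add_prod(std::vector<double>& A, const std::vector<double>& B,
              const std::vector<double>& C, int m, int k, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, B.data(), k,   // B is m x k
                     C.data(), n,   // C is k x n
                1.0, A.data(), n);  // beta = 1 gives the "+=" semantics, A is m x n
}

One such call does the whole product in one go, with all the vectorization and threading of the underlying BLAS.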

Now maybe some of you are thinking: "But all fast linear algebra libraries are using SSE, so you must be wrong." Simple answer: these libraries are not memory bound, because they optimize for exactly that. You can experience this yourself by comparing the performance of copying a big matrix to transposing it. Then try the transposition block-wise: allocate a small buffer, say 16x16 elements, read 16x16 blocks from the matrix, write them transposed into the buffer and then copy the buffer to the correct spot in the target matrix. This gives a factor 7 speed-up on my machine, with no SSE and no AVX.
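To make the blocking argument concrete, here is a rough sketch of both variants, plain loops only, no intrinsics; the 16x16 block size is the one from above, and the exact speed-up will of course depend on the machine:

#include <algorithm>
#include <cstddef>
#include <vector>

// Naive transpose of an n x n row-major matrix: the writes walk the target
// column by column, so almost every write misses the cache for large n.
void transpose_naive(const std::vector<double>& src, std::vector<double>& dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

// Blocked transpose: read a 16x16 tile, transpose it into a small buffer,
// then copy the buffer contiguously to its spot in the target. Reads and
// writes now stay within a few cache lines per tile.
void transpose_blocked(const std::vector<double>& src, std::vector<double>& dst, std::size_t n)
{
    const std::size_t bs = 16;
    double buf[bs * bs];
    for (std::size_t i0 = 0; i0 < n; i0 += bs)
        for (std::size_t j0 = 0; j0 < n; j0 += bs) {
            const std::size_t ib = std::min(bs, n - i0);
            const std::size_t jb = std::min(bs, n - j0);
            for (std::size_t i = 0; i < ib; ++i)      // tile -> buffer, transposed
                for (std::size_t j = 0; j < jb; ++j)
                    buf[j * ib + i] = src[(i0 + i) * n + (j0 + j)];
            for (std::size_t j = 0; j < jb; ++j)      // buffer -> target, contiguous rows
                for (std::size_t i = 0; i < ib; ++i)
                    dst[(j0 + j) * n + (i0 + i)] = buf[j * ib + i];
        }
}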

Don't trust me, trust the author of the GotoBLAS library:

Goto, Kazushige, and Robert A. van de Geijn. "Anatomy of high-performance matrix multiplication." ACM Transactions on Mathematical Software (TOMS) 34.3 (2008): 12.

None of us has enough time to implement fast linear algebra algorithms ourselves. Instead we should fall back to the numeric bindings as often as possible and use the power of expression templates to generate an optimal sequence of BLAS2/BLAS3 calls.
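To illustrate what I mean, a small usage sketch of the c-style path that already exists (axpy_prod lives in <boost/numeric/ublas/operation.hpp>; the sizes and fill values are arbitrary):

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/operation.hpp>   // axpy_prod

namespace ublas = boost::numeric::ublas;

int main()
{
    ublas::matrix<double> A = ublas::zero_matrix<double>(512, 512);
    ublas::matrix<double> B = ublas::scalar_matrix<double>(512, 256, 1.0);
    ublas::matrix<double> C = ublas::scalar_matrix<double>(256, 512, 2.0);

    // Expression-template form: evaluated element by element, nothing for an
    // optimized BLAS3 kernel to grab onto.
    // A += ublas::prod(B, C);

    // c-style form: one call for the whole product (A += B * C), which is
    // exactly the granularity a bindings/BLAS backend needs.
    ublas::axpy_prod(B, C, A, false);
}

Ideally the expression templates would emit calls of exactly this shape into the bindings automatically, instead of the user having to write them by hand.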

I would also like to take part in that if it happens.

Greetings,
Oswin