Hi,
I would still vote for the route to rewrite uBLAS based on BLAS bindings and providing a reasonable default implementation that also works well without memory assumptions.The main reason is that only having a fast gemm implementation does not really improve things, given that BLAS level 3 is a quite large beast.
Just a footnote to that point. Once you have a fast GEMM the rest of BLAS level 3 is not such a long road. In particular
all the performance of SYMM, HEMM, TRMM, SYRK, … just depends on the performance of the ugemm micro kernel.
For example in SYMM C = A*B you consider MCxMC blocks of the symmetric matrix A and MCxNC blocks of B.
Multiplication is done clockwise by packing blocks first in buffers blockA and blockB. If the elements of A are stored
in the upper triangular part you have three cases:
(1) The block is completely in the upper part and you pack it using the GEMM-pack
(2) The block is completely in the lower part and you pack its transposed using the GEMM-pack
(3) The block is on the diagonal. So you need an extra SYMM-pack routine that also packs the “virtual” elements (so 20 lines of code)
But after packing you can use the GEMM macro kernel (and thereby the GEMM micro kernel).