On 24 Jan 2016, at 10:18, Oswin Krause <Oswin.Krause@ruhr-uni-bochum.de> wrote:

Hi,

I would still vote for rewriting uBLAS on top of BLAS bindings and providing a reasonable default implementation that also works well without memory assumptions. The main reason is that having only a fast gemm implementation does not really improve things, given that BLAS level 3 is quite a large beast.


Just a footnote to that point.  Once you have a fast GEMM, the rest of BLAS level 3 is not such a long road.  In particular,
the performance of SYMM, HEMM, TRMM, SYRK, … depends only on the performance of the ugemm micro kernel.

For example, in SYMM C = A*B you consider MCxMC blocks of the symmetric matrix A and MCxNC blocks of B.
Multiplication is done blockwise by first packing blocks into the buffers blockA and blockB.  If the elements of A are stored
in the upper triangular part, you have three cases:

(1) The block is completely in the upper part, and you pack it using the GEMM-pack.
(2) The block is completely in the lower part, and you pack its transpose using the GEMM-pack.
(3) The block is on the diagonal.  So you need an extra SYMM-pack routine that also packs the “virtual” elements (roughly 20 lines of code).

But after packing you can use the GEMM macro kernel (and thereby the GEMM micro kernel).
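
To make case (3) concrete, here is a minimal sketch of such a SYMM-pack for a diagonal block.  It is an illustration only, not the actual routine from any library: the panel height MR, the row-major layout with leading dimension lda, and the names sym_upper and pack_symm_diag are all assumptions made for this example.

```cpp
#include <cstddef>

// Hypothetical panel height expected by the micro kernel (assumption).
constexpr std::size_t MR = 4;

// Read element (i,j) of a symmetric matrix of which only the upper
// triangle is stored (row-major, leading dimension lda).  Elements
// below the diagonal are "virtual": they are fetched from the mirrored
// position (j,i) in the upper triangle.
inline double sym_upper(const double* A, std::size_t lda,
                        std::size_t i, std::size_t j)
{
    return (i <= j) ? A[i * lda + j] : A[j * lda + i];
}

// Pack an mc x mc diagonal block into horizontal panels of height MR.
// Inside each panel the elements are stored column by column, and a
// short last panel is padded with zeros.  The buffer must hold
// ((mc + MR - 1) / MR) * MR * mc doubles.
void pack_symm_diag(std::size_t mc, const double* A, std::size_t lda,
                    double* buffer)
{
    for (std::size_t i0 = 0; i0 < mc; i0 += MR) {
        for (std::size_t j = 0; j < mc; ++j) {
            for (std::size_t ii = 0; ii < MR; ++ii) {
                const std::size_t i = i0 + ii;
                *buffer++ = (i < mc) ? sym_upper(A, lda, i, j) : 0.0;
            }
        }
    }
}
```

The packed buffer then has the layout of a full (non-symmetric) block, so it can be handed to an unchanged GEMM macro kernel; the stored lower triangle of A is never read.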



I'm still willing to donate my partial uBLAS rewrite; unfortunately I am a bit short on time to polish it (I just finished my PhD and have a huge load of work on my desk). But if someone opened a git branch for that, I could try to make the code ready (porting my implementation back to boost namespaces etc.).


On 2016-01-23 18:53, palik imre wrote:
Hi All,
what's next?  I mean what is the development process for ublas?
Now we have a C-like implementation that outperforms both the
mainline, and the branch version (axpy_prod).  What will we do with
that?
As far as I see we have the following options:
1) Create a C++ template magic implementation out of it.  But for
this, at least we would need compile-time access to the target
instruction set.  Any idea how to do that?
2) Create a compiled library implementation out of it, and choose the
implementation run-time based on the CPU capabilities.
3) Include some good defaults/defines, and hope the user will use
them.
4) Don't include it, and do something completely different.
What do you think?
Cheers,
Imre
_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: Oswin.Krause@ruhr-uni-bochum.de