Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-24 05:19:09
On 24 Jan 2016, at 10:18, Oswin Krause <Oswin.Krause_at_[hidden]> wrote:
> Hi,
>
> I would still vote for the route of rewriting uBLAS based on BLAS bindings and providing a reasonable default implementation that also works well without memory assumptions. The main reason is that only having a fast gemm implementation does not really improve things, given that BLAS level 3 is quite a large beast.
Just a footnote to that point. Once you have a fast GEMM, the rest of BLAS level 3 is not such a long road. In particular, the performance of SYMM, HEMM, TRMM and SYRK just depends on the performance of the ugemm micro kernel.
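To make that concrete, here is a hedged sketch of what such a micro kernel looks like. It is not the actual uBLAS or ulmBLAS code; the names MR, NR and kc follow the usual BLIS-style blocking terminology, and the loop nest stands in for what a real kernel does with unrolling and SIMD registers:

```cpp
#include <cstddef>

// Sketch of a GEMM micro kernel: computes C += alpha * A * B for a
// small MR x NR block of C. A is a packed MR x kc panel and B a packed
// kc x NR panel, each stored as kc contiguous "slivers" of MR (resp.
// NR) values, which is what the packing routines produce.
constexpr std::size_t MR = 4;
constexpr std::size_t NR = 4;

void ugemm(std::size_t kc, double alpha,
           const double* A,       // packed panel: kc slivers of MR values
           const double* B,       // packed panel: kc slivers of NR values
           double* C, std::size_t ldC)
{
    double AB[MR * NR] = {};     // accumulator (register block in real kernels)

    for (std::size_t l = 0; l < kc; ++l)       // sequence of rank-1 updates
        for (std::size_t i = 0; i < MR; ++i)
            for (std::size_t j = 0; j < NR; ++j)
                AB[i * NR + j] += A[l * MR + i] * B[l * NR + j];

    for (std::size_t i = 0; i < MR; ++i)       // scale and write back to C
        for (std::size_t j = 0; j < NR; ++j)
            C[i * ldC + j] += alpha * AB[i * NR + j];
}
```

Because every Level-3 macro kernel funnels its inner work through this one routine, optimizing it (vectorization, register blocking) speeds up the whole family at once.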
For example, in SYMM C = A*B you consider MCxMC blocks of the symmetric matrix A and MCxNC blocks of B.
Multiplication is done blockwise, packing blocks first into buffers blockA and blockB. If the elements of A are stored
in the upper triangular part, you have three cases:
(1) The block is completely in the upper part and you pack it using the GEMM-pack.
(2) The block is completely in the lower part and you pack its transpose using the GEMM-pack.
(3) The block is on the diagonal, so you need an extra SYMM-pack routine that also packs the virtual elements (roughly 20 lines of code).
But after packing you can use the GEMM macro kernel (and thereby the GEMM micro kernel).
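A rough sketch of what case (3), the SYMM-pack for a diagonal block, might look like. This is an illustration, not the actual code: the function name is hypothetical, and for clarity it packs into a plain row-major buffer rather than the MR-row panels a real macro kernel consumes. The key point is simply that the "virtual" lower-triangle elements are materialized by mirroring the stored upper triangle:

```cpp
#include <cstddef>

// Pack a k x k diagonal block of a symmetric matrix A (row-major,
// leading dimension lda) whose elements are stored only in the upper
// triangle. Entries with i > j are "virtual": we fill them by mirroring
// the stored element (j, i) across the diagonal, so the resulting
// buffer is a full dense block the GEMM macro kernel can use as-is.
void pack_symm_diag(std::size_t k, const double* A, std::size_t lda,
                    double* buffer)
{
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < k; ++j)
            buffer[i * k + j] = (i <= j) ? A[i * lda + j]   // stored (upper)
                                         : A[j * lda + i];  // mirrored (virtual)
}
```

After this packing step nothing downstream needs to know the block came from a symmetric matrix, which is why the GEMM macro and micro kernels can be reused unchanged.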
>
> I'm still willing to donate my partial uBLAS rewrite; unfortunately I am a bit short on time to polish it (I just finished my PhD and have a huge load of work on my desk). But if someone opened a git branch for that, I could try to make the code ready (porting my implementation back to Boost namespaces etc.).
>
>
> On 2016-01-23 18:53, palik imre wrote:
>> Hi All,
>> what's next? I mean what is the development process for ublas?
>> Now we have a C-like implementation that outperforms both the
>> mainline and the branch version (axpy_prod). What will we do with
>> that?
>> As far as I see we have the following options:
>> 1) Create a C++ template magic implementation out of it. But for
>> this, at the least we would need compile-time access to the target
>> instruction set. Any idea how to do that?
>> 2) Create a compiled library implementation out of it, and choose the
>> implementation run-time based on the CPU capabilities.
>> 3) Include some good defaults/defines, and hope the user will use
>> them.
>> 4) Don't include it, and do something completely different.
>> What do you think?
>> Cheers,
>> Imre
>> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden]
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to: Oswin.Krause_at_[hidden]