
Ublas : 
Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 20160121 17:23:06
Hi Nasos,
first of all I don’t want to take wrong credits and want to point out that this is not my algorithm. It is based on
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
For a few cores (48) it can easily made multithreaded. For manycores like Intel Xeon Phi this is a bit more
sophisticated but still not too hard. The demo I posted does not use micro kernels that exploit SSE, AVX or
FMA instructions. With that the matrix product is on par with Intel MKL. Just like BLIS. For my platforms I wrote
my own microkernels but the interface of function ugemm is compatible to BLIS.
Maybe you could help me to integrate your code in the benchmark example I posted above.
About Blaze: Do they have their own implementation of a matrixmatrix product? It seems to require a
tuned BLAS implementation (“Otherwise you get only poor performance”) for the matrixmatrix product.
IMHO they only have tuned the “easy” stuff like BLAS Level1 and Level2. In that case it makes more
sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong,
let me know.
About the block size: In my experience you get better performance if you chose them dynamically at runtime
depending on the problem size. Depending on the architecture you can just specify ranges like 256  384 for
blocking factor MC. In my code it then also needs to satisfy the restriction that it can be divided by factor MR.
I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible and having
blocks of similar sizes easily pays of.
Cheers,
Michael
On 21 Jan 2016, at 22:41, nasos <nasos_i_at_[hidden]> wrote:
> Hello all,
> The algorithm in this branch is aiming at cascading kernels and is by no means complete. Each kernel represents various sizes of the block matrix used in the inner loop of the multiplication. The multiplication is performed by picking the kernel that would provide the highest performance for certain matrices dimensions ( I think this has not been pushed in this branch).
>
> It is considerably faster (last time I checked) from typical block algorithms that are using fixed size block sizes, especially for nonsquare cases. In square cases it is more or less the same as typical  explicit vectorization  block based  algorithms. This applies when comparing to most linalg libraries out there. It is also trivially paralellizable.
>
> There are some issues in fully bringing this into ublas, but they can be probably dealt with. Here are a few:
>
> 1. Different CPU architectures require different block sizes and different kernel cascade architectures. Hence there is the need for a facility to define these at compile time.
>
> 2. Statically cascading the kernels end into large compilation times and large executables. I haven't performed any benchmarks using dynamic block sizes to avoid these issues.
>
> 3. I haven't been able to find a systematic way of figuring out the cascade, but to run benchmark tests on each CPU architecture and create an ordered list. These benchmarks indicate that there are better block size choices than the ones that can be computed by the rational in (http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf).
>
> 4. I never run (3) to build a cascade using different number of threads on SMP machines.
>
> 5. There is the need for proper introspection so that the product operates only on ublas::matrix(es) and not generally on matrix_expression(s)
>
> I am afraid I don't have time to fully realize the implementation at the moment, especially on how to deal elegantly with (1) and (2).
>
> As a first start Michael Lehn's algorithm could be implemented (and it is much much faster than the current simplistic implementation).
>
> Michael: Have you checked the performance against Blaze for a few nonsquare matrix multiplications?
>
>  Nasos
>
>
> On 01/20/2016 03:46 PM, Michael Lehn wrote:
>> So has anyone compared the performance yet?
>>
>>
>> On 20 Jan 2016, at 00:47, Joaquim Duran Comas <jdurancomas_at_[hidden]> wrote:
>>
>>> Hello,
>>>
>>> I think that some development about a new product algorithm was done in the branchhttps://github.com/uBLAS/ublas/tree/ublas_feature0004_fast_matrix_multiplication.
>>>
>>> Thanks and Best Regards,
>>> Joaquim Duran
>>>
>>>
>>> 20160119 15:41 GMT+01:00 Michael Lehn <michael.lehn_at_[hidden]>:
>>> Sorry for the inconvenience. I guess sending attachments to the mailing list is prohibited. I put the
>>> code on a website:
>>>
>>> http://apfel.mathematik.uniulm.de/~lehn/test_ublas/
>>>
>>>
>>>
>>> On 19 Jan 2016, at 15:14, palik imre <imre_palik_at_[hidden]> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> I cannot see any attachments ...
>>>>
>>>>
>>>> On Tuesday, 19 January 2016, 11:12, palik imre <imre_palik_at_[hidden]> wrote:
>>>>
>>>>
>>>> Is there a public git repo for ublas 2.0?
>>>>
>>>>
>>>> On Monday, 18 January 2016, 9:25, Oswin Krause <Oswin.Krause_at_[hidden]> wrote:
>>>>
>>>>
>>>> Hi Palik,
>>>>
>>>> this is a known problem. In your case you should already get better
>>>> performance when using axpy_prod instead of prod. There are currently
>>>> moves towards a ublas 2.0 which should make this a nonproblem in the
>>>> future.
>>>>
>>>>
>>>> On 20160117 21:23, palik imre wrote:
>>>> > Hi all,
>>>> >
>>>> > It seems that the matrix multiplication in ublas ends up with the
>>>> > trivial algorithm. On my machine, even the following function
>>>> > outperforms it for square matrices bigger than 173*173 (by a huge
>>>> > margin for matrices bigger than 190*190), while not performing
>>>> > considerably worse for smaller matrices:
>>>> >
>>>> > matrix<double>
>>>> > matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
>>>> > {
>>>> > assert(lhs.size2() == rhs.size1());
>>>> > matrix<double> rv(lhs.size1(), rhs.size2());
>>>> > matrix<double> r = trans(rhs);
>>>> > for (unsigned c = 0; c < rhs.size2(); c++)
>>>> > {
>>>> > matrix_column<matrix<double> > out(rv, c);
>>>> > matrix_row<matrix<double> > in(r, c);
>>>> > out = prod(lhs, in);
>>>> > }
>>>> > return rv;
>>>> > }
>>>> >
>>>> >
>>>> > Is there anybody working on improving the matrix multiplication
>>>> > performance?
>>>> >
>>>> > If not, then I can try to find some spare cycles ...
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Imre Palik
>>>>
>>>> > _______________________________________________
>>>> > ublas mailing list
>>>> > ublas_at_[hidden]
>>>> > http://lists.boost.org/mailman/listinfo.cgi/ublas
>>>> > Sent to: Oswin.Krause_at_[hidden]
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> ublas mailing list
>>>> ublas_at_[hidden]
>>>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>>>> Sent to: michael.lehn_at_[hidden]
>>>
>>>
>>> _______________________________________________
>>> ublas mailing list
>>> ublas_at_[hidden]
>>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>>> Sent to: jdurancomas_at_[hidden]
>>>
>>> _______________________________________________
>>> ublas mailing list
>>> ublas_at_[hidden]
>>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>>> Sent to: michael.lehn_at_[hidden]
>>
>>
>>
>> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden]
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to: athanasios.iliopoulos.ctr.gr_at_[hidden]
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: michael.lehn_at_[hidden]