
Ublas : 
Subject: Re: [ublas] Matrix multiplication performance
From: nasos (nasos_i_at_[hidden])
Date: 20160121 16:41:10
Hello all,
The algorithm in this branch is aiming at cascading kernels and is by no
means complete. Each kernel represents various sizes of the block matrix
used in the inner loop of the multiplication. The multiplication is
performed by picking the kernel that would provide the highest
performance for certain matrices dimensions ( I think this has not been
pushed in this branch).
It is considerably faster (last time I checked) from typical block
algorithms that are using fixed size block sizes, especially for
nonsquare cases. In square cases it is more or less the same as typical
 explicit vectorization  block based  algorithms. This applies when
comparing to most linalg libraries out there. It is also trivially
paralellizable.
There are some issues in fully bringing this into ublas, but they can be
probably dealt with. Here are a few:
1. Different CPU architectures require different block sizes and
different kernel cascade architectures. Hence there is the need for a
facility to define these at compile time.
2. Statically cascading the kernels end into large compilation times and
large executables. I haven't performed any benchmarks using dynamic
block sizes to avoid these issues.
3. I haven't been able to find a systematic way of figuring out the
cascade, but to run benchmark tests on each CPU architecture and create
an ordered list. These benchmarks indicate that there are better block
size choices than the ones that can be computed by the rational in
(http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf).
4. I never run (3) to build a cascade using different number of threads
on SMP machines.
5. There is the need for proper introspection so that the product
operates only on ublas::matrix(es) and not generally on matrix_expression(s)
I am afraid I don't have time to fully realize the implementation at the
moment, especially on how to deal elegantly with (1) and (2).
As a first start Michael Lehn's algorithm could be implemented (and it
is much much faster than the current simplistic implementation).
Michael: Have you checked the performance against Blaze for a few
nonsquare matrix multiplications?
 Nasos
On 01/20/2016 03:46 PM, Michael Lehn wrote:
> So has anyone compared the performance yet?
>
>
> On 20 Jan 2016, at 00:47, Joaquim Duran Comas <jdurancomas_at_[hidden]
> <mailto:jdurancomas_at_[hidden]>> wrote:
>
>> Hello,
>>
>> I think that some development about a new product algorithm was done
>> in the
>> branchhttps://github.com/uBLAS/ublas/tree/ublas_feature0004_fast_matrix_multiplication.
>>
>> Thanks and Best Regards,
>> Joaquim Duran
>>
>>
>> 20160119 15:41 GMT+01:00 Michael Lehn<michael.lehn_at_[hidden]
>> <mailto:michael.lehn_at_[hidden]>>:
>>
>> Sorry for the inconvenience. I guess sending attachments to the
>> mailing list is prohibited. I put the
>> code on a website:
>>
>> http://apfel.mathematik.uniulm.de/~lehn/test_ublas/
>> <http://apfel.mathematik.uniulm.de/%7Elehn/test_ublas/>
>>
>>
>>
>> On 19 Jan 2016, at 15:14, palik imre <imre_palik_at_[hidden]
>> <mailto:imre_palik_at_[hidden]>> wrote:
>>
>>> Hi Michael,
>>>
>>> I cannot see any attachments ...
>>>
>>>
>>> On Tuesday, 19 January 2016, 11:12, palik imre
>>> <imre_palik_at_[hidden] <mailto:imre_palik_at_[hidden]>> wrote:
>>>
>>>
>>> Is there a public git repo for ublas 2.0?
>>>
>>>
>>> On Monday, 18 January 2016, 9:25, Oswin Krause
>>> <Oswin.Krause_at_[hidden]
>>> <mailto:Oswin.Krause_at_[hidden]>> wrote:
>>>
>>>
>>> Hi Palik,
>>>
>>> this is a known problem. In your case you should already get better
>>> performance when using axpy_prod instead of prod. There are
>>> currently
>>> moves towards a ublas 2.0 which should make this a nonproblem
>>> in the
>>> future.
>>>
>>>
>>> On 20160117 21:23, palik imre wrote:
>>> > Hi all,
>>> >
>>> > It seems that the matrix multiplication in ublas ends up with the
>>> > trivial algorithm. On my machine, even the following function
>>> > outperforms it for square matrices bigger than 173*173 (by a huge
>>> > margin for matrices bigger than 190*190), while not performing
>>> > considerably worse for smaller matrices:
>>> >
>>> > matrix<double>
>>> > matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
>>> > {
>>> > assert(lhs.size2() == rhs.size1());
>>> > matrix<double> rv(lhs.size1(), rhs.size2());
>>> > matrix<double> r = trans(rhs);
>>> > for (unsigned c = 0; c < rhs.size2(); c++)
>>> > {
>>> > matrix_column<matrix<double> > out(rv, c);
>>> > matrix_row<matrix<double> > in(r, c);
>>> > out = prod(lhs, in);
>>> > }
>>> > return rv;
>>> > }
>>> >
>>> >
>>> > Is there anybody working on improving the matrix multiplication
>>> > performance?
>>> >
>>> > If not, then I can try to find some spare cycles ...
>>> >
>>> > Cheers,
>>> >
>>> > Imre Palik
>>>
>>> > _______________________________________________
>>> > ublas mailing list
>>> >ublas_at_[hidden] <mailto:ublas_at_[hidden]>
>>> >http://lists.boost.org/mailman/listinfo.cgi/ublas
>>> > Sent to:Oswin.Krause_at_[hidden]
>>> <mailto:Oswin.Krause_at_[hidden]>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> ublas mailing list
>>> ublas_at_[hidden] <mailto:ublas_at_[hidden]>
>>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>>> Sent to:michael.lehn_at_[hidden] <mailto:michael.lehn_at_[hidden]>
>>
>>
>> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden] <mailto:ublas_at_[hidden]>
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to:jdurancomas_at_[hidden] <mailto:jdurancomas_at_[hidden]>
>>
>>
>> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden] <mailto:ublas_at_[hidden]>
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to:michael.lehn_at_[hidden] <mailto:michael.lehn_at_[hidden]>
>
>
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: athanasios.iliopoulos.ctr.gr_at_[hidden]