Subject: Re: [ublas] Matrix multiplication performance
From: nasos (nasos_i_at_[hidden])
Date: 2016-01-21 18:28:36


Michael,
please see below

On 01/21/2016 05:23 PM, Michael Lehn wrote:
> Hi Nasos,
>
> first of all I don’t want to take credit that isn’t mine and want to point out
> that this is not my algorithm. It is based on
>
> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>
> https://github.com/flame/blis
>
> For a few cores (4-8) it can easily be made multithreaded. For
> many-core processors like the Intel Xeon Phi this is a bit more
> sophisticated, but still not too hard.
Setting up Xeon Phis is indeed an issue, especially because they are
"locked" to icpc. OpenMP is working properly, though.
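
For illustration, here is a minimal sketch of the "easy" multithreading
Michael mentions for a few cores: split the outermost loop over column
panels among OpenMP threads. The panel width NC and the helper
multiply_panel() are assumptions made up for this sketch, not names from
Michael's code.

    #include <algorithm>
    #include <cstddef>

    // Hypothetical macro-kernel computing the column panel
    // C(:, j .. j+nc-1) += A * B(:, j .. j+nc-1); defined elsewhere.
    void multiply_panel(std::size_t j, std::size_t nc);

    void gemm_mt(std::size_t n, std::size_t NC = 4096)
    {
        // Each thread works on independent column panels of C, so no
        // synchronization is needed inside the loop body.
        #pragma omp parallel for schedule(dynamic)
        for (std::size_t j = 0; j < n; j += NC)
            multiply_panel(j, std::min(NC, n - j));
    }

As Michael notes, on a many-core Xeon Phi a single parallel loop level is
usually not enough.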

> The demo I posted does not use micro-kernels that exploit SSE, AVX or
> FMA instructions. With those, the matrix product is on par with Intel
> MKL. Just like BLIS. For my platforms I wrote
> my own micro-kernels, but the interface of the function ugemm is
> compatible with BLIS.
>
If you compile with -O3 I think you are getting near-optimal SSE
vectorization. gcc is truly impressive, and the Intel compiler even more so.
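
For readers who haven't looked at the BLIS paper, here is a rough,
unoptimized plain-C++ sketch of what such an MR x NR micro-kernel does;
the exact ugemm signature in Michael's code may differ, and a real kernel
replaces the loops with SSE/AVX/FMA intrinsics.

    #include <cstddef>

    // C (an MR x NR block with general strides) <- beta*C + alpha*(A~ * B~),
    // where A~ (MR x kc) and B~ (kc x NR) are packed, contiguous panels.
    template <typename T, int MR, int NR>
    void ugemm(std::size_t kc, T alpha, const T *A, const T *B,
               T beta, T *C, std::size_t incRowC, std::size_t incColC)
    {
        T AB[MR * NR] = {};                     // accumulator ("registers")
        for (std::size_t l = 0; l < kc; ++l)    // sequence of rank-1 updates
            for (int j = 0; j < NR; ++j)
                for (int i = 0; i < MR; ++i)
                    AB[i + j * MR] += A[l * MR + i] * B[l * NR + j];
        for (int j = 0; j < NR; ++j)            // scale C and add the result
            for (int i = 0; i < MR; ++i) {
                T &c = C[i * incRowC + j * incColC];
                c = (beta == T(0) ? T(0) : beta * c) + alpha * AB[i + j * MR];
            }
    }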
> Maybe you could help me integrate your code into the benchmark
> example I posted above.
>
I will try to find some time to spend on the code.
> About Blaze: Do they have their own implementation of a matrix-matrix
> product? It seems to require a
> tuned BLAS implementation (“Otherwise you get only poor performance”)
> for the matrix-matrix product.
I will check the benchmarks I ran. I think I was using MKL with Blaze,
but Blaze takes it a step further (I am not sure how) and gets
better performance than the underlying GEMM. Their benchmarks
indicate that they are faster than MKL
(https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication)
> IMHO they only have tuned the “easy” stuff like BLAS Level1 and
> Level2. In that case it makes more
> sense to compare the performance with the actual underlying GEMM
> implementation. But if I am wrong,
> let me know.
>
> About the block sizes: In my experience you get better performance if
> you choose them dynamically at runtime
> depending on the problem size. Depending on the architecture you can
> just specify ranges like 256 - 384 for
> the blocking factor MC. In my code it then also needs to satisfy the
> restriction that it is divisible by the factor MR.
> I know that doing things at compile time is a C++ fetish. But the
> runtime overhead is negligible and having
> blocks of similar sizes easily pays off.
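
As an illustration of that scheme, choosing MC at runtime could look
roughly like the sketch below. The function name, the default range and
the rounding strategy are assumptions made for this example, not code
from Michael's implementation.

    #include <cstddef>

    // Pick MC in a per-architecture range [MC_MIN, MC_MAX] so that the m
    // dimension splits into panels of nearly equal size, rounded up to a
    // multiple of the register-blocking factor MR.
    std::size_t choose_mc(std::size_t m,
                          std::size_t MC_MIN = 256, std::size_t MC_MAX = 384,
                          std::size_t MR = 4)
    {
        std::size_t panels = (m + MC_MAX - 1) / MC_MAX; // fewest panels that fit
        std::size_t mc = (m + panels - 1) / panels;     // nearly equal panels
        mc = ((mc + MR - 1) / MR) * MR;                 // divisible by MR
        if (mc < MC_MIN) mc = MC_MIN;
        return mc;
    }

For example, m = 500 then gives panels of 256 and 244 rows instead of a
384-row panel followed by a 116-row remainder.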
>
> Cheers,
>
> Michael
>
>
> On 21 Jan 2016, at 22:41, nasos <nasos_i_at_[hidden]> wrote:
>
>> Hello all,
>> The algorithm in this branch aims at cascading kernels and is by
>> no means complete. Each kernel represents a different size of the block
>> matrix used in the inner loop of the multiplication. The
>> multiplication is performed by picking the kernel that would provide
>> the highest performance for the given matrix dimensions (I think
>> this has not been pushed to this branch).
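
A minimal sketch of what such a cascade/dispatch could look like is shown
below; the kernel name gemm_blocked, the block sizes and the shape
thresholds are invented for illustration, whereas the actual cascade in
the branch is derived from benchmark data.

    #include <cstddef>

    // Hypothetical fixed-block kernel; the template parameters are the
    // block sizes used in the blocked multiplication. Defined elsewhere.
    template <std::size_t MC, std::size_t NC, std::size_t KC, typename M>
    void gemm_blocked(const M &A, const M &B, M &C);

    // Hypothetical dispatch: pick a kernel from the problem shape.
    template <typename M>
    void gemm_dispatch(const M &A, const M &B, M &C)
    {
        const std::size_t m = A.size1(), n = B.size2(), k = A.size2();
        if (m >= 512 && n >= 512 && k >= 512)
            gemm_blocked<256, 256, 64>(A, B, C);  // large, near-square case
        else if (n >= 8 * m || m >= 8 * n)
            gemm_blocked<64, 512, 64>(A, B, C);   // strongly non-square case
        else
            gemm_blocked<128, 128, 64>(A, B, C);  // small / medium case
    }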
>>
>> It is considerably faster (last time I checked) than typical block
>> algorithms that use fixed block sizes, especially for
>> non-square cases. In square cases it is more or less on par with
>> typical explicitly vectorized, block-based algorithms. This
>> applies when comparing to most linalg libraries out there. It is also
>> trivially parallelizable.
>>
>> There are some issues in fully bringing this into ublas, but they can
>> probably be dealt with. Here are a few:
>>
>> 1. Different CPU architectures require different block sizes and
>> different kernel cascade architectures. Hence there is the need for a
>> facility to define these at compile time (see the sketch after this list).
>>
>> 2. Statically cascading the kernels ends in long compilation times
>> and large executables. I haven't performed any benchmarks using
>> dynamic block sizes, which would avoid these issues.
>>
>> 3. I haven't been able to find a systematic way of figuring out the
>> cascade other than running benchmark tests on each CPU architecture
>> and creating an ordered list. These benchmarks indicate that there are
>> better block size choices than the ones that can be computed by the
>> rationale in
>> (http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf).
>>
>>
>> 4. I never ran (3) to build a cascade using different numbers of
>> threads on SMP machines.
>>
>> 5. There is a need for proper introspection so that the product
>> operates only on ublas::matrix(es) and not generally on
>> matrix_expression(s).
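
Regarding (1), a minimal sketch of such a compile-time facility is given
below; all macro names, defaults and the struct name are hypothetical, not
taken from the branch. Block sizes get per-architecture defaults that can
be overridden from the build line (e.g. -DUBLAS_GEMM_MC=384).

    #include <cstddef>

    #ifndef UBLAS_GEMM_MC
    #  define UBLAS_GEMM_MC 256
    #endif
    #ifndef UBLAS_GEMM_KC
    #  define UBLAS_GEMM_KC 128
    #endif
    #ifndef UBLAS_GEMM_NC
    #  define UBLAS_GEMM_NC 4096
    #endif

    // Carries the blocking factors as compile-time constants so that the
    // kernels can be instantiated with them as template arguments.
    template <std::size_t MC = UBLAS_GEMM_MC,
              std::size_t KC = UBLAS_GEMM_KC,
              std::size_t NC = UBLAS_GEMM_NC>
    struct gemm_blocking
    {
        static const std::size_t mc = MC, kc = KC, nc = NC;
    };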
>>
>> I am afraid I don't have time to fully realize the implementation at
>> the moment, especially to work out how to deal elegantly with (1) and (2).
>>
>> As a first step, Michael Lehn's algorithm could be implemented (and
>> it is much, much faster than the current simplistic implementation).
>>
>> Michael: Have you checked the performance against Blaze for a few
>> non-square matrix multiplications?
>>
>> - Nasos
>>
>>
>> On 01/20/2016 03:46 PM, Michael Lehn wrote:
>>> So has anyone compared the performance yet?
>>>
>>>
>>> On 20 Jan 2016, at 00:47, Joaquim Duran Comas
>>> <jdurancomas_at_[hidden]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I think that some development of a new product algorithm was
>>>> done in the branch
>>>> https://github.com/uBLAS/ublas/tree/ublas_feature0004_fast_matrix_multiplication.
>>>>
>>>> Thanks and Best Regards,
>>>> Joaquim Duran
>>>>
>>>>
>>>> 2016-01-19 15:41 GMT+01:00 Michael Lehn <michael.lehn_at_[hidden]>:
>>>>
>>>> Sorry for the inconvenience. I guess sending attachments to
>>>> the mailing list is prohibited. I put the
>>>> code on a website:
>>>>
>>>> http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/
>>>>
>>>>
>>>>
>>>> On 19 Jan 2016, at 15:14, palik imre <imre_palik_at_[hidden]> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> I cannot see any attachments ...
>>>>>
>>>>>
>>>>> On Tuesday, 19 January 2016, 11:12, palik imre
>>>>> <imre_palik_at_[hidden]> wrote:
>>>>>
>>>>>
>>>>> Is there a public git repo for ublas 2.0?
>>>>>
>>>>>
>>>>> On Monday, 18 January 2016, 9:25, Oswin Krause
>>>>> <Oswin.Krause_at_[hidden]> wrote:
>>>>>
>>>>>
>>>>> Hi Palik,
>>>>>
>>>>> this is a known problem. In your case you should already get better
>>>>> performance by using axpy_prod instead of prod. There are currently
>>>>> moves towards a ublas 2.0, which should make this a non-problem in
>>>>> the future.
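
For reference, a minimal example of the axpy_prod call Oswin mentions;
the matrix names and sizes are placeholders.

    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/operation.hpp>

    int main()
    {
        using namespace boost::numeric::ublas;
        matrix<double> A(500, 500), B(500, 500), C(500, 500);
        // ... fill A and B ...
        // With 'true' the target C is zeroed first, i.e. C = A * B;
        // with 'false' the product is accumulated, i.e. C += A * B.
        axpy_prod(A, B, C, true);
    }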
>>>>>
>>>>>
>>>>> On 2016-01-17 21:23, palik imre wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > It seems that the matrix multiplication in ublas ends up
>>>>> with the
>>>>> > trivial algorithm. On my machine, even the following function
>>>>> > outperforms it for square matrices bigger than 173*173 (by a
>>>>> huge
>>>>> > margin for matrices bigger than 190*190), while not performing
>>>>> > considerably worse for smaller matrices:
>>>>> >
>>>>> > matrix<double>
>>>>> > matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
>>>>> > {
>>>>> >     assert(lhs.size2() == rhs.size1());
>>>>> >     matrix<double> rv(lhs.size1(), rhs.size2());
>>>>> >     matrix<double> r = trans(rhs);
>>>>> >     for (unsigned c = 0; c < rhs.size2(); c++)
>>>>> >     {
>>>>> >         matrix_column<matrix<double> > out(rv, c);
>>>>> >         matrix_row<matrix<double> > in(r, c);
>>>>> >         out = prod(lhs, in);
>>>>> >     }
>>>>> >     return rv;
>>>>> > }
>>>>> >
>>>>> >
>>>>> > Is there anybody working on improving the matrix multiplication
>>>>> > performance?
>>>>> >
>>>>> > If not, then I can try to find some spare cycles ...
>>>>> >
>>>>> > Cheers,
>>>>> >
>>>>> > Imre Palik