Boost logo

Ublas :

Subject: Re: [ublas] Matrix multiplication performance
From: nasos (nasos_i_at_[hidden])
Date: 2016-01-27 10:18:02

We would definitely like to adopt an implementation like this. We need
though to make sure:

1. It is called only if both arguments are dense matrices, or one dense
matrix and one dense vector. We should disable any implicit conversions
to avoid surprises like converting a sparse matrix to a dense one

2. We should provide extensive benchmarks for a large range of matrix

3. We should setup benchmarks that find optimal sizes for the micro and
meso kernels. These benchmarks may be slow and we may want to find a way
to publish optimal sizes for various architectures. Maybe just include
them in a readme file.

4. We should provide an interface to compile with these optimal sizes.
Provide fallback when these are not provided

5. ublas is one of the fastest libraries for small matrices. Because I
think the most prevalent use of dense multiplication is on small
matrices ( geometric transformations etc.) we need to make sure this
stays that way.

Concerning 3, please note that my experience is that there is not a
single combination that is optimal for all matrix sizes and this is
reflected in the previous developments in the fast matrix multiplication
branch on github. I hope we can find a good compromise.

I will create a branch with the aim to create an implementation that
embodies 1-4 and uses Michael's work. I apologize for not having a lot
of time to invest in this endeavor atm but I think if I make an outline
of the integration requirements we can work it out through pull
requests. When things are setup I will be running benchmarks on various
architectures and compilers ( currently I have access to clang, gcc,
intel and msvc)


On 01/26/2016 07:39 PM, Michael Lehn wrote:
> On 23 Jan 2016, at 18:53, palik imre <imre_palik_at_[hidden]
> <mailto:imre_palik_at_[hidden]>> wrote:
>> Hi All,
>> what's next? I mean what is the development process for ublas?
>> Now we have a C-like implementation that outperforms both the
>> mainline, and the branch version (axpy_prod). What will we do with that?
>> As far as I see we have the following options:
>> 1) Create a C++ template magic implementation out of it. But for
>> this, at the least we would need compile-time access to the target
>> instruction set. Any idea how to do that?
>> 2) Create a compiled library implementation out of it, and choose the
>> implementation run-time based on the CPU capabilities.
>> 3) Include some good defaults/defines, and hope the user will use them.
>> 4) Don't include it, and do something completely different.
>> What do you think?
> At the moment I am not sure, but pretty sure, that you don’t have to
> rewrite uBLAS to support good performance. uBLAS is a pretty fine
> library and already has what you need. So just extend it. At least
> for 383 lines of code :-)
> As I was pretty busy the last days, so I could not continue until
> today. I made some minimal modifications to adopt the GEMM implementation
> to take advantage of uBLAS:
> - The GEMM frame algorithm and the pack functions now work with any
> uBLAS matrix that supports element access through the notation A(i,j)
> - Only for matrix C in the operation C <- beta*C + alpha*A*B it
> requires that the matrix is stored row-major or col-major. The other
> matrices can
> be expressions. Here I need some help to get rid of this ugly lines of
> code:
> TC *C_ = &C(0,0);
> const size_type incRowC = &C(1,0) - &C(0,0);
> const size_type incColC = &C(0,1) - &C(0,0);
> - Things like the following should work without performance penalty
> (and in particularly without creating temporaries):
> blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate)
> - And matrices A and B can be symmetric or whatever. Of course if A
> or B is triangular a specialized implementation can take advantage
> - The storage format does not matter. You can mix row-major and
> col-major, packed, …
> Here is the page for this:
> <>
> Please note, creating the benchmarks for the symmetric matrix-matrix
> product really takes long because the current uBLAS implementation seems
> to be much slower than for general matrices. So I reduced the problem
> size for now. It would be helpful if some people could reproduce the
> benchmarks
> and also check different cases:
> - different expressions
> - different element types, e.g. A with floats, B with double etc. At
> the moment there is only a micro kernel for double. The performance
> depends on the
> common type of A and B. So with complex<..> the reference
> implementation. But the performance should at least be constant in
> this case.
> It would in particular be nice to have benchmarks from people with an
> Intel Sandybridge or Haswell as the micro kernels are optimized for these
> architectures. If interested I can extend the benchmarks to compare
> with Intel MKL. For Linux there is a free non-commercial version
> available.
> Cheers,
> Michael
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> Sent to: athanasios.iliopoulos.ctr.gr_at_[hidden]