Subject: Re: [ublas] Matrix multiplication performance
From: Joaquim Duran Comas (jdurancomas_at_[hidden])
Date: 2016-01-27 20:09:10
Referring to point 1:
1. It is called only if both arguments are dense matrices, or one dense
matrix and one dense vector. We should disable any implicit conversions to
avoid surprises like converting a sparse matrix to a dense one
I think it should be possible to partially uncompress sparse matrices into
small buffers, use SIMD for the calculations, and then assign the result to
the result matrix. A rough sketch of this idea follows.
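
Just to make the idea concrete, here is a minimal sketch (the helper name
pack_block and the block sizes mc/kc are made up for illustration) that
copies a block of a compressed matrix into a dense buffer where a SIMD
kernel could then operate:

#include <algorithm>
#include <cstddef>
#include <boost/numeric/ublas/matrix_sparse.hpp>

namespace ublas = boost::numeric::ublas;

// Copy the block A(i0..i0+mc-1, j0..j0+kc-1) of a sparse matrix into a
// dense, contiguous row-major buffer, visiting only the stored non-zeros.
template <typename T>
void pack_block(const ublas::compressed_matrix<T> &A,
                std::size_t i0, std::size_t j0,
                std::size_t mc, std::size_t kc,
                T *buffer)
{
    // Most entries of a sparse block are zero, so zero the buffer first.
    std::fill(buffer, buffer + mc * kc, T(0));

    for (auto it1 = A.begin1(); it1 != A.end1(); ++it1) {
        const std::size_t i = it1.index1();
        if (i < i0 || i >= i0 + mc)
            continue;                       // row outside the block
        for (auto it2 = it1.begin(); it2 != it1.end(); ++it2) {
            const std::size_t j = it2.index2();
            if (j >= j0 && j < j0 + kc)     // column inside the block
                buffer[(i - i0) * kc + (j - j0)] = *it2;
        }
    }
}
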
Joaquim Duran
2016-01-27 16:18 GMT+01:00 nasos <nasos_i_at_[hidden]>:
> We would definitely like to adopt an implementation like this. We need,
> though, to make sure that:
>
> 1. It is called only if both arguments are dense matrices, or one dense
> matrix and one dense vector. We should disable any implicit conversions to
> avoid surprises like converting a sparse matrix to a dense one (a dispatch
> sketch follows this list).
>
> 2. We should provide extensive benchmarks for a large range of matrix
> sizes.
>
> 3. We should set up benchmarks that find optimal sizes for the micro and
> meso kernels. These benchmarks may be slow, and we may want to find a way
> to publish optimal sizes for various architectures. Maybe just include
> them in a readme file.
>
> 4. We should provide an interface to compile with these optimal sizes,
> with a fallback when they are not provided.
>
> 5. ublas is one of the fastest libraries for small matrices. Because I
> think the most prevalent use of dense multiplication is on small matrices
> (geometric transformations etc.), we need to make sure it stays that way.
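>
> Regarding point 1, a minimal sketch of such a gate, assuming we can key
> on uBLAS's storage_category typedef and dense_tag (the trait and the
> function name fast_gemm are illustrative only):
>
> #include <type_traits>
> #include <boost/numeric/ublas/matrix.hpp>
>
> namespace ublas = boost::numeric::ublas;
>
> // The fast overload participates in overload resolution only when all
> // operands are dense containers, so a sparse argument can never be
> // converted implicitly.
> template <typename M>
> using is_dense = std::is_same<typename M::storage_category,
>                               ublas::dense_tag>;
>
> template <typename MA, typename MB, typename MC>
> typename std::enable_if<is_dense<MA>::value && is_dense<MB>::value &&
>                         is_dense<MC>::value>::type
> fast_gemm(const MA &A, const MB &B, MC &C)
> {
>     // ... blocked SIMD implementation would go here ...
> }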
>
> Concerning 3, please note that in my experience there is no single
> combination that is optimal for all matrix sizes; this is reflected in the
> previous developments in the fast matrix multiplication branch on github.
> I hope we can find a good compromise.
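>
> One possible scheme for points 3 and 4, just as a sketch (the macro
> names are made up): take the blocking parameters from the build and fall
> back to conservative defaults when none are given.
>
> #include <cstddef>
>
> // A benchmark run can determine good values per architecture and pass
> // them on the compiler command line, e.g. -DUBLAS_GEMM_MC=384;
> // otherwise the fallbacks below apply.
> #ifndef UBLAS_GEMM_MC
> #define UBLAS_GEMM_MC 256   // rows of the packed A block
> #endif
> #ifndef UBLAS_GEMM_KC
> #define UBLAS_GEMM_KC 512   // shared dimension of the packed blocks
> #endif
> #ifndef UBLAS_GEMM_NC
> #define UBLAS_GEMM_NC 4096  // columns of the packed B block
> #endif
>
> constexpr std::size_t MC = UBLAS_GEMM_MC;
> constexpr std::size_t KC = UBLAS_GEMM_KC;
> constexpr std::size_t NC = UBLAS_GEMM_NC;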
>
> I will create a branch with the aim of producing an implementation that
> embodies 1-4 and uses Michael's work. I apologize for not having a lot of
> time to invest in this endeavor at the moment, but I think if I make an
> outline of the integration requirements we can work it out through pull
> requests. When things are set up I will be running benchmarks on various
> architectures and compilers (currently I have access to clang, gcc, intel
> and msvc).
>
> -Nasos
>
>
>
> On 01/26/2016 07:39 PM, Michael Lehn wrote:
>
>
> On 23 Jan 2016, at 18:53, palik imre <imre_palik_at_[hidden]> wrote:
>
> Hi All,
>
> what's next? I mean what is the development process for ublas?
>
> Now we have a C-like implementation that outperforms both the mainline
> and the branch version (axpy_prod). What will we do with that?
>
> As far as I see we have the following options:
>
> 1) Create a C++ template magic implementation out of it. But for this, at
> the least we would need compile-time access to the target instruction set.
> Any idea how to do that? (One possibility is sketched after this list.)
>
> 2) Create a compiled library implementation out of it, and choose the
> implementation run-time based on the CPU capabilities.
>
> 3) Include some good defaults/defines, and hope the user will use them.
>
> 4) Don't include it, and do something completely different.
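>
> Regarding 1), most compilers expose the target instruction set through
> predefined macros, so the micro kernel could be selected at compile time.
> A minimal sketch (the kernel type names are made up):
>
> // Hypothetical kernel tags; real kernels would implement the micro GEMM.
> struct avx_kernel {};
> struct sse2_kernel {};
> struct reference_kernel {};
>
> // __AVX__ and __SSE2__ are predefined by gcc, clang and icc when
> // -mavx / -msse2 are in effect; MSVC defines __AVX__ under /arch:AVX
> // and implies SSE2 on x64 (_M_X64).
> #if defined(__AVX__)
> using micro_kernel = avx_kernel;        // 256-bit SIMD path
> #elif defined(__SSE2__) || defined(_M_X64)
> using micro_kernel = sse2_kernel;       // 128-bit SIMD path
> #else
> using micro_kernel = reference_kernel;  // portable fallback
> #endif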
>
>
> What do you think?
>
>
>
> At the moment I am not sure, but pretty sure, that you don't have to
> rewrite uBLAS to support good performance. uBLAS is a pretty fine
> library and already has what you need. So just extend it. At least for
> 383 lines of code :-)
>
> I was pretty busy the last few days, so I could not continue until today.
> I made some minimal modifications to adapt the GEMM implementation
> to take advantage of uBLAS:
>
> - The GEMM frame algorithm and the pack functions now work with any uBLAS
> matrix that supports element access through the notation A(i,j)
> - Only matrix C in the operation C <- beta*C + alpha*A*B is required to
> be stored row-major or col-major. The other matrices can be expressions.
> Here I need some help to get rid of these ugly lines of code:
>
> TC *C_ = &C(0,0);                            // pointer to first element of C
> const size_type incRowC = &C(1,0) - &C(0,0); // stride between rows
> const size_type incColC = &C(0,1) - &C(0,0); // stride between columns
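>
> One possible replacement, only a sketch and assuming C is a plain dense
> ublas::matrix (the helper name strides is made up): derive the increments
> from the orientation tag instead of taking element addresses.
>
> #include <cstddef>
> #include <boost/numeric/ublas/matrix.hpp>
>
> namespace ublas = boost::numeric::ublas;
>
> // Row-major: consecutive rows are size2() elements apart.
> template <typename T>
> void strides(const ublas::matrix<T, ublas::row_major> &C,
>              std::size_t &incRow, std::size_t &incCol)
> {
>     incRow = C.size2();
>     incCol = 1;
> }
>
> // Column-major: consecutive columns are size1() elements apart.
> template <typename T>
> void strides(const ublas::matrix<T, ublas::column_major> &C,
>              std::size_t &incRow, std::size_t &incCol)
> {
>     incRow = 1;
>     incCol = C.size1();
> }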
>
> - Things like the following should work without performance penalty (and
> in particular without creating temporaries):
> blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate)
>
> - And matrices A and B can be symmetric or whatever. Of course, if A or B
> is triangular, a specialized implementation can take advantage of that.
> - The storage format does not matter. You can mix row-major and
> col-major, packed, ...
>
> Here is the page for this:
>
> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session4/page01.html
>
> Please note, creating the benchmarks for the symmetric matrix-matrix
> product really takes a long time because the current uBLAS implementation
> seems to be much slower than for general matrices. So I reduced the
> problem size for now. It would be helpful if some people could reproduce
> the benchmarks (a small timing sketch follows the list below) and also
> check different cases:
>
> - different expressions
> - different element types, e.g. A with float, B with double, etc. At the
> moment there is only a micro kernel for double, and the performance
> depends on the common type of A and B. So with complex<..> the reference
> implementation is used, but the performance should at least be constant
> in this case.
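>
> For reproducing the benchmarks, a minimal timing harness could look like
> this (just a sketch; it measures the throughput of axpy_prod in GFLOPS
> for square matrices):
>
> #include <chrono>
> #include <cstddef>
> #include <iostream>
> #include <boost/numeric/ublas/matrix.hpp>
> #include <boost/numeric/ublas/operation.hpp>
>
> namespace ublas = boost::numeric::ublas;
>
> int main()
> {
>     for (std::size_t n = 100; n <= 1000; n += 100) {
>         ublas::matrix<double> A = ublas::scalar_matrix<double>(n, n, 1.0);
>         ublas::matrix<double> B = ublas::scalar_matrix<double>(n, n, 1.0);
>         ublas::matrix<double> C(n, n);
>
>         auto t0 = std::chrono::high_resolution_clock::now();
>         ublas::axpy_prod(A, B, C, true);   // C = A * B
>         auto t1 = std::chrono::high_resolution_clock::now();
>
>         double s = std::chrono::duration<double>(t1 - t0).count();
>         // A general matrix-matrix product performs about 2*n^3 flops.
>         std::cout << n << "  " << 2.0 * n * n * n / s * 1e-9
>                   << " GFLOPS\n";
>     }
> }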
>
> It would in particular be nice to have benchmarks from people with an
> Intel Sandybridge or Haswell as the micro kernels are optimized for these
> architectures. If interested I can extend the benchmarks to compare with
> Intel MKL. For Linux there is a free non-commercial version available.
>
>
> Cheers,
>
> Michael
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: jdurancomas_at_[hidden]
>