Hi Nasos,

Setting up the Xeon Phis is indeed an issue, especially because they are "locked" to icpc. OpenMP is working properly though.
First of all, I don't want to take credit that isn't mine, so I want to point out that this is not my algorithm. It is based on the approach used by BLIS.
For a few cores (4-8) it can easily be made multithreaded. For many-core processors like the Intel Xeon Phi this is a bit more sophisticated, but still not too hard.
If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive here, and Intel even more so. The demo I posted does not use micro-kernels that exploit SSE, AVX or FMA instructions. With those, the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
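To make the micro-kernel idea a bit more concrete, here is a minimal reference kernel in the BLIS style. This is only a sketch, not the actual ugemm from the demo; the name ugemm_ref, the parameters MR and NR, and the packing layout are assumptions for illustration:

#include <cstddef>

// Reference micro-kernel: computes C := beta*C + alpha*A*B for a small
// MR x NR block of C, where A and B are packed panels.
template <typename T, int MR, int NR>
void ugemm_ref(std::size_t kc, T alpha,
               const T *A,                  // packed panel, MR x kc (column by column)
               const T *B,                  // packed panel, kc x NR (row by row)
               T beta,
               T *C, std::size_t incRowC, std::size_t incColC)
{
    T AB[MR * NR] = {};                     // local accumulator

    // Accumulate AB += A(:,l) * B(l,:) as a sequence of rank-1 updates.
    for (std::size_t l = 0; l < kc; ++l)
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                AB[i + j * MR] += A[l * MR + i] * B[l * NR + j];

    // Scale and write back into the (strided) block of C.
    for (int j = 0; j < NR; ++j)
        for (int i = 0; i < MR; ++i) {
            T &c = C[i * incRowC + j * incColC];
            c = beta * c + alpha * AB[i + j * MR];
        }
}

A real micro-kernel would replace the inner loops with SSE/AVX/FMA intrinsics or assembly, which is exactly where the MKL-level performance comes from.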
I will try to find some time to spend on the code. Maybe you could help me to integrate your code into the benchmark example I posted above.
I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM. Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication).

About Blaze: Do they have their own implementation of a matrix-matrix product? It seems to require a tuned BLAS implementation (“Otherwise you get only poor performance”) for the matrix-matrix product.
IMHO they only have tuned the “easy” stuff like BLAS Level 1 and Level 2. In that case it makes more sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong, let me know.
About the block size: In my experience you get better performance if you choose them dynamically at runtime depending on the problem size. Depending on the architecture you can just specify ranges like 256-384 for the blocking factor MC. In my code it then also needs to satisfy the restriction that it is divisible by the factor MR. I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible, and having blocks of similar sizes easily pays off.
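For example, a runtime choice of MC could look roughly like this. This is only an illustration, not code from the demo; chooseMC and the default values 384 and 4 are made up:

#include <cstddef>

// Pick the blocking factor MC at run time: split the m dimension into blocks
// of nearly equal size, each a multiple of MR and close to the upper bound.
std::size_t chooseMC(std::size_t m, std::size_t mcMax = 384, std::size_t MR = 4)
{
    std::size_t numBlocks = (m + mcMax - 1) / mcMax;    // blocks needed, at most mcMax each
    std::size_t mc = (m + numBlocks - 1) / numBlocks;   // ceil(m / numBlocks)
    return ((mc + MR - 1) / MR) * MR;                   // round up to a multiple of MR
}

With m = 500 this gives two blocks of 252 and 248 instead of 384 and 116, which is what I mean by blocks of similar sizes.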
Cheers,
Michael
On 21 Jan 2016, at 22:41, nasos <nasos_i@hotmail.com> wrote:
Hello all,
The algorithm in this branch aims at cascading kernels and is by no means complete. Each kernel corresponds to a different size of the block matrix used in the inner loop of the multiplication. The multiplication is performed by picking the kernel that would provide the highest performance for the given matrix dimensions (I think this part has not been pushed to this branch).
It is considerably faster (last time I checked) than typical block algorithms that use fixed block sizes, especially for non-square cases. In square cases it is more or less the same as typical explicitly vectorized, block-based algorithms. This applies when comparing against most linalg libraries out there. It is also trivially parallelizable.
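To give an idea of what the cascade dispatch could look like, here is a rough sketch (this is not the code in the branch; KernelEntry and pickKernel are hypothetical names):

#include <cstddef>
#include <vector>

// Each kernel is specialised for one block shape; the cascade is an ordered
// list (e.g. built from per-architecture benchmarks), fastest entries first.
struct KernelEntry {
    std::size_t mr, nr;   // block shape the kernel is tuned for
    void (*kernel)(std::size_t k, const double *A, const double *B, double *C);
};

// Pick the first kernel whose block shape tiles the given dimensions exactly;
// the cascade is assumed to be non-empty and to end with a generic fallback.
const KernelEntry &pickKernel(const std::vector<KernelEntry> &cascade,
                              std::size_t m, std::size_t n)
{
    for (const KernelEntry &e : cascade)
        if (m % e.mr == 0 && n % e.nr == 0)
            return e;
    return cascade.back();
}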
There are some issues in fully bringing this into ublas, but they can probably be dealt with. Here are a few:
1. Different CPU architectures require different block sizes and different kernel cascade architectures. Hence there is the need for a facility to define these at compile time (a rough sketch follows below).
2. Statically cascading the kernels results in long compilation times and large executables. I haven't performed any benchmarks using dynamic block sizes, which would avoid these issues.
3. I haven't been able to find a systematic way of figuring out the cascade other than running benchmark tests on each CPU architecture and creating an ordered list. These benchmarks indicate that there are better block size choices than the ones computed by the rationale in the Goto paper (http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf).
4. I never ran (3) to build a cascade using different numbers of threads on SMP machines.
5. There is a need for proper introspection so that the product operates only on ublas::matrix(es) and not generally on matrix_expression(s).
I am afraid I don't have time to fully realize the implementation at the moment, especially when it comes to dealing elegantly with (1) and (2).
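Regarding (1), the compile-time facility could look roughly like the following sketch (hypothetical, not existing ublas code; BlockTraits and the concrete numbers are made up):

#include <cstddef>

// Per-architecture / per-type blocking parameters fixed at compile time.
template <typename T>
struct BlockTraits {
    static constexpr std::size_t MC = 256;   // cache blocking factors
    static constexpr std::size_t KC = 256;
    static constexpr std::size_t NC = 4096;
    static constexpr std::size_t MR = 4;     // register blocking used by the micro-kernel
    static constexpr std::size_t NR = 4;
};

// Example of a specialisation that an architecture/type combination could select.
template <>
struct BlockTraits<float> {
    static constexpr std::size_t MC = 384, KC = 384, NC = 4096;
    static constexpr std::size_t MR = 8,  NR = 4;
};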
As a first step, Michael Lehn's algorithm could be implemented (it is much, much faster than the current simplistic implementation).
Michael: Have you checked the performance against Blaze for a few non-square matrix multiplications?
- Nasos
On 01/20/2016 03:46 PM, Michael Lehn wrote:
So has anyone compared the performance yet?
On 20 Jan 2016, at 00:47, Joaquim Duran Comas <jdurancomas@gmail.com> wrote:
Hello,
I think that some development of a new product algorithm was done in the branch https://github.com/uBLAS/ublas/tree/ublas_feature0004_fast_matrix_multiplication.
Thanks and Best Regards,
Joaquim Duran
2016-01-19 15:41 GMT+01:00 Michael Lehn <michael.lehn@uni-ulm.de>:
Sorry for the inconvenience. I guess sending attachments to the mailing list is prohibited. I put the code on a website:
http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/
On 19 Jan 2016, at 15:14, palik imre <imre_palik@yahoo.co.uk> wrote:
Hi Michael,
I cannot see any attachments ...
Is there a public git repo for ublas 2.0?
Hi Palik,
this is a known problem. In your case you should already get better
performance when using axpy_prod instead of prod. There are currently
moves towards a ublas 2.0 which should make this a non-problem in the
future.
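As a quick example of the axpy_prod suggestion (matrix sizes and fill values here are only for illustration):

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/operation.hpp>

int main()
{
    namespace ublas = boost::numeric::ublas;

    ublas::matrix<double> A(500, 500), B(500, 500), C(500, 500);
    for (unsigned i = 0; i < A.size1(); ++i)
        for (unsigned j = 0; j < A.size2(); ++j) {
            A(i, j) = 1.0;
            B(i, j) = 2.0;
        }

    // C = ublas::prod(A, B);        // naive product, slow for large matrices
    ublas::axpy_prod(A, B, C, true); // 'true' zeroes C before accumulating A*B
    return 0;
}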
On 2016-01-17 21:23, palik imre wrote:
> Hi all,
>
> It seems that the matrix multiplication in ublas ends up with the
> trivial algorithm. On my machine, even the following function
> outperforms it for square matrices bigger than 173*173 (by a huge
> margin for matrices bigger than 190*190), while not performing
> considerably worse for smaller matrices:
>
> matrix<double>
> matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
> {
>     assert(lhs.size2() == rhs.size1());
>     matrix<double> rv(lhs.size1(), rhs.size2());
>     matrix<double> r = trans(rhs);  // transpose once, so columns of rhs become contiguous rows
>     for (unsigned c = 0; c < rhs.size2(); c++)
>     {
>         matrix_column<matrix<double> > out(rv, c);  // c-th column of the result
>         matrix_row<matrix<double> > in(r, c);       // c-th column of rhs, stored as a row of r
>         out = prod(lhs, in);                        // one matrix-vector product per result column
>     }
>     return rv;
> }
>
>
> Is there anybody working on improving the matrix multiplication
> performance?
>
> If not, then I can try to find some spare cycles ...
>
> Cheers,
>
> Imre Palik