Hello all,
The algorithm in this branch is aiming at cascading kernels and is by no means complete. Each kernel represents various sizes of the block matrix used in the inner loop of the multiplication. The multiplication is performed by picking the kernel that would provide the highest performance for certain matrices dimensions ( I think this has not been pushed in this branch).

It is considerably faster (last time I checked) from typical block algorithms that are using fixed size block sizes, especially for non-square cases. In square cases it is more or less the same as typical - explicit vectorization - block based - algorithms. This applies when comparing to most linalg libraries out there. It is also trivially paralellizable.

There are some issues in fully bringing this into ublas, but they can be probably dealt with. Here are a few:

1. Different CPU architectures require different block sizes and different kernel cascade architectures. Hence there is the need for a facility to define these at compile time.

2. Statically cascading the kernels end into large compilation times and large executables. I haven't performed any benchmarks using dynamic block sizes to avoid these issues.

3. I haven't been able to find a systematic way of figuring out the cascade, but to run benchmark tests on each CPU architecture and create an ordered list. These benchmarks indicate that there are better block size choices than the ones that can be computed by the rational in (http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf).    

4. I never run (3) to build a cascade using different number of threads on SMP machines. 
 
5. There is the need for proper introspection so that the product operates only on ublas::matrix(es) and not generally on matrix_expression(s)

I am afraid I don't have time to fully realize the implementation at the moment, especially on how to deal elegantly with (1) and (2).

As a first start Michael Lehn's algorithm could be implemented (and it is much much faster than the current simplistic implementation).

Michael: Have you checked the performance against Blaze for a few non-square matrix multiplications?

- Nasos


On 01/20/2016 03:46 PM, Michael Lehn wrote:
So has anyone compared the performance yet?


On 20 Jan 2016, at 00:47, Joaquim Duran Comas <jdurancomas@gmail.com> wrote:

Hello,

I think that some development about a new product algorithm was done in the branchhttps://github.com/uBLAS/ublas/tree/ublas_feature0004_fast_matrix_multiplication.

Thanks and Best Regards,
Joaquim Duran


2016-01-19 15:41 GMT+01:00 Michael Lehn <michael.lehn@uni-ulm.de>:
Sorry for the inconvenience.  I guess sending attachments to the mailing list is prohibited.  I put the
code on a website:

http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/



On 19 Jan 2016, at 15:14, palik imre <imre_palik@yahoo.co.uk> wrote:

Hi Michael,

I cannot see any attachments  ...


On Tuesday, 19 January 2016, 11:12, palik imre <imre_palik@yahoo.co.uk> wrote:


Is there a public git repo for ublas 2.0?


On Monday, 18 January 2016, 9:25, Oswin Krause <Oswin.Krause@ruhr-uni-bochum.de> wrote:


Hi Palik,

this is a known problem. In your case you should already get better 
performance when using axpy_prod instead of prod. There are currently 
moves towards a ublas 2.0 which should make this a non-problem in the 
future.


On 2016-01-17 21:23, palik imre wrote:
> Hi all,
> 
> It seems that the matrix multiplication in ublas ends up with the
> trivial algorithm.  On my machine, even the following function
> outperforms it for square matrices bigger than 173*173 (by a huge
> margin for matrices bigger than 190*190), while not performing
> considerably worse for smaller matrices:
> 
> matrix<double>
> matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
> {
>  assert(lhs.size2() == rhs.size1());
>  matrix<double> rv(lhs.size1(), rhs.size2());
>  matrix<double> r = trans(rhs);
>  for (unsigned c = 0; c < rhs.size2(); c++)
>    {
>      matrix_column<matrix<double> > out(rv, c);
>      matrix_row<matrix<double> > in(r, c);
>      out = prod(lhs, in);
>    }
>  return rv;
> }
> 
> 
> Is there anybody working on improving the matrix multiplication 
> performance?
> 
> If not, then I can try to find some spare cycles ...
> 
> Cheers,
> 
> Imre Palik

> _______________________________________________
> ublas mailing list
> ublas@lists.boost.org
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: Oswin.Krause@ruhr-uni-bochum.de





_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: michael.lehn@uni-ulm.de


_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: jdurancomas@gmail.com

_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: michael.lehn@uni-ulm.de



_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: athanasios.iliopoulos.ctr.gr@nrl.navy.mil