Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-28 15:47:35

On 28 Jan 2016, at 21:15, Riccardo Rossi <rrossi_at_[hidden]> wrote:

> i am impressed. 6* on a cuadcore!!

Thanks, but actually two quad cores ;-)

And with more than 6 threads it requires a more fine-grained method to scale well. You have to consider
hierarchies of thread groups. E.g. one group is responsible for packing a block and afterwards multiplying
it multithreaded. At the moment it's like one group with too many members.

> do you also do sparse linear algebra by chance?

Sorry, not directly. I just looked at libraries like SuperLU and UMFPACK, though not as closely as at other BLAS libraries. But
my impression is that this also could be done much more elegantly in C++. The big headache in these libraries is that they basically
have the same code for float, double, complex<float> and complex<double>. Just using C++ as "C plus function templates" would
make it much easier. And the performance-relevant part in these libraries is again a fast dense BLAS.