
Subject: Re: [ublas] Status of development /Benchmarks
From: Karl Rupp (rupp_at_[hidden])
Date: 2013-12-09 11:08:45

Hi Nasos,

> I am not so sure the requirements for smal /large containers are so
> diverse. After all you can have compile-time dispatching for small
> (static) or large (dynamic) containers if you want to use different
> algorithms for each or for mixed cases. Can you please elaborate if I am
> not getting it right?

Sure, here's one example: Consider
  y = prod(A, x);
  alpha = inner_prod(x, y);
with sparse matrix A. This is fairly common in iterative solvers (e.g.
conjugate gradients). With a 'standard' expression template execution,
you'll load x and y twice from memory, which can be a substantial
overhead if A is very sparse (say, a 2D finite difference stencil).
However, the operation can also be carried out as follows:
  alpha = 0;
  for (rows i) {
    y[i] = 0;
    for (cols j)
      y[i] += A(i,j) * x[j];
    alpha += y[i] * x[i];
  }
This way, y[i] is only loaded to cache once, and x[i] has been loaded to
cache in the loop over j if A(i,i) is nonzero.

In a massively parallel setting this fusion requires only one global
synchronization (kernel launch, message passing, etc.) instead of one
for each of the two operations.

Best regards,