|
Ublas : |
Subject: Re: [ublas] Status of development /Benchmarks
From: Karl Rupp (rupp_at_[hidden])
Date: 2013-12-09 11:08:45
Hi Nasos,
> I am not so sure the requirements for smal /large containers are so
> diverse. After all you can have compile-time dispatching for small
> (static) or large (dynamic) containers if you want to use different
> algorithms for each or for mixed cases. Can you please elaborate if I am
> not getting it right?
Sure, here's one example: Consider
y = prod(A, x);
alpha = inner_prod(x, y);
with sparse matrix A. This is fairly common in iterative solvers (e.g.
conjugate gradients). With a 'standard' expression template execution,
you'll load x and y twice from memory, which can be a substantial
overhead if A is very sparse (say, a 2D finite difference stencil).
However, the operation can also be carried out as follows:
alpha = 0;
for (rows i) {
y[i] = 0;
for (cols j)
y[i] += A(i,j) * x[j];
alpha += y[i] * x[i];
}
This way, y[i] is only loaded to cache once, and x[i] has been loaded to
cache in the loop over j if A(i,i) is nonzero.
In a massively parallel setting this fusion results in one global
synchronization (kernel launch, message passing, etc.) instead of two
for each of the operations.
Best regards,
Karli