Boost logo

Ublas :

Subject: Re: [ublas] Status of development /Benchmarks
From: Karl Rupp (rupp_at_[hidden])
Date: 2013-12-09 13:13:20


> the kernel fusion idea is indeed something that must be considered. I
> have seen some of the work you have done with respect to similar
> implementations and I can say they are impressive. In the specific case
> though I don't find the reason why this shouldn't work with small/large
> containers or combinations of those. You think that such a fusion
> operation cannot be done in a compile-time implementation that would
> benefit small (static) containers?

I never said that it didn't work with a compile-time implementation in
place. However, in order to really carry out such fusions, you need to
have some runtime engine available in order to keep track of the state.
Expression templates alone can do amazing things, but only for a single
statement. Now the 'problem' is that at some point you need to decide on
whether you want to feed that runtime engine, or whether you want to run
the computation using expression templates straight away. In best case
the costs are only one if somewhere in the implementation of
operator=(), but that may already be too much for something like

Best regards,

> On 12/09/2013 11:08 AM, Karl Rupp wrote:
>> Hi Nasos,
>> > I am not so sure the requirements for smal /large containers are so
>>> diverse. After all you can have compile-time dispatching for small
>>> (static) or large (dynamic) containers if you want to use different
>>> algorithms for each or for mixed cases. Can you please elaborate if I am
>>> not getting it right?
>> Sure, here's one example: Consider
>> y = prod(A, x);
>> alpha = inner_prod(x, y);
>> with sparse matrix A. This is fairly common in iterative solvers (e.g.
>> conjugate gradients). With a 'standard' expression template execution,
>> you'll load x and y twice from memory, which can be a substantial
>> overhead if A is very sparse (say, a 2D finite difference stencil).
>> However, the operation can also be carried out as follows:
>> alpha = 0;
>> for (rows i) {
>> y[i] = 0;
>> for (cols j)
>> y[i] += A(i,j) * x[j];
>> alpha += y[i] * x[i];
>> }
>> This way, y[i] is only loaded to cache once, and x[i] has been loaded
>> to cache in the loop over j if A(i,i) is nonzero.
>> In a massively parallel setting this fusion results in one global
>> synchronization (kernel launch, message passing, etc.) instead of two
>> for each of the operations.
>> Best regards,
>> Karli
>> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden]
>> Sent to: athanasios.iliopoulos.ctr.gr_at_[hidden]
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> Sent to: rupp_at_[hidden]