
Subject: Re: [ublas] Status of development /Benchmarks
From: Nasos Iliopoulos (nasos_i_at_[hidden])
Date: 2013-12-09 13:29:27

Ok, I see, indeed keeping track of the state of the expressions to be
fused is an issue.


On 12/09/2013 01:13 PM, Karl Rupp wrote:
> Hi,
>> the kernel fusion idea is indeed something that must be considered. I
>> have seen some of the work you have done with respect to similar
>> implementations and I can say they are impressive. In the specific case
>> though I don't find the reason why this shouldn't work with small/large
>> containers or combinations of those. You think that such a fusion
>> operation cannot be done in a compile-time implementation that would
>> benefit small (static) containers?
> I never said that it didn't work with a compile-time implementation in
> place. However, in order to really carry out such fusions, you need to
> have some runtime engine available in order to keep track of the
> state. Expression templates alone can do amazing things, but only for
> a single statement. Now the 'problem' is that at some point you need
> to decide on whether you want to feed that runtime engine, or whether
> you want to run the computation using expression templates straight
> away. In the best case the cost is a single `if` somewhere in the
> implementation of operator=(), but even that may already be too much for
> something like 2x2 matrices.
> Best regards,
> Karli
>> On 12/09/2013 11:08 AM, Karl Rupp wrote:
>>> Hi Nasos,
>>>> I am not so sure the requirements for small/large containers are so
>>>> diverse. After all you can have compile-time dispatching for small
>>>> (static) or large (dynamic) containers if you want to use different
>>>> algorithms for each or for mixed cases. Can you please elaborate if
>>>> I am not getting it right?
>>> Sure, here's one example: Consider
>>> y = prod(A, x);
>>> alpha = inner_prod(x, y);
>>> with sparse matrix A. This is fairly common in iterative solvers (e.g.
>>> conjugate gradients). With a 'standard' expression template execution,
>>> you'll load x and y twice from memory, which can be a substantial
>>> overhead if A is very sparse (say, a 2D finite difference stencil).
>>> However, the operation can also be carried out as follows:
>>> alpha = 0;
>>> for (rows i) {
>>>   y[i] = 0;
>>>   for (cols j)
>>>     y[i] += A(i,j) * x[j];
>>>   alpha += y[i] * x[i];
>>> }
>>> This way, y[i] is only loaded to cache once, and x[i] has been loaded
>>> to cache in the loop over j if A(i,i) is nonzero.
>>> In a massively parallel setting this fusion results in one global
>>> synchronization (kernel launch, message passing, etc.) instead of two
>>> for each of the operations.
>>> Best regards,
>>> Karli
>>> _______________________________________________
>>> ublas mailing list
>>> ublas_at_[hidden]
>>> Sent to: athanasios.iliopoulos.ctr.gr_at_[hidden]