|
Ublas : |
Subject: Re: [ublas] Status of development /Benchmarks
From: Nasos Iliopoulos (nasos_i_at_[hidden])
Date: 2013-12-09 13:29:27
Ok, I see, indeed keeping track of the state of the expressions to be
fused is an issue.
-Nasos
On 12/09/2013 01:13 PM, Karl Rupp wrote:
> Hi,
>
> > the kernel fusion idea is indeed something that must be considered. I
>> have seen some of the work you have done with respect to similar
>> implementations and I can say they are impressive. In the specific case
>> though I don't find the reason why this shouldn't work with small/large
>> containers or combinations of those. You think that such a fusion
>> operation cannot be done in a compile-time implementation that would
>> benefit small (static) containers?
>
> I never said that it didn't work with a compile-time implementation in
> place. However, in order to really carry out such fusions, you need to
> have some runtime engine available in order to keep track of the
> state. Expression templates alone can do amazing things, but only for
> a single statement. Now the 'problem' is that at some point you need
> to decide on whether you want to feed that runtime engine, or whether
> you want to run the computation using expression templates straight
> away. In best case the costs are only one if somewhere in the
> implementation of operator=(), but that may already be too much for
> something like 2x2-matrices.
>
> Best regards,
> Karli
>
>
>>
>> On 12/09/2013 11:08 AM, Karl Rupp wrote:
>>> Hi Nasos,
>>>
>>> > I am not so sure the requirements for smal /large containers are so
>>>> diverse. After all you can have compile-time dispatching for small
>>>> (static) or large (dynamic) containers if you want to use different
>>>> algorithms for each or for mixed cases. Can you please elaborate if
>>>> I am
>>>> not getting it right?
>>>
>>> Sure, here's one example: Consider
>>> y = prod(A, x);
>>> alpha = inner_prod(x, y);
>>> with sparse matrix A. This is fairly common in iterative solvers (e.g.
>>> conjugate gradients). With a 'standard' expression template execution,
>>> you'll load x and y twice from memory, which can be a substantial
>>> overhead if A is very sparse (say, a 2D finite difference stencil).
>>> However, the operation can also be carried out as follows:
>>> alpha = 0;
>>> for (rows i) {
>>> y[i] = 0;
>>> for (cols j)
>>> y[i] += A(i,j) * x[j];
>>> alpha += y[i] * x[i];
>>> }
>>> This way, y[i] is only loaded to cache once, and x[i] has been loaded
>>> to cache in the loop over j if A(i,i) is nonzero.
>>>
>>> In a massively parallel setting this fusion results in one global
>>> synchronization (kernel launch, message passing, etc.) instead of two
>>> for each of the operations.
>>>
>>> Best regards,
>>> Karli
>>>
>>>
>>> _______________________________________________
>>> ublas mailing list
>>> ublas_at_[hidden]
>>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>>> Sent to: athanasios.iliopoulos.ctr.gr_at_[hidden]
>>
>> _______________________________________________
>> ublas mailing list
>> ublas_at_[hidden]
>> http://lists.boost.org/mailman/listinfo.cgi/ublas
>> Sent to: rupp_at_[hidden]
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: athanasios.iliopoulos.ctr.gr_at_[hidden]