|
Ublas : |
Subject: Re: [ublas] Status of development /Benchmarks
From: Nasos Iliopoulos (nasos_i_at_[hidden])
Date: 2013-12-09 11:23:11
Alright,
the kernel fusion idea is indeed something that must be considered. I
have seen some of the work you have done with respect to similar
implementations and I can say they are impressive. In the specific case
though I don't find the reason why this shouldn't work with small/large
containers or combinations of those. You think that such a fusion
operation cannot be done in a compile-time implementation that would
benefit small (static) containers?
-Nasos
On 12/09/2013 11:08 AM, Karl Rupp wrote:
> Hi Nasos,
>
> > I am not so sure the requirements for smal /large containers are so
>> diverse. After all you can have compile-time dispatching for small
>> (static) or large (dynamic) containers if you want to use different
>> algorithms for each or for mixed cases. Can you please elaborate if I am
>> not getting it right?
>
> Sure, here's one example: Consider
> y = prod(A, x);
> alpha = inner_prod(x, y);
> with sparse matrix A. This is fairly common in iterative solvers (e.g.
> conjugate gradients). With a 'standard' expression template execution,
> you'll load x and y twice from memory, which can be a substantial
> overhead if A is very sparse (say, a 2D finite difference stencil).
> However, the operation can also be carried out as follows:
> alpha = 0;
> for (rows i) {
> y[i] = 0;
> for (cols j)
> y[i] += A(i,j) * x[j];
> alpha += y[i] * x[i];
> }
> This way, y[i] is only loaded to cache once, and x[i] has been loaded
> to cache in the loop over j if A(i,i) is nonzero.
>
> In a massively parallel setting this fusion results in one global
> synchronization (kernel launch, message passing, etc.) instead of two
> for each of the operations.
>
> Best regards,
> Karli
>
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: athanasios.iliopoulos.ctr.gr_at_[hidden]