Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-06-14 20:25:07
On 14/06/2011 23:38, David A. Greene wrote:
> Mathias Gaunard <mathias.gaunard_at_[hidden]> writes:
>> We generate something along the lines of
>>   float tmp = 0.f;
>>   for(int i ...)
>>     tmp += d[i] + e[i];
>>   for(int i ...)
>>     f[i] = b[i] + 3 * c[i] + tmp;
> Will NT2 fuse the loops to get rid of the temporary?
Exactly how can you fuse the loops here?
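This is actually an instance of splitting: we extract the parts that
cannot, or shouldn't, be done in a single loop (or a single kernel).
Here every f[i] needs the complete sum in tmp, so the reduction has to
finish before the second loop can start.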
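To make the dependency concrete, here is a minimal sketch of the two generated loops written out in plain C++. It assumes the source expression was something like f = b + 3*c + sum(d + e) on 1-D tables of floats; the function name evaluate_split and the use of std::vector are hypothetical choices for illustration, not NT2's actual output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical evaluation of f = b + 3*c + sum(d + e) for 1-D tables.
// The reduction is split into its own loop: every element of f needs the
// complete sum, so the second loop cannot start until the first one ends.
void evaluate_split(std::vector<float>& f,
                    const std::vector<float>& b, const std::vector<float>& c,
                    const std::vector<float>& d, const std::vector<float>& e)
{
    assert(b.size() == f.size() && c.size() == f.size());
    assert(e.size() == d.size());

    // Loop 1: reduction of d + e into a scalar temporary.
    float tmp = 0.f;
    for (std::size_t i = 0; i < d.size(); ++i)
        tmp += d[i] + e[i];

    // Loop 2: element-wise expression that reuses the finished reduction.
    for (std::size_t i = 0; i < f.size(); ++i)
        f[i] = b[i] + 3 * c[i] + tmp;
}
```

Fusing the two loops would mean computing f[0] while tmp still holds only d[0] + e[0], which gives a different result.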
> Does it do
> strip-mining or other such things (beyond that needed for
> vectorization)? Does NT2 try to generate a loop nest with the
> appropriate loops interchanged to improve performance?
Loops are generated in cache-friendly order, obviously. Smarter
transformations are usually only applied to higher-level abstractions
than simple tables.
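As an illustration of what "cache-friendly order" means for a 2-D table, the sketch below assumes column-major storage as in Matlab, so element (i, j) sits at data[i + j * rows]. The Table2D type and add_tables function are hypothetical; the point is only that the innermost loop runs over the index that is contiguous in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D table stored in column-major order (assumption):
// element (i, j) lives at data[i + j * rows].
struct Table2D {
    std::size_t rows, cols;
    std::vector<float> data;
    Table2D(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
    float& operator()(std::size_t i, std::size_t j) { return data[i + j * rows]; }
    float operator()(std::size_t i, std::size_t j) const { return data[i + j * rows]; }
};

// out = a + b, with the innermost loop running over rows so that
// consecutive iterations touch consecutive memory addresses.
void add_tables(Table2D& out, const Table2D& a, const Table2D& b)
{
    assert(a.rows == b.rows && a.cols == b.cols);
    assert(out.rows == a.rows && out.cols == a.cols);
    for (std::size_t j = 0; j < out.cols; ++j)      // outer loop: columns
        for (std::size_t i = 0; i < out.rows; ++i)  // inner loop: contiguous rows
            out(i, j) = a(i, j) + b(i, j);
}
```

With row-major storage the two loops would simply be swapped.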
Fusing the loops of different expressions is somewhat limited by what we
know statically about the sizes of the tables involved.
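A small sketch of why static size information matters for fusing two independent statements such as x = a + b; y = a * b. It does not describe how NT2 decides when to fuse; fused_static and fused_dynamic are hypothetical names. With compile-time sizes the types already guarantee that the iteration spaces match; with run-time sizes, sharing one loop is only valid after checking that they do.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Sizes known at compile time: the type guarantees all four tables have N
// elements, so both statements can share one loop with no check.
template <std::size_t N>
void fused_static(std::array<float, N>& x, std::array<float, N>& y,
                  const std::array<float, N>& a, const std::array<float, N>& b)
{
    for (std::size_t i = 0; i < N; ++i) {
        x[i] = a[i] + b[i];  // first expression
        y[i] = a[i] * b[i];  // second expression, same loop
    }
}

// Sizes only known at run time: sharing one loop is valid only if the sizes
// agree, which has to be verified while the program runs.
void fused_dynamic(std::vector<float>& x, std::vector<float>& y,
                   const std::vector<float>& a, const std::vector<float>& b)
{
    assert(b.size() == a.size() && x.size() == a.size() && y.size() == a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        x[i] = a[i] + b[i];
        y[i] = a[i] * b[i];
    }
}
```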
> I am really, really interested in this. Abstracting loops for HPC is a
> really good idea, in my mind. It would be best if there was an option
> to leave the resulting loops scalar in case the user wants to try to
> have the compiler vectorize them.
All components of the system are meant to be independent, so that you
can use only the parts you want.
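One way to keep loop generation independent of vectorization is to make the step width a policy, so a width of 1 leaves a plain scalar loop for the compiler to auto-vectorize. This is a hypothetical sketch, not NT2's interface: scalar_step, wide_step and add_with are made-up names, and wide_step uses an inner loop of four elements as a stand-in for a real SIMD register type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// scalar_step: one element per iteration; the loop stays scalar and any
// vectorization is left to the compiler.
struct scalar_step { static constexpr std::size_t width = 1; };
// wide_step: four elements per iteration, standing in for a 4-float SIMD
// register (a real version would use a SIMD type, not a plain inner loop).
struct wide_step { static constexpr std::size_t width = 4; };

// Generates the loop for out = a + b; only the step width depends on Policy.
template <class Policy>
void add_with(std::vector<float>& out, const std::vector<float>& a,
              const std::vector<float>& b)
{
    static_assert(Policy::width > 0, "step width must be positive");
    assert(a.size() == b.size() && out.size() == a.size());
    constexpr std::size_t w = Policy::width;
    const std::size_t n = out.size();
    std::size_t i = 0;
    // Main loop: w elements per step.
    for (; i + w <= n; i += w)
        for (std::size_t k = 0; k < w; ++k)
            out[i + k] = a[i + k] + b[i + k];
    // Remainder: elements that do not fill a whole step.
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Both instantiations compute the same result; only the shape of the generated loop changes.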