Subject: Re: [boost] [GSOC]SIMD Library
From: Joel Falcou (joel.falcou_at_[hidden])
Date: 2011-03-30 02:34:41
On 30/03/11 08:04, Gruenke, Matt wrote:
> My experience is mostly with MMX and integer SSE2. A useful approach I've used in the past was to create type-safe wrappers for the various intrinsics. These mostly took the form of overloaded inline functions, though I used templates whenever an immediate integer operand was required. These overloads enabled me to write higher-level templates that supported multiple vector types, even if they were sometimes machine-specific. The templates enabled optimizations of degenerate & special cases.
> Even though this isn't as sophisticated as what proto can do, I think it will be useful to have a fall-back for cases where there are either specialized instructions that aren't easily expressible as expressions or other cases where it's difficult to get proto to generate the instruction sequence you want. Besides type-safety, the wrappers make the code much more readable than using the native intrinsics.
we provide a native<Type,Extension> POD class, not using proto, for such
low-level needs. It is what the pack proto terminal uses as its data.
So basically: if you don't want to think about the extension, use pack;
if you need precise low-level control, use native.
Basically, on an x86 machine, pack<float> wraps a native<float,tag::sse_>.
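To make the two-level split concrete, here is a minimal, purely scalar sketch. The names native, pack, and tag::sse_ come from the post, but the internals are illustrative guesses, not the real NT2/Boost.SIMD implementation:

```cpp
#include <array>
#include <cstddef>

namespace tag { struct sse_ {}; }

// POD low-level wrapper; the real one would hold a hardware register,
// e.g. an __m128 for native<float, tag::sse_>. A plain array stands in
// here so the sketch stays portable.
template <typename T, typename Extension>
struct native {
    std::array<T, 4> data;
    T operator[](std::size_t i) const { return data[i]; }
};

// pack hides the extension choice and uses native as its terminal data.
template <typename T>
struct pack {
    native<T, tag::sse_> impl;
    T operator[](std::size_t i) const { return impl[i]; }
};
```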
> A few functions and templates implemented idioms and tricks for doing common tasks, like loading a vector with
> zeros (hint: xor anything with itself)
except on Altivec, where vec_splat_u8(0) is faster :p
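For what it's worth, the SSE2 spelling of both idioms (the xor-zero trick and the all-ones mask) can be sketched as plain wrappers; these are illustrative, not the library's actual constant generators:

```cpp
#include <emmintrin.h>  // SSE2

// Zero vector: _mm_setzero_si128() compiles to pxor reg,reg on most
// compilers, i.e. the "xor anything with itself" trick.
static inline __m128i zero_int() { return _mm_setzero_si128(); }

// All bits set: x == x is true in every lane, and a true comparison
// lane is all ones.
static inline __m128i full_mask() {
    const __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}
```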
> template< typename V> V zero(); // generates a vector of 0's
> template< typename V> V full_mask(); // sets all bits to 1
we have constant generators for this, plus untyped proto-based constant
placeholders, so:
pack<float> p(1,2,3,4), q = p * pi_;
does what you think it does :p
> template< int i, typename V> T get_element( V );
> template< int i, typename V> V set_element( V, T );
get_element is operator[] on pack and native. How do you do set_element?
Every solution I found was either UB or slow. Looking at your
prototype, I guess you replicate V and change the element in a memory
buffer? We ended up enforcing that SIMD vectors are immutable at the
elementwise level. SSE4.x provides extract/insert instructions that we
expose as free functions, IIRC.
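For reference, the memory-buffer fallback guessed at above looks like this in SSE2-only code. set_element here is a hypothetical free function returning a new vector, in keeping with elementwise immutability:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Spill, patch one lane, reload: slow, but well-defined on any SSE2
// target. SSE4.1 pinsrd would do the same in-register.
template <int I>
__m128i set_element(__m128i v, std::int32_t x) {
    static_assert(I >= 0 && I < 4, "lane index out of range");
    alignas(16) std::int32_t buf[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(buf), v);
    buf[I] = x;
    return _mm_load_si128(reinterpret_cast<const __m128i*>(buf));
}
```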
> template< int i, int j, ...> V shuffle( V ); // rearranges the elements in V.
we have that, but we are still pondering whether it should be
p.shuffle(mask), mapped over fusion::nview, or something fancy like p.xzyw().
Note that on SSE2, shuffle is not a full permute and is not provided for all types.
The really powerful primitive is Altivec's permute, but it is harder to
find a proper abstraction for it.
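To illustrate the SSE point: shufps draws its low two result lanes from the first operand and its high two from the second, so a single-vector permute has to pass the same register twice. A small, hypothetical wrapper:

```cpp
#include <xmmintrin.h>  // SSE

// Compile-time lane permute for one float vector: result lane 0 gets
// v[A], lane 1 gets v[B], and so on. Passing v as both operands turns
// the two-operand shufps into a single-vector permute.
template <int A, int B, int C, int D>
__m128 permute(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(D, C, B, A));
}
```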
> template< int n, typename T> V load( T * ); // loads n lowest elements of T[]
> template< int n, typename T> void store( V, T * ); // stores n lowest elements of V
load, store and splat are done. load and store are also wrapped as
iterators for pack and native if needed.
> template< typename V> void store_uncached( V, V * ); // avoids cache pollution
Does it make any real difference? All the tests I ran gave me a minimal
amount of speed-up. I'm curious to hear about your experience, and we
can add it if needed.
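For context, the usual spelling of this is the SSE non-temporal store; a sketch, where the name store_uncached mirrors the quoted prototype and is not an existing library function:

```cpp
#include <xmmintrin.h>  // SSE

// movntps writes around the cache hierarchy; the fence makes the
// write-combined data globally visible before we return. This pays off
// mainly for large write-only buffers that won't be re-read soon.
static inline void store_uncached(float* dst /* 16-byte aligned */, __m128 v) {
    _mm_stream_ps(dst, v);
    _mm_sfence();
}
```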
> template< typename T, typename V> T horizontal_sum( V ); // sum of all elements in V
float k = sum(p) :p. It uses add + shuffle on SSE < 3, hadd otherwise,
and vec_sum on Altivec.
Note that if you want to sum the elements of a 1D vector, accumulating
with + on pack works out of the box and returns a pack. So
pack<float> s = std::accumulate(simd::begin(v), simd::end(v), zero_);
float r = sum(s);
gives you the sum of a 1D vector in two lines.
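The add + shuffle reduction mentioned for SSE < 3 can be sketched like this; hsum is an illustrative name, not the library's sum:

```cpp
#include <xmmintrin.h>  // SSE

// Two swizzle/add steps collapse the four lanes into lane 0.
static inline float hsum(__m128 v) {
    __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));  // {v0+v2, v1+v3, ...}
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));     // lane0 += lane1
    return _mm_cvtss_f32(t);
}
```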
> I'm also a fan of having a set of common, optimized 1-D operations, such as buffer packing/interleaving &
> unpacking/deinterleaving, extract/insert columns, convolution, dot-product, SAD, FFT, etc.
some of those are actually functions working at the pack level; std::fold or
transform gets you the 1D version. Some make little sense: FFT in
SIMD in 1D is not low level for me, and out of my league atm.
> Keep it low-level, though. IMO, any sort of high-level abstraction that ships data off to different accelerator back-ends,
> like GPUs, is a different animal and should go in a different library.
That's the goal of NT2 as a whole.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk