
Subject: Re: [boost] [GSOC]SIMD Library
From: Gruenke, Matt (mgruenke_at_[hidden])
Date: 2011-04-12 06:13:37


On Wednesday, March 30, 2011 02:35, Joel Falcou wrote:

> On 30/03/11 08:04, Gruenke, Matt wrote:

[snip]

> > template< int i, typename V> T get_element( V );
> > template< int i, typename V> V set_element( V, T );
>
> get_element is operator[] on pack and native. How do you do
> set_element, every solution i found was either an UB or slow.

I used shuffle, where possible. I think it's only supported for 16-bit
elements or larger, on MMX/SSE2. I don't remember if I implemented it
using shift, mask, and OR, for 8-bit elements, or if I just left it
undefined for 8-bit.
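For what it's worth, the shift/mask/OR fallback for 8-bit elements can be
sketched in scalar form, with a 64-bit word standing in for the SIMD
register (set_element8/get_element8 are illustrative names, not from any
library):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of eight 8-bit lanes packed into a 64-bit word.
// set_element via shift, mask, and OR -- the fallback for element
// widths that shuffle-based insertion can't reach.
template<int I>
std::uint64_t set_element8(std::uint64_t v, std::uint8_t x)
{
    static_assert(I >= 0 && I < 8, "lane index out of range");
    const std::uint64_t mask = std::uint64_t(0xFF) << (I * 8);
    // Clear the target lane, then OR in the new byte.
    return (v & ~mask) | (std::uint64_t(x) << (I * 8));
}

template<int I>
std::uint8_t get_element8(std::uint64_t v)
{
    static_assert(I >= 0 && I < 8, "lane index out of range");
    return std::uint8_t(v >> (I * 8));
}
```

A real SIMD version would do the same three operations on vector
registers, with the mask built once per lane index.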

> By looking at your
> prototype I guess you replicate V and change the element in a memory
> buffer ?

I'm pretty sure I avoided memory for just about everything but
initialization. I even went as far as circumventing the normal register
copy instruction, where possible, which was strangely slow on P4's.

> The real powerful function is Altivec permute but it is harder to find
> a proper abstraction of it.

Perhaps you can at least think of a way to use a static assert to
enforce its inherent limitations. If permute's limitations are as the
name suggests, then you can use the element indices to set bits in a
vector and assert that all bits have been set. But maybe the compiler
already does that for you.
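A sketch of that bit-vector check, assuming C++14 constexpr and the
"true permutation" reading of permute's limits (covers_every_lane is an
illustrative name, not from any library):

```cpp
// Treat each index in a permute pattern as a bit; require that the
// pattern touches every lane, i.e. the indices form a permutation.
template<unsigned... Is>
constexpr bool covers_every_lane()
{
    unsigned bits = 0;
    unsigned idx[] = { Is... };
    for (unsigned i : idx)
        bits |= 1u << i;                    // set the bit for each lane used
    return bits == (1u << sizeof...(Is)) - 1u;  // all lanes covered?
}

static_assert(covers_every_lane<3, 1, 0, 2>(), "not a permutation");
// covers_every_lane<0, 0, 1, 2>() would be false: lane 3 is never used.
```

Since the indices are template parameters, the whole check folds away at
compile time and a bad pattern fails the static_assert.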

> > template< typename V> void store_uncached( V, V * );
> > // avoids cache pollution
>
> Does it make any real difference ? All tests I ran gave me minimal
> amount of speed-up. I'm curious to hear your experience and add it if
> needed.

Well, it's all about context. It doesn't make your writes faster. In
fact, small bursts will actually be slower. However, if you're
protecting something else in cache, then it can definitely pay off.

It should also improve hyperthreading performance (again, assuming
you're not going to read the written data for a while).
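For reference, a minimal streaming-store copy along these lines,
assuming SSE2, 16-byte-aligned pointers, and a count that's a multiple
of four (copy_uncached is an illustrative name):

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_sfence

// Write-combining stores go around the cache, so a large output buffer
// doesn't evict whatever working set you're trying to protect.
void copy_uncached(const std::int32_t* src, std::int32_t* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);  // non-temporal
    }
    _mm_sfence();  // make the streaming stores globally visible
}
```

The sfence at the end matters: streaming stores are weakly ordered, so
without it another thread may not yet see the written data.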

> > I'm also a fan of having a set of common, optimized 1-D operations,
> > such as buffer packing/interleaving&
> > unpacking/deinterleaving, extract/insert columns, convolution,
> > dot-product, SAD, FFT, etc.
>
> some are actually function working on the pack level. std::fold or
> transform gets you to the 1D version. Some makes few sense.

Often, I find the need to do things like de-interleave a scanline or
tile of data, do some processing on the channels, and then re-interleave
it. Processing at this granularity usually allows everything to stay in
L1 cache.
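As a scalar illustration of that pattern (hypothetical helper names; a
SIMD version would replace the inner loops with unpack/pack
instructions):

```cpp
#include <cstddef>
#include <cstdint>

// Split an interleaved RGB scanline into per-channel planes small
// enough to stay in L1, so the channels can be processed separately.
void deinterleave_rgb(const std::uint8_t* in, std::size_t pixels,
                      std::uint8_t* r, std::uint8_t* g, std::uint8_t* b)
{
    for (std::size_t i = 0; i < pixels; ++i) {
        r[i] = in[3 * i + 0];
        g[i] = in[3 * i + 1];
        b[i] = in[3 * i + 2];
    }
}

// Re-interleave the processed planes back into RGBRGB... order.
void interleave_rgb(const std::uint8_t* r, const std::uint8_t* g,
                    const std::uint8_t* b, std::size_t pixels,
                    std::uint8_t* out)
{
    for (std::size_t i = 0; i < pixels; ++i) {
        out[3 * i + 0] = r[i];
        out[3 * i + 1] = g[i];
        out[3 * i + 2] = b[i];
    }
}
```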

Efficient transpose (or at least extracting a batch of columns into
horizontal buffers) is also very important.
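One building block for that, assuming SSE and 16-byte-aligned float
data, is the stock 4x4 in-register transpose (transpose4x4 is an
illustrative wrapper name):

```cpp
#include <xmmintrin.h>  // SSE: _mm_load_ps, _MM_TRANSPOSE4_PS

// Transpose a 4x4 float tile entirely in registers; larger transposes
// can be tiled out of 4x4 blocks like this one.
void transpose4x4(const float* in, float* out)
{
    __m128 r0 = _mm_load_ps(in + 0);
    __m128 r1 = _mm_load_ps(in + 4);
    __m128 r2 = _mm_load_ps(in + 8);
    __m128 r3 = _mm_load_ps(in + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // rows become columns, no memory traffic
    _mm_store_ps(out + 0,  r0);
    _mm_store_ps(out + 4,  r1);
    _mm_store_ps(out + 8,  r2);
    _mm_store_ps(out + 12, r3);
}
```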

> > Keep it low-level, though. IMO, any sort of high-level
> > abstraction that ships data off to different accelerator
> > back-ends, like GPUs, is a different animal and should go
> > in a different library.
>
> That's the goal of NT2 as a whole.

That's a fine thing to do - just not something I want mixed into my SIMD
library. Since this is all about performance, whatever I use needs to
give me the option to drop down to the next lower level if I find it
necessary to get more performance in some hot spots.

Thank you for the work you're doing on this. I look forward to seeing
more.

Matt


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk