Boost :

Date view	Thread view	Subject view	Author view

Subject: Re: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives
From: Smith, Jacob N (jacob.n.smith_at_[hidden])
Date: 2010-10-13 19:39:18

Next message: Joel de Guzman: "Re: [boost] Metaprogrammers, all of you!"
Previous message: Anthony Williams: "Re: [boost] [system][chrono] header-only libs"
In reply to: Jeffrey Lee Hellrung, Jr.: "Re: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives"
Next in thread: Simonson, Lucanus J: "Re: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives"
Reply: Simonson, Lucanus J: "Re: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives"
Reply: Joel.Falcou_at_[hidden]: "Re: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives"
Reply: David Abrahams: "Re: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives"

> -----Original Message-----
> From: boost-bounces_at_[hidden] [mailto:boost-
> bounces_at_[hidden]] On Behalf Of Jeffrey Lee Hellrung, Jr.
> Sent: Wednesday, October 13, 2010 1:56 PM
> To: boost_at_[hidden]
> Subject: Re: [boost] Accelerating algorithms with SIMD - Segmented
> iterators and alternatives
>
> On 10/13/2010 01:43 PM, Mathias Gaunard wrote:
> > [Repost: first one doesn't appear to have made it through]
> >
> > Just trying to refocus this thread to what I meant it to be.
> [...]
> > What I would like to know, is how people think we could integrate
> this
> > system into iterators and ranges so that existing algorithms could be
> > adapted to treat N elements at a time instead of 1, and therefore get
> a
> > potential speed gain.
> >
> > As I said, we currently have an iterator adapter that adapts a range
> of
> > properly aligned Ts, that are also contiguous at least in chunks of N
> > elements, into a range of pack<T, N>s.
> >
> > Are there other utilities we could provide to help with the usage of
> SIMD?
> > I was thinking of supporting adapting non-aligned ranges as well and
> > padding them with some values, but that is not possible to do
> > efficiently with the standard iterator model, which led to some
> > discussion about an alternative iterator model that seems to be going
> > nowhere.
>
> "seems" being the operative word ;)
>
> I find the argument to investigate a framework for internal iteration /
> push iteration / visitation iteration / whatever convincing. I haven't
> been convinced yet (despite Dave's best efforts) that the standard,
> external iteration model, segmented or otherwise, will absolutely
> address your efficiency concerns.

I've only had a chance to read the Austern paper "Segmented Iterators and Hierarchical Algorithms" once. I agree that the segmented iterators are rather appealing as an interface to packs-of-pods-datatype. However, I'm not sure the conceptual interface to segmented iterators are rich enough. Luke Simonson has more experience with the construction of an iterator pattern around SIMDs used in a high-performance system than I do. However I seem I to remember that the pattern for SIMD iteration always looked like this:

Loop A?
Loop B*
Loop C?

Where Loops A & C handle the unaligned start/end (if they exist) and Loop B* handles (0 or more of) the aligned packs of the sequence. Handling the edge-cases for A & C were definitely not the same --- there are memory exceptions that can occur at the end-of-range for Loop C that can't (or usually don't) occur for Loop A when doing unaligned loads/stores. I seem to remember that it was insufficient to just mark "not Loop B" but it was necessary to explicitly mark foreach operations as "Loop A" vs. "Loop C". It seems that this division of Loop A/B/C will need to be exposed through the iterator.

My other concern is that while it should be fairly straightforward to provide efficient SIMD instruction sequences for addition, subtraction, multiplication, (refined) division, and related fused-accumulate ops, many of the more "interesting" SIMD algorithms are highly sensitive to swizzling & permutation, prefetching, built-in conversion operations, etc.

A SIMD library exposing just (for instance) the Ring or Field algebraic operations does have its uses; however, I feel that in most "real world" scenarios the unexposable details needed for high-performance coding will mean that the library is only used for toy applications. That is, unless you think you can also define platform-independent forwarding to swizzle/permute/conditional-lane-usage/etc.

Another (particular ugly) use of SIMDs is to allow per-object aligned storage & load of structures into/outof binary array blobs. This would basically look like an allocator for an object that uses the SIMD extensions for loading/storing data-structures as binary blobs into and out of arrays. Have you considered an interface for this sort of usage?

> Have you benchmarked the performance difference between the standard
> iteration model and push iteration?
>
> - Jeff
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost

Date view	Thread view	Subject view	Author view

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk