Boost logo

Boost :

Subject: Re: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives
From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2010-10-13 21:03:53


Smith, Jacob N wrote:
>> I find the argument to investigate a framework for internal
>> iteration / push iteration / visitation iteration / whatever
>> convincing. I haven't been convinced yet (despite Dave's best
>> efforts) that the standard, external iteration model, segmented or
>> otherwise, will absolutely address your efficiency concerns.
>
> I've only had a chance to read the Austern paper "Segmented Iterators
> and Hierarchical Algorithms" once. I agree that the segmented
> iterators are rather appealing as an interface to
> packs-of-pods-datatype. However, I'm not sure the conceptual
> interface to segmented iterators are rich enough. Luke Simonson has
> more experience with the construction of an iterator pattern around
> SIMDs used in a high-performance system than I do. However I seem I
> to remember that the pattern for SIMD iteration always looked like
> this:
>
> Loop A?
> Loop B*
> Loop C?
>
> Where Loops A & C handle the unaligned start/end (if they exist) and
> Loop B* handles (0 or more of) the aligned packs of the sequence.
> Handling the edge-cases for A & C were definitely not the same ---
> there are memory exceptions that can occur at the end-of-range for
> Loop C that can't (or usually don't) occur for Loop A when doing
> unaligned loads/stores. I seem to remember that it was insufficient
> to just mark "not Loop B" but it was necessary to explicitly mark
> foreach operations as "Loop A" vs. "Loop C". It seems that this
> division of Loop A/B/C will need to be exposed through the iterator.

It seems to me a segmented iterator could provide a generic way to, for example, vectorize an algorithm on a deque by letting me unpack the pointer and distance to end of the current contiguous address range. I could implement A?B*C? for each contiguous range without needing access to the internals of the dequeue data structure, however, the dequeue doesn't provide a segmented iterator interface, nor does anything else that I'm aware of, and I don't see that changing, so I don't see the point in implemented vector accelerated algorithms with segmented iterator based interfaces to work with all the segmented iteratior data types that don't exist and probably never will. My work with the vectorized A?B*C? using pointer based interface and generic call back functor argument showed that it could match the performance of hand crafted assembly with all of my loop unrolling, prefetching and similar handiwork. There was a lot of special casing for unaligned begin and end of source and destinatin address ranges. I made great use of it vectorizing some pretty complex general C-style code and got nice speedups for data that was unaligned and address ranges that were often quite small. It was a very productive abstraction and easy to apply. The benefit of abstracting away the complexity of memory access was a big win. Regular iterators didn't perform as well, but I did implement them and use them because they were better than nothing. I didn't look at segmented iterators. I think we might want a different abstraction than segmented iterators to vectorize algorithms that are abstracted away from data structures. Perhaps we should think in terms of arrays instead of scalar elements as the smallest unit. Abstractions and semantics are important exactly because they dictate how you think about a problem. I want simple abstractions and semantics that make the problem at hand easy to think about. So far I haven't heard anyone explain how a segmented iterator makes it easier to vectorize an algorithm and not harder.

Regards,
Luke


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk