# Boost :

Subject: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-10-11 11:19:13

Hi,

I have, along with Joel Falcou, been looking a bit at how we can
integrate SIMD into iterator-based algorithms to accelerate them.
Unfortunately, there are a couple of issues.

The basic idea is to provide a range adaptor that adapts of range of
PODs, where N consecutive PODs are stored contiguously, into a range of
vectors (or "packs") of those PODs of size N.

std::vector<float> tab(20);
float sum = boost::accumulate(tab, 0);

we can write something like:

std::vector< float, simd::allocator<float> > tab(20);
float sum = simd::sum(
boost::accumulate(
simd::packed_range<4>(tab),
simd::zero_
)
);

Which does the same but using SSE instructions, adding four floats at a
time, and typically providing a 4x speed-up.
This can be applied to have the same result for any binary operation
which is commutative and associative (which unfortunately isn't the case
of many operations, but can still be useful).

Here, however, it works because 1) the memory is well aligned thanks to
the use of our custom allocator and 2) the size, 20, is a multiple of
the pack size, 4.

We'd be interested in providing a mechanism where we could also support
ranges of data aligned at an arbitrary address and of an arbitrary size.
The SSE would only be used in the middle, the borders being handled
element per element.

A first solution consists in padding the beginning and the end with zeroes.
Say you have 1 2 3 4 5 6 7
with only 3 being aligned well enough for SSE usage.
so we'd turn that into | 0 0 1 2 | 3 4 5 6 | 7 0 0 0 |.
However, it is very important that the code to deal with borders does
not affect the efficiency of the code in the middle. This is not
possible with the normal iterator/range design.

Our data is clearly cut into three segments: the beginning that is not
aligned, the data that is rightly aligned and sized, and the data at the
end that is not big enough to fill a whole pack.
Therefore, segmented iterators could provide us with a solution to do
this efficiently; but so could other solutions as well.

Wouldn't it be better to be able to use the mechanism so that you could
deal with the borders in an element-per-element way while you could deal
with the middle bit in a vector-per-vector way?
With C++ iterators, you pull data, but if we could use an iteration
paradigm where you push data instead, such as calling a function object
for each element, we could call an overloaded function that would be
able to accept both a single element and a pack.
Such a solution would also provide the speed benefits of segmented
iterators, and all high-level standard algorithms can be expressed with
this iteration paradigm just as well as with iterators.

I have already suggested such an alternative to segmented iterators
they broke compatibility, but they don't, they still maintain the legacy
non-segmented interface) there: