Subject: Re: [boost] Stacking iterators vs. dataflow
From: Giovanni Piero Deretta (gpderetta_at_[hidden])
Date: 2008-09-05 02:30:42


On Thu, Sep 4, 2008 at 4:11 AM, David Abrahams <dave_at_[hidden]> wrote:
>
> on Wed Sep 03 2008, "Giovanni Piero Deretta" <gpderetta-AT-gmail.com> wrote:
>
>> On Thu, Sep 4, 2008 at 1:48 AM, David Abrahams <dave_at_[hidden]> wrote:
>>>
>>> on Wed Sep 03 2008, "Giovanni Piero Deretta" <gpderetta-AT-gmail.com> wrote:
>>>
>>>> Given an appropriate buffering size (a memory page?) you could hide
>>>> the buffering step inside an interator adaptor, which, instead of
>>>> producing every N'th value on the fly, would batch the production of
>>>> enough elements to fill the buffer.
>>>>
>>>> David: BTW, I think that you can use exactly the same abstraction used
>>>> for segmented iterators to expose the buffering capability of a
>>>> buffered iterator adaptor.
>>>
>>> Yes, an iterator with a backing buffer would work great as a segmented
>>> iterator.
>>>
>>
>> So, do you think that buffering could be a good approach to help
>> reduce the abstraction overhead of stacked iterator adapters?
>
> I guess I haven't caught on to your line of thinking. I certainly don't
> see how buffering could be specifically useful when iterator adaptations
> are nested.
>

[this is a bit of brainstorming; the idea is still quite fuzzy even
in my mind, so feel free to ignore me]

The idea is: stacked iterators are nice because they are a form of
loop fusion, and they are lazy.

Problem: the fused loop is inside out, and it takes formidable
compiler effort to put it back into a more straightforward form,
especially if you have many levels of nesting.

Now, why is loop fusion good?
1) you perform your algorithm in a single pass instead of multiple
passes (i.e. you do more work at every step), so the loop overhead is
reduced.
2) each layer in the stack consumes an element produced by the layer
above it, so performing all operations for each element in sequence
often leads to better cache locality.
3) there is no need to create intermediate values, which implies no
dynamic memory allocation and, most importantly, a smaller working set.

Now, 1 is debatable: on modern CPUs, fusing loops merely to reduce
loop overhead can be counterproductive; the tighter the loop, the
faster it executes (Intel Core 2 CPUs, for example, are specifically
optimized for loops that fit in ~64 bytes). In theory a very large
fused loop could even fall out of the code cache, but that would be a
very degenerate case. Anyway, loop fusion is not necessarily a must
have, and this proposal still allows it in many cases.

OTOH, 2 and 3 are desirable: we do not want to execute each whole
logical pass sequentially; interleaving their execution is fundamental
for good performance. But the interleaving granularity does not have
to be a single element.

This is where buffering enters into play.

Let's make a variant of the copy_n algorithm a customization point for
an iterator range. This is the signature: OIter copy_n(Range& range,
size_t& n, OIter o) (note that both range and n are passed by
non-const reference). And this is the default implementation:

template<class MultiPassInputRange, class OIter>
OIter copy_n(MultiPassInputRange& r, size_t& n, OIter o) {
    auto begin = r.begin();
    auto end = r.end();
    for(; n && begin != end; ++begin, --n, ++o)
        *o = *begin;
    // advance the range past the consumed prefix
    r = MultiPassInputRange(begin, end);
    return o;
}

[ there can be faster specializations for random access ranges and for
pointer ranges of course]
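For instance, a random access version might look like the sketch below
(copy_n_random_access is just an illustrative name; in real code it
would be an overload selected via tag dispatch or enable_if on the
range's traversal category):

#include <algorithm> // std::copy, std::min

// Illustrative sketch only of a random access "specialization": the
// chunk size is computed up front so the inner loop has a single,
// easily optimized bound.
template<class RARange, class OIter>
OIter copy_n_random_access(RARange& r, size_t& n, OIter o) {
    size_t k = (std::min)(n, size_t(r.end() - r.begin()));
    o = std::copy(r.begin(), r.begin() + k, o); // tight, vectorizable loop
    r = RARange(r.begin() + k, r.end());        // consume the copied prefix
    n -= k;
    return o;
}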

Note that you can implement, in terms of copy_n, the standard
algorithms that do a single traversal of a complete range: all the
copy variants are straightforward; for_each just requires a
function_output_iterator; so does accumulate, except that the functor
is stateful; count is just a variant of accumulate. You can even
implement some mutating algorithms, but then I do not think you can
still call OIter just an output iterator (*o = *begin can actually
change *begin).
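
For example, a minimal sketch of for_each on top of copy_n (it assumes
the default copy_n above and Boost's function_output_iterator; std::ref
is used so the possibly stateful functor is shared rather than copied):

#include <boost/function_output_iterator.hpp>
#include <functional> // std::ref (C++0x)

// Sketch: for_each in terms of copy_n. Every element "copied" into the
// function output iterator is really just passed to f.
template<class Range, class F>
F for_each(Range r, F f) {
    size_t n = size_t(-1); // effectively "until the range is exhausted"
    copy_n(r, n, boost::make_function_output_iterator(std::ref(f)));
    return f;
}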

Algorithms that do early exits would actually traverse a constant
number of elements more than necessary (at most K-1, where K is the
buffering size; strictly speaking, I do not think that such an
implementation would be conforming, because the standard states the
exact number of comparisons).
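
To make the overshoot concrete, a hypothetical find_if built on copy_n
could look like this (all names invented; it pulls K elements at a time
into a local buffer, so up to K-1 elements past the match get consumed):

#include <boost/optional.hpp>

// Hypothetical sketch: find_if via copy_n. It returns the matching
// value rather than an iterator, since buffered elements have been
// copied out of the range. Assumes a default constructible value type.
template<class Range, class Pred>
boost::optional<typename Range::value_type> find_if_buffered(Range& r, Pred p) {
    enum { K = 64 }; // buffering granularity
    typename Range::value_type buf[K];
    for(;;) {
        size_t n = K;
        typename Range::value_type* last = copy_n(r, n, buf);
        if(last == buf) return boost::none; // range exhausted, no match
        for(typename Range::value_type* i = buf; i != last; ++i)
            if(p(*i)) return *i; // found, but the whole chunk was consumed
    }
}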

But what if I want to iterate over my stacked ranges with a plain old
for loop? Well, you could fill a vector with all the elements of the
range stack, but that would be wasteful.
Better is to add a buffer range at the bottom of the stack. When you
first access the beginning of your range stack, the buffer range uses
copy_n to fill its internal (fixed size) buffer from the range right
above it; it then returns elements from the buffer until it reaches
the end, at which point it fills the buffer again (here segmented
iterators may help reduce the overhead of having to check for the end
of the buffer at every step).

Using a small fixed buffer and refilling it periodically is better
than filling a whole vector because you keep the advantages of points
2 and 3 above.
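
A rough sketch of such a buffer range (all member names invented; the
value type is assumed default constructible):

// Rough sketch of a buffer range: it refills a small fixed buffer from
// the underlying range via copy_n and hands out plain pointers into it.
template<class Base, size_t K = 1024>
struct buffered_range {
    explicit buffered_range(Base b) : base(b) { fill(); }

    typename Base::value_type* begin() { return buf; }
    typename Base::value_type* end()   { return last; }

    // refill from the underlying range; false once it is exhausted
    bool fill() {
        size_t n = K;
        last = copy_n(base, n, buf);
        return last != buf;
    }

    Base base;
    typename Base::value_type buf[K];
    typename Base::value_type* last;
};

// Consuming the whole stack with a plain old loop:
//   buffered_range<Stack> br(stack);
//   do {
//       for(T* i = br.begin(); i != br.end(); ++i)
//           /* use *i */;
//   } while(br.fill());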

Now, a specific range can specialize copy_n to make it optimal.
This is a filter range's copy_n:

template<class OIter, class Pred, class Value>
void filter_helper(OIter& o, Pred& p, Value const& v) {
    if(p(v)) *o++ = v;
}

template<class Base, class Pred, class OIter>
OIter copy_n(filter_range<Base, Pred>& r, size_t& n, OIter o) {
    // pull from the base range until n elements have been consumed or
    // the base is exhausted; only elements satisfying the predicate
    // reach o
    while(n > 0 && r.base.begin() != r.base.end())
        copy_n(r.base, n, make_function_output_iterator(
            bind(&filter_helper<OIter, Pred, typename Base::value_type>,
                 ref(o), ref(r.p), _1)));
    return o;
}

map_iterator is really trivial and is left as an exercise for the reader.
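
(Or, to spoil the exercise, something along these lines; map_range's
members base and f are assumed names:)

// Sketch of a map (transform) range's copy_n: here n maps one-to-one
// onto base elements, so a single forwarding call is enough.
template<class OIter, class F, class Value>
void map_helper(OIter& o, F& f, Value const& v) {
    *o++ = f(v);
}

template<class Base, class F, class OIter>
OIter copy_n(map_range<Base, F>& r, size_t& n, OIter o) {
    copy_n(r.base, n, make_function_output_iterator(
        bind(&map_helper<OIter, F, typename Base::value_type>,
             ref(o), ref(r.f), _1)));
    return o;
}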

Everything works well as long as you have to iterate over a single
range. If you have to iterate over two ranges, you need to buffer one
range and perform copy_n on the other.

template<class OIter, class Iter, class Value>
void zip_helper(OIter& o, Iter& begin, Value const& v) {
    *o++ = make_tuple(*begin++, v);
}

template<class R1, class R2, class OIter>
OIter copy_n(zip_range<R1, R2>& r, size_t& n, OIter o) {
    // We actually want to make this a boost::array and wrap
    // everything in a while(n > 0)
    std::vector<typename R1::value_type> v(n);
    auto n2 = v.size();
    copy_n(r.base1, n2, v.begin());
    auto begin = v.begin();
    // assume both ranges have the same length
    copy_n(r.base2, n, make_function_output_iterator(
        bind(&zip_helper<OIter, decltype(begin), typename R2::value_type>,
             ref(o), ref(begin), _1)));
    return o;
}

Both zip and filter still carry considerable abstraction overhead, but
now all the loops are in the right order, and all redundant
computation should be factored out. The compiler should be able to do
a much better job at optimizing.

I'm sure I'm missing plenty of corner cases, and I'm pretty sure the
abstraction only works well for linear traversal. It doesn't try to
optimize random access, and I've completely ignored the possibility of
going backward. Also, by buffering ahead you lose a bit of laziness
(i.e. the user can no longer predict exactly which operations will be
performed), but because buffering is bounded, you retain the ability
to traverse (part of) infinite ranges.

So, what do you think? Does it have any chance of actually working in practice?

-- 
gpd
