

From: Steven Ross (spreadsort_at_[hidden])
Date: 2008-07-22 10:40:12


On Sat, Jul 12, 2008 at 11:20 AM, Phil Endecott <
spam_from_boost_dev_at_[hidden]> wrote:

> We want to move element A to the position currently occupied by element B.
> We could swap them, but moving B to A's position is probably not useful.
> So let's first move B to a good position for it, and then move A into its
> space. Of course some other element C will be occupying the position where
> we want to put B, so we have to keep going until we find an element that
> wants to be in the position where A was.
>
> The problem with that is that you might recurse a lot and fill your stack,
> so in practice you'd want to limit it to a few iterations and then give up
> and put a "wrong" value back at the source. Also you need to be certain
> that 'tmp' gets some sort of return value optimisation and isn't copied
> during each return. An explicitly unrolled iterative version would probably
> be best:
>
> Whether it actually improves performance will depend on quite how large the
> structs are. It certainly won't help with ints. So let's see your version
> that sorts structs!
>
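The cycle-following placement Phil describes, including the iteration cap to avoid chasing long chains, could be sketched roughly as follows. `target_of` is a hypothetical stand-in for "index where this element belongs" (in spreadsort that would be a bin lookup), and the cap of 8 is an arbitrary illustration, not a tuned value:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Pick up the element at 'start', then repeatedly swap the carried
// element into the slot where it belongs, picking up whatever was
// displaced.  The cycle closes when the carried element belongs at
// 'start'.  The chain is capped; on giving up, a "wrong" value is
// parked back at the source, as suggested above.
template <class T, class TargetOf>
void place_cycle(std::vector<T>& data, std::size_t start,
                 TargetOf target_of, std::size_t max_steps = 8)
{
    T carried = data[start];
    for (std::size_t step = 0; step < max_steps; ++step) {
        std::size_t dest = target_of(carried);
        std::swap(carried, data[dest]);  // place carried; pick up displaced
        if (dest == start)
            return;                      // cycle closed; everything placed
    }
    data[start] = carried;               // give up: park a "wrong" value
}
```

This is the iterative form of the recursive idea above: the temporary `carried` plays the role of `tmp`, and no per-step return value is involved, so there is nothing for the compiler to fail to optimize away.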
I found that an explicitly inlined swap was as fast as your 3-way swap and
multi-level loop (combined) on Intel systems for integers, but since the
3-way swap should be faster with larger data types, and the 3-way swap is
faster on my system, I'm running with it. I found that the primary
time-limiting steps are:
1) the random accesses used to find the "remote" element in swapping
2) bin lookups
The problem with a multi-way swap is that it risks wasting its last bin
lookup, and it performs no fewer bin lookups per item placed in the correct
position. The random accesses aren't avoided either; the main speedup is
from reducing the processing overhead of the actual copy operation, which is
minor with ints. Since I saw a small but tangible speedup with a 3-way swap
(but not with 4-way or higher), I ran with it.
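For concreteness, a 3-way swap in the sense used here could look like the rotation below; this is a generic sketch, not necessarily the exact code in the patch:

```cpp
#include <cassert>

// A 3-way swap is a left-rotation of three slots: one temporary and
// four copies, where two plain 2-way swaps would cost six copies.
// With ints the saving is negligible (matching the observation above);
// with larger structs the per-copy cost grows, so the saving does too.
template <class T>
inline void rotate3(T& a, T& b, T& c)
{
    T tmp = a;  // a's old value is the only one that must be saved
    a = b;
    b = c;
    c = tmp;
}
```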
I made some other speed enhancements that are possible with a multi-level
loop, including splitting up the bins into separate pointer and size arrays.
It would be nice if I could tell the processor to cache the sizes first,
then dump them from the core cache back to a lower-level cache and cache the
pointers, but I'm not aware of any way to do so. This splitting
optimization, in addition to
improving runtime, reduces memory overhead of the algorithm. I also think
it's more readable, as the meaning of the sizes doesn't change.
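The splitting idea might be sketched like this; the names (`Bins`, `make_bins`) are illustrative, not taken from the actual spreadsort source:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Instead of one array of {pointer, size} structs, keep two parallel
// arrays.  The sizes are computed once and stay read-only while
// swapping, whereas the cursors advance as elements are placed, so the
// two arrays have different access patterns and cache behavior, and
// the meaning of a size never changes.
struct Bins {
    std::vector<std::size_t> sizes;   // element count per bin (read-only)
    std::vector<int*>        cursors; // next free slot per bin (mutated)
};

inline Bins make_bins(int* base, const std::vector<std::size_t>& counts)
{
    Bins b;
    b.sizes = counts;
    b.cursors.reserve(counts.size());
    int* p = base;
    for (std::size_t n : counts) {
        b.cursors.push_back(p);  // each bin starts where the last ends
        p += n;
    }
    return b;
}
```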

I'm not aware of any other optimizations to try; I've made all of the ones
that worked, and dumped the ones that didn't.
I tried an explicit binary-search-style roughlog2, but found that it
actually hurt performance. I think its control structures are much slower
than a very simple while loop of at most 32 steps.
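The simple shift-count loop referred to here is presumably something along these lines (a sketch of the idea; the roughlog2 in the actual patch may differ in details such as its handling of zero):

```cpp
#include <cassert>

// Shift-count rough log2: at most 32 iterations of a trivial loop for a
// 32-bit value.  The loop body is a single shift and increment, which
// branch-predicts well, unlike a binary search full of comparisons.
inline unsigned rough_log_2(unsigned x)
{
    unsigned result = 0;
    while (x >>= 1)   // shift until the value is exhausted
        ++result;
    return result;    // floor(log2(x)) for x > 0; 0 for x == 0
}
```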

If anyone has suggestions for improvement, I'm open to them. Otherwise, I
will consider this the final version of the core algorithm, and if
acceptable, will move on to adding functor support, float_sort, and
(eventually) string_sort.
On my system I have seen a net 22% speed improvement relative to the first
version I posted; a significant portion of this is due to Phil's
suggestions, so thanks Phil!

Steve





Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk