|
Boost : |
From: Steven Ross (spreadsort_at_[hidden])
Date: 2008-06-28 19:46:51
Correction: I need the >> operator, not the << operator. It's used for
division. I apologize for the error.
First, Johan:
I would probably need to have three different methods:
integer_sort, float_sort, and string_sort. What I have right now is a C
non-templated version of integer_sort in the sourceforge Spreadsort project
, so I would start with a generic integer_sort. float_sort is only slightly
different.
2 and 3 could readily be handled by either adding >>, -, and < operators to
either the structure being sorted or a functor.
Phil:
You are correct, I do require a comparison operator too (<). I assumed that,
but it's good to make clear.
You are also correct that negative floats have to sorted differently. Also,
the IEEE float casting operation works on most OSes, but not all, so my
float_sort implementation would probably never make it to a standard. That
said, it would still be useful on most platforms, and it's very similar to
integer_sort, just requiring flipping the order of negative bins,
which should have a trivial performance impact.
I could sort fixed-point data, given a >> operator, which you should be able
to implement. Here is what is needed for a hybrid MSB radix sort and
introsort combo:
bool operator <(const T &) : for comparison
size_t operator -(const T & lhs, const T & rhs) : a - b > 0 -> b > a
Note: - will always be used such that lhs >= rhs inside the algorithm.
T operator >>(const T &) : to divide the values in a quick and efficient
manner that will preserve ordering (a >> n) < (b >> n) -> a < b.
Each item is allocated a bin number that corresponds to
(x >> n) - (minval >> n), where the algorithm picks an ideal n.
That's all that should be necessary for integer_sort based upon Spreadsort,
and for a cast float_sort.
string_sort is a different beast. I would probably specialize it to
strings and let the user specify a comparison operator initially,
adding more generality once it has proved its worse.
Yes, it should work well if there are long common substrings.
I'll leave that until after integer_sort.
As a little note Phil about lookups, if you keep the bin structure created
by Spreadsort around and don't delete any of the sub-bins, you can do
high-speed lookups using the bin structure that are much faster than binary
search (radix-based lookup, like a trie). Back when Spreadsort wasn't
cache-friendly and I split the data into n/8 bins, I found that doing lookups
using that bin structure on random data was 3-5X as fast as a binary search.
The idea works like this:
you have an array of bins over the range MIN to MAX
bin 0 points to the first value >= MIN
bin 1 points to the first value >= MIN + offset
bin 2 points to the first value >= MIN + 2*offset
...
bin n points to the first value >= MIN + n*offset
...
the last bin points to the first value >= MAX - offset
The you find the bin index for "x" by taking (x - MIN)/offset, and do a
binary search between that bin and the next bin.
This can be made a faster on integers by using the >> operator.
This kind of lookup will basically have one cache miss (into the big list of
bins), and then a binary_search inside of one bin, which doesn't normally
have many elements. If you use the Spreadsort-created bin structure,
you might be able to get away with zero cache misses and more evenly divided
bins, but you'll have more lookups. The most practical application would
probably be a flat lookup into n/C bins, where C is small enough that most
bins will fit in a memory page (256?).
But I digress. It sounds like there is interest in a generic integer_sort
and I will start working on one. After that float_sort should be simple.
string_sort will be a lot of work, but doable.
Steve
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk