Subject: Re: [boost] sorting library proposal (Was: Review Wizard Status Report for June 2009)
From: Steven Ross (spreadsort_at_[hidden])
Date: 2009-06-03 10:37:11


On Tue, Jun 2, 2009 at 11:02 PM, Jonathan Franklin <franklin.jonathan_at_[hidden]> wrote:
>
>
> Radix sorting is indeed appropriate, IMO. However, there are many
> "proven" radix sort algorithms, and (probably) many more "unproven"
> algorithms. Seems to me that boost is a set of "generally useful"
> libraries, and not a "proving ground" for new algorithms.
>
> The point in question is then: At what point is an algorithm
> adequately "proven" for inclusion in a boost library?
>

There are multiple papers in the literature about Adaptive Left Radix, which
is fast on random data (and like Spreadsort, it wasn't published a day ago;
it's been around for over six years). ALR has inferior worst-case performance
because it uses insertion_sort as its fallback sort, and worst-case
performance is the top reason to use Spreadsort. Is changing the fallback
from insertion_sort to std::sort such a big, complicated change that it
can't be accepted?
The other main change is a rough runtime estimate of worst-case performance,
with a fallback to std::sort when std::sort should be faster. With default
settings this doesn't do much on 32-bit integers, but on systems where
std::sort is better optimized, with larger data types (64-bit), or with a
smaller cache, it can improve performance.
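
To make the shape of that change concrete, here is a minimal sketch of the
hybrid structure; it is not Spreadsort itself, and the 64-element cutoff is
an arbitrary illustrative constant standing in for the runtime worst-case
estimate the library actually computes:

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Sketch only: one MSD radix pass per byte, recursing into each bin,
  // with std::sort as the comparison-based fallback for small bins.
  void msd_radix(uint32_t *first, uint32_t *last, int shift)
  {
      if (last - first <= 64 || shift < 0) {
          std::sort(first, last);  // the fallback in question
          return;
      }
      std::size_t count[256] = {0};
      for (uint32_t *p = first; p != last; ++p)
          ++count[(*p >> shift) & 0xFF];
      std::size_t offset[256];
      std::size_t sum = 0;
      for (int b = 0; b < 256; ++b) {
          offset[b] = sum;
          sum += count[b];
      }
      std::vector<uint32_t> buf(first, last);  // stable distribution pass
      for (std::size_t i = 0; i < buf.size(); ++i)
          first[offset[(buf[i] >> shift) & 0xFF]++] = buf[i];
      std::size_t begin = 0;
      for (int b = 0; b < 256; ++b) {  // recurse on each bin
          msd_radix(first + begin, first + begin + count[b], shift - 8);
          begin += count[b];
      }
  }
  // usage: msd_radix(v.data(), v.data() + v.size(), 24);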

Those are the biggest changes in this library relative to other modern
radix-sorting algorithms. I've already described the changes in string_sort
relative to American Flag Sort, which are similar, coming down to two core
changes: replacing insertion_sort with std::sort, and the identical-substring
optimization. I'm not strongly attached to the substring optimization,
though it seems to work well and handles a common near-worst-case
performance issue better. Not comparing the already-sorted portions of the
strings is a simple and effective performance tweak.
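
For illustration, the substring optimization amounts to something like the
comparator below (a sketch with assumed names, not the library's internals):
once every string in a bin is known to share its first `offset` characters,
the fallback comparison can start at that offset.

  #include <algorithm>
  #include <cstddef>
  #include <string>

  // Sketch: within a bin whose strings all share their first `offset`
  // characters, comparisons can safely skip that shared prefix.
  struct offset_less {
      std::size_t offset;
      bool operator()(const std::string &a, const std::string &b) const {
          return a.compare(offset, std::string::npos,
                           b, offset, std::string::npos) < 0;
      }
  };
  // usage: std::sort(bin_begin, bin_end, offset_less{offset});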

Going back to a higher level, it's easy to test a sorting algorithm for
correctness, and radix sorting isn't that hard to understand; the general
approach is not novel, just some optimizations and worst-case handling. It
would get expensive to publish a paper for every little tweak I've stuck in
a sorting algorithm, so depending on publications for every improvement is a
great way to slow down progress, especially if you demand that people cite
said publications.
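
As a concrete example of how cheap that testing is, a harness along these
lines (my_sort is a placeholder for whatever algorithm is under test)
catches most correctness bugs quickly:

  #include <algorithm>
  #include <cstdint>
  #include <random>
  #include <vector>

  // Sketch: verify a candidate sort against std::sort on random inputs.
  template <class Sorter>
  bool matches_std_sort(Sorter my_sort, int trials = 1000)
  {
      std::mt19937 gen(42);  // fixed seed, so failures are reproducible
      std::uniform_int_distribution<uint32_t> dist;
      for (int t = 0; t < trials; ++t) {
          std::vector<uint32_t> a(1000);
          for (uint32_t &x : a)
              x = dist(gen);
          std::vector<uint32_t> b = a;
          my_sort(a);  // algorithm under test
          std::sort(b.begin(), b.end());
          if (a != b)
              return false;
      }
      return true;
  }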

Also, Knuth has a good book on sorting, with an excellent analysis of pure
comparison-based algorithms. People seem to get the impression from his
book that comparison-based sorting is the only general way to sort, but he
does mention hybrid algorithms, including Markku Tamminen: Two Levels are
as Good as Any, J. Algorithms 6: 138-144 (1985), and the O(n) performance
on continuous integrable functions described on page 177 of The Art of
Computer Programming, Volume 3. The current Spreadsort uses smaller chunks
to be cache-friendly, because Tamminen's approach is too slow when n
exceeds the L1 cache size, and also because it cuts memory usage (as
described in the paper on ALR); that just multiplies the number of
iterations for continuous integrable functions by log(n)/log(estimated L1
cache size). Spreadsort's other main change is to use std::sort as a
fallback, because not all distributions are continuous and integrable,
along with the related fallback check that minimizes worst-case
performance, taking into account that std::sort can be faster on small
lists when the distribution range is large. (Tamminen's approach and ALR
both have poor absolute worst-case performance.)
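
To put a rough number on that factor (the figures here are illustrative,
not the library's tuning): with n = 2^24 elements and bins limited to about
c = 2^11 entries so the bin array stays L1-resident, log(n)/log(c) is
24/11, i.e. about three passes instead of one:

  #include <cmath>

  // Illustrative arithmetic only: if each pass splits into at most c
  // bins, sorting across a range of n distinct values takes roughly
  // log(n)/log(c) passes.
  int estimated_passes(double n, double c)
  {
      return static_cast<int>(std::ceil(std::log(n) / std::log(c)));
  }
  // estimated_passes(1 << 24, 1 << 11) == 3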

In closing, this library has:
insertion_sort -> std::sort
a worst-case fallback check (integer_sort/float_sort), falling back to
std::sort
an optimization for long identical substrings (string_sort)
a float-to-integer cast (float_sort; see the sketch below)
assorted performance tweaks
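
The float-to-integer cast mentioned above is the well-known IEEE 754 bit
trick; here is a minimal sketch (not necessarily float_sort's exact
handling of negatives or NaNs):

  #include <cstdint>
  #include <cstring>

  // Sketch: map a float's bits to an unsigned key whose integer order
  // matches the float order (NaNs excluded). Negative floats are stored
  // sign-magnitude, so flip all their bits; non-negative floats just get
  // the sign bit set so they sort above the negatives.
  uint32_t float_key(float f)
  {
      uint32_t u;
      std::memcpy(&u, &f, sizeof u);
      return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
  }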

Are those changes really so dangerous?

