Subject: Re: [boost] [sort] Re: [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-11-10 09:33:25
On Mon, Nov 10, 2014 at 9:07 AM, Vladimir Prus
> On 11/10/2014 04:54 PM, Steven Ross wrote:
>> I did an algorithm search in 2008 when preparing the boost submission.
>> ALR came out, fairly similar to my paper. I have not seen much
>> attention paid in academia to hybrid radix sorting.
> Okay, so assuming a third party were to implement a sorting algorithm for
> Why they would choose your paper over the 2008 paper?
What 2008 paper? ALR came out later in 2002.
The reason is better worst-case performance. Please see the
alrbreaker example for an example distribution which causes
conventional MSD radix sorts to perform very poorly; overhead starts
to dominate as they radix sort small chunks of data. This library
avoids that problem by switching to std::sort once it has broken up
the problem into small enough chunks that overhead will be a major
>>> >I'm worried about "similar" and "adaptation" above. If I see that
>>> > function
>>> >implements an algorithm from paper N, with 200 citations, then I can
>>> > assume
>>> >people reviewed the algorithm. If that function implements an
>>> > adaptation,
>>> >I'm not really sure what complexity I can expect.
>> Since you've read the overview, how about just looking through the
>> code and seeing if it works as described, or testing it? Can you ever
>> be sure a concrete implementation correctly applies an abstract
>> concept without looking at the implementation? These implementations
>> are fast for the same reason ALR and AFS are. They have superior
>> worst-case performance because they fall back to std::sort at a size
>> that optimizes worst-case performance. The only other difference is
>> that these algorithms scan to see if they can shrink the range each
>> iteration, which improves performance for non-worst-case distributions
>> with little impact on worst-case distributions.
> Which "these algorithms"? ALR and AFS? If so, it would mean they are better?
These implementations (integer_sort, float_sort, string_sort) scan to
see if they can shrink the range; integer_sort and float_sort find the
maximum and minimum using a fast scan across the chunk of data being
sorted, and MSD radix sort between the actual max and min. This
reduces the size of the problem in many common cases while adding only
minimal overhead in the worst case.
string_sort does something similar, checking to see if all characters
are identical, and if so, it skips on to the next character, only
sorting once it encounters an actual difference. This is useful in
cases such as hierarchical strings, where some chunks of the key are
all the same. Again, the overhead of this scan is minimal.