Boost :

Date view	Thread view	Subject view	Author view

Subject: Re: [boost] [sort] Re: [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-11-10 08:54:20

Next message: Jeremy Murphy: "Re: [boost] [sort] Re: [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19"
Previous message: Vladimir Prus: "Re: [boost] [sort] Re: [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19"
In reply to: Vladimir Prus: "Re: [boost] [sort] Re: [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19"
Next in thread: Vladimir Prus: "Re: [boost] [sort] Re: [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19"
Reply: Vladimir Prus: "Re: [boost] [sort] Re: [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19"

Vladimir,

> It's interesting, and probably would make a good addition to documentation.
> Though bzip appears to be phased out these days, replaced by xz, which does
> not
> use Burrowsâ€“Wheeler transformation.

Will do (in the development branch).

>> One quote from that article you cite: "Merge sort performs better than
>> radix sort for sorting keys of large sizes - such keys will be
>> required to accommodate the increasing cardinality of future
>> databases"
>> Merge sort is fundamentally less efficient than a smart hybrid-radix
>> approach when CPU sorting long keys because it has to compare the
>> entire length of the key up to the point of difference multiple times,
>> where the string_sort in this library only passes through each
>> character in the key once (or twice depending on how you count) during
>> the radix sorting portion, and then only compares what hasn't already
>> been passed through in the final comparison-sorting stage once the
>> problem has been cut down to less than a specific constant size.
>
>
> The problem is that it does not seem to me that Boost is a forum to
> discuss algorithm implementation. You propose a library that purports
> to do efficient sorting, based on 2002 paper. Were there no papers in
> subsequent 12 years with better algorithms?

I did an algorithm search in 2008 when preparing the boost submission.
ALR came out, fairly similar to my paper. I have not seen much
attention paid in academia to hybrid radix sorting.

>
>> Do you happen to have a copy of the full article? I'm happy to apply
>> any helpful concept to improve the library.
>
>
> The "full text" link at the top of that page works for me.

I hit a paywall at home, but accessed it at work. This is a
comparison of parallel LSD radix sort vs Mergesort. The conclusion is
that parallel LSD radix sort is slower with long keys. This is not
surprising. MSD hybrid radix sort (string_sort) does much better than
Mergesort or LSD radix sort with long keys. This library does not get
into parallel implementation issues; it can be used as a subsort
inside a parallel Mergesort, or you can split up the problem by MSD to
parallelize it and use this library to subsort.

>> With regards to classic algorithms, this library provides an
>> implementation similar to Adaptive Left Radix for integer_sort, which
>> came after it but was more widely available, with the addition of
>> better worst-case performance. It also provides an adaptation of
>> American Flag Sort (a classic algorithm) to C++ strings and with
>> better worst-case performance guarantees.
>
>
> I'm worried about "similar" and "adaptation" above. If I see that function
> whatever_sort
> implements an algorithm from paper N, with 200 citations, then I can assume
> enough
> people reviewed the algorithm. If that function implements an adaptation,
> then
> I'm not really sure what complexity I can expect.

Since you've read the overview, how about just looking through the
code and seeing if it works as described, or testing it? Can you ever
be sure a concrete implementation correctly applies an abstract
concept without looking at the implementation? These implementations
are fast for the same reason ALR and AFS are. They have superior
worst-case performance because they fall back to std::sort at a size
that optimizes worst-case performance. The only other difference is
that these algorithms scan to see if they can shrink the range each
iteration, which improves performance for non-worst-case distributions
with little impact on worst-case distributions.

Date view	Thread view	Subject view	Author view

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk