Boost logo

Boost :

Subject: Re: [boost] [review] Formal review period for Sort library continuing through November 19
From: Julian Gonggrijp (j.gonggrijp_at_[hidden])
Date: 2014-11-17 16:27:19


Steven Ross wrote:

>> I’m not convinced that string_sort, float_sort and integer_sort need to
>> be separate interfaces. I believe that the function to extract “digits”
>> from the values for the MSD part of the algorithm can be captured in a
>> fully generic way (that should be applicable to all radix-like sorting
>> algorithms), coupled with two additional, optional parameters (pre-
>> transform and post-transform) specifying functions for doing the
>> temporary “treat floats as integers” trick.
>
> string_sort is this fully generic approach. integer_sort and float_sort
> can be seen as specializations that allow bit-level granularity of the
> radix operations

Isn’t that a matter of defining indexing adapters for those types, taking
a template parameter for the bitfield size? I mean, I would expect any
radix-like sorting algorithm to take a function of type key -> index ->
digit as an optional parameter (defaulting to operator[] if not
provided). For floats and integers that function could be templated by
either the number of bits per digit or the number of desired buckets
(which is equivalent under exponentiation). Actually, that could be done
for strings and other bitstream-like types as well.

> and work well for keys that fit into a machine word.

Could you clarify the above remark? Why does the fully generic
string_sort not work well for keys that fit into a machine word?

> If
> you can think of some way to enable the sort-floats-as-ints trick as an
> option without harming performance for non-floats, I'd be very interested.

Default to the trivial (pre|post) transformation, and specialize the
algorithm that maps the transformations over the (input|output) range to
be a no-op for the trivial transformation. The compiler will eliminate
the function call.

>> 1. Value type can be treated as a sequence of digits/characters.
>
> Find me a type that you don't think this applies to,

For starters, any product type (tuple or struct) containing variable-
length subkeys. As an example, consider book records from a library
database that need to be ordered first by author and then by title. If
you insist on doing that with a (MSD) radix sort, you have two options:

 1. Use a stable radix sort and make multiple passes over the data. In
    most cases, this would be less efficient than just doing a single pass
    with a comparison sort, not to mention that spreadsort isn’t stable.
 2. Hack into the implementation details of the algorithm and recurse into
    the next key to be sorted by, before the buckets of the previous key
    are destroyed.

Case-agnostic string sorting is another example in which radix sorts
don’t excel (even though you can make it work if you extend the interface
considerably). This has been mentioned before.

In general, radix sorts only “shine” for types that can be injectively
mapped to the scalar bitfields while preserving order (even if you can
make it “work” for some other cases). Many common types and orders fit in
that category, but there are infinitely many types and orders with
different semantics.

> and I'll show you how
> it can. If std::sort can sort it, so can string_sort if you use the
> available functors properly. Maybe I should explain this in the
> documentation?

No, you should explicitly mention the opposite. Your algorithm has value,
not because it can be used for all imaginable scenarios but because it
has something to offer in particular use cases. No library needs to be
for everyone (not even std::sort, in fact). I really do think your
algorithm has value but you need to be very precise about what it can do
and what it cannot. Users will only appreciate your library more if you
do so.

>
> I will agreed to the extent that if performance isn't very critical
> relative to the code complexity of creating an efficient radix key
> expansion equivalent to the comparison operation, then people probably are
> better off with std::sort.
>
>> 2. Many keys to sort and/or very long keys.
>
> I agree, if most of your sorting runtime is going to spent on lists of <
> 1000 elements, this library is of little use, though it should also do
> little harm due to the fallback to std::sort. I'll edit the docs in
> develop to make that more clear.

To be honest I am sceptical that the algorithm even adds much value under
100K keys. So far, Frank Gennari has reported a 1.8x speedup for 100M
keys. Now, the effect probably would have been more dramatic if the
algorithm had been used for long strings rather than 32-bit integers, but
that’s still at 100M keys. The difference between an O(n log n) algorithm
and an O(nk) algorithm doesn’t grow particularly fast either (except for
very small n). Introsort/quicksort cache locality is also about as good
as MSD radix sort cache locality. I think the real added value of your
algorithm lies in the millions of keys and beyond (which is nothing to be
ashamed of!).

Please don’t take this as an attack, I really honestly believe that your
spreadshort library has value. I just think that you should present it
with a bit more modesty with regard to scope of application.

-Julian


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk