Boost logo

Boost :

Subject: Re: [boost] [review] Formal review period for Sort library continuing through November 19
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-11-17 21:46:39


Julian,

On Mon Nov 17 2014 at 4:27:34 PM Julian Gonggrijp <j.gonggrijp_at_[hidden]>
wrote:

Steven Ross wrote:
> > string_sort is this fully generic approach. integer_sort and float_sort
> > can be seen as specializations that allow bit-level granularity of the
> > radix operations
>
> Isn’t that a matter of defining indexing adapters for those types, taking
> a template parameter for the bitfield size? I mean, I would expect any
> radix-like sorting algorithm to take a function of type key -> index ->
> digit as an optional parameter (defaulting to operator[] if not
> provided). For floats and integers that function could be templated by
> either the number of bits per digit or the number of desired buckets
> (which is equivalent under exponentiation). Actually, that could be done
> for strings and other bitstream-like types as well.
>

integer_sort and float_sort assume they are operating on something that can
efficiently be treated as a single word. They use bitshifting and
subtraction to find the difference between the most significant bits of two
elements. The bitshifting paradigm is what makes those algorithms
difficult to generalize to very large keys.

>
> > and work well for keys that fit into a machine word.
>
> Could you clarify the above remark? Why does the fully generic
> string_sort not work well for keys that fit into a machine word?
>

It's not quite as efficient; it works 8 bits at a time instead of their 11,
and doesn't use subtraction to eliminate bits that are identical between
the min and max.

> > If
> > you can think of some way to enable the sort-floats-as-ints trick as an
> > option without harming performance for non-floats, I'd be very
> interested.
>
> Default to the trivial (pre|post) transformation, and specialize the
> algorithm that maps the transformations over the (input|output) range to
> be a no-op for the trivial transformation. The compiler will eliminate
> the function call.
>

I'll look to see if I can structure it that way without loss of efficiency
or major loss in readability.

> >> 1. Value type can be treated as a sequence of digits/characters.
> >
> > Find me a type that you don't think this applies to,
>
> For starters, any product type (tuple or struct) containing variable-
> length subkeys. As an example, consider book records from a library
> database that need to be ordered first by author and then by title. If
> you insist on doing that with a (MSD) radix sort, you have two options:
>
> 1. Use a stable radix sort and make multiple passes over the data. In
> most cases, this would be less efficient than just doing a single pass
> with a comparison sort, not to mention that spreadsort isn’t stable.
> 2. Hack into the implementation details of the algorithm and recurse into
> the next key to be sorted by, before the buckets of the previous key
> are destroyed.
>

Note: I am open to writing stable versions of these algorithms, potentially
with TimSort or some other efficient stable fallback sort. All that is
required for stability of the MSD radix portion is having a full copy of
the data (just like LSD radix sort uses). The library should be faster
without the complex cache-missing swaps the in-place version uses. That
said, here's another 2 options, the first of which is quite efficient:

3) If there is a character that is not allowed in the substrings (ideally
null character), then you can just return 0 when you hit the end of a
substring, and if the illegal character is not null, then shift any values
below that character up by 1 in the Get_char functor to fit this change
in. Embedded nulls usually are not a design requirement, in which case you
can just use null to signal the end of a substring.
4) If all your possible character values are accounted for, then you can
add an extra character every step to signal whether the substring is over.
Example returns from Get_char with this approach on a string with 2
substrings:
"""a"-> 0, 1, 'a', 0
"bc""d" -> 1, 'b', 1, 'c', 0, '1', 'd', '0'

If this is a serious issue, I can add substring end signaling into the API.

If you can define a strict weak ordering that is usable in std::sort, you
can define a synthetic radix as I've just given an example of.

Case-agnostic string sorting is another example in which radix sorts
> don’t excel (even though you can make it work if you extend the interface
> considerably). This has been mentioned before.
>

The functor example works efficiently in this case, as mentioned before.
It's significantly faster than std::sort using
boost::algorithm::ilexicographical_compare for case-insensitive sorting:
1.21s for 40MB of strings vs 2.56s for std::sort.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk