Subject: Re: [boost] [review] Formal review period for Sort library continuing through November 19
From: Julian Gonggrijp (j.gonggrijp_at_[hidden])
Date: 2014-11-18 15:34:16


Steven Ross wrote:

> integer_sort and float_sort assume they are operating on something that can
> efficiently be treated as a single word. They use bitshifting and
> subtraction to find the difference between the most significant bits of two
> elements. The bitshifting paradigm is what makes those algorithms
> difficult to generalize to very large keys.

Why is the bitshifting hard-coded into a separate version of the
algorithm, rather than wrapped in the Get_char functor parameter?

(By the way, please forgive me for continuing to fire such questions at you.
It’s just that your remarks raise questions which I think are
interesting. You are not obliged to defend your design, as far as I’m
concerned.)
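
To make the question concrete: I would expect the byte extraction itself
to be expressible as a Get_char-style functor, roughly like the sketch
below (the (value, offset) -> unsigned char signature is my assumption,
by analogy with the string case).

#include <cstddef>

// Just a sketch: assuming a (value, offset) -> unsigned char signature,
// analogous to the Get_char functor used for string keys.
struct uint_get_char {
    unsigned char operator()(unsigned int x, std::size_t offset) const {
        // byte number 'offset', counting from the most significant byte
        return static_cast<unsigned char>(
            x >> (8 * (sizeof(unsigned int) - 1 - offset)));
    }
};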

>> Could you clarify the above remark? Why does the fully generic
>> string_sort not work well for keys that fit into a machine word?
>
> It's not quite as efficient; it works 8 bits at a time instead of their 11,
> and doesn't use subtraction to eliminate bits that are identical between
> the min and max.

If I understand correctly, the subtraction is a trick to reduce the
number of buckets. Wouldn’t that be useful for strings as well? And why
couldn’t the number of bits per pass be made parametric (with sensible
defaults, of course)?
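
As I understand it, the subtraction trick amounts to something like the
following (a sketch; all the names here are made up):

#include <cstddef>

// Once min_value is subtracted, the high bits that all elements share
// vanish, so the top bits_per_pass bits of (x - min_value) suffice to
// pick a bucket.  range_bits is the bit width of (max - min).
inline std::size_t bucket_of(unsigned int x, unsigned int min_value,
                             unsigned int range_bits,
                             unsigned int bits_per_pass)
{
    unsigned int shift =
        range_bits > bits_per_pass ? range_bits - bits_per_pass : 0;
    return (x - min_value) >> shift;
}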

>>>> 1. Value type can be treated as a sequence of digits/characters.
>>>
>>> Find me a type that you don't think this applies to,
>>
>> For starters, any product type (tuple or struct) containing variable-
>> length subkeys. As an example, consider book records from a library
>> database that need to be ordered first by author and then by title. If
>> you insist on doing that with a (MSD) radix sort, you have two options:
>>
>> 1. Use a stable radix sort and make multiple passes over the data. In
>> most cases, this would be less efficient than just doing a single pass
>> with a comparison sort, not to mention that spreadsort isn’t stable.
>> 2. Hack into the implementation details of the algorithm and recurse into
>> the next key to be sorted by, before the buckets of the previous key
>> are destroyed.
>
> Note: I am open to writing stable versions of these algorithms, potentially
> with TimSort or some other efficient stable fallback sort. All that is
> required for stability of the MSD radix portion is having a full copy of
> the data (just like LSD radix sort uses). The library should be faster
> without the complex cache-missing swaps the in-place version uses.

It is up to you whether you add a stable version; I would vote to accept
your library without it, too. Note that while a full copy of the data may
speed up the algorithm, that does not necessarily mean that it is
suddenly a good idea to make multiple passes with it.
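
For reference, option 1 above amounts to the following pattern, with
std::stable_sort standing in for a stable radix sort and a made-up
record type:

#include <algorithm>
#include <string>
#include <vector>

struct book {
    std::string author;  // made-up record layout, for illustration only
    std::string title;
};

// Sort by the least significant key first, then stably by the more
// significant key, so books with equal authors keep their title order.
void sort_books(std::vector<book>& books)
{
    std::stable_sort(books.begin(), books.end(),
        [](const book& a, const book& b) { return a.title < b.title; });
    std::stable_sort(books.begin(), books.end(),
        [](const book& a, const book& b) { return a.author < b.author; });
}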

> That
> said, here's another 2 options, the first of which is quite efficient:
>
> 3) If there is a character that is not allowed in the substrings (ideally
> null character), then you can just return 0 when you hit the end of a
> substring, and if the illegal character is not null, then shift any values
> below that character up by 1 in the Get_char functor to fit this change
> in. Embedded nulls usually are not a design requirement, in which case you
> can just use null to signal the end of a substring.
> 4) If all your possible character values are accounted for, then you can
> add an extra character every step to signal whether the substring is over.
> Example returns from Get_char with this approach on a string with 2
> substrings:
> """a"-> 0, 1, 'a', 0
> "bc""d" -> 1, 'b', 1, 'c', 0, '1', 'd', '0'
>
> If this is a serious issue, I can add substring end signaling into the API.

On the contrary. Please don’t do any such invasive thing to your
algorithm just to add one class of use cases. Your algorithm works for a
large class of use cases out of the box, and you can make many people
happy with it as it is.
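
(For the record, approach 3 needs no change to the library at all; it is
an ordinary user-side pair of functors, roughly like the sketch below,
where the exact signatures string_sort expects are my guess from this
discussion.)

#include <cstddef>
#include <string>

struct record {
    std::string author;  // made-up example type with two substrings
    std::string title;
};

// Approach 3 as plain user-side functors (assuming no embedded nulls):
// both substrings are presented as one synthetic key, with 0 marking the
// end of the first substring.
struct record_get_char {
    unsigned char operator()(const record& r, std::size_t i) const {
        if (i < r.author.size())
            return static_cast<unsigned char>(r.author[i]);
        if (i == r.author.size())
            return 0;  // end-of-author marker
        return static_cast<unsigned char>(r.title[i - r.author.size() - 1]);
    }
};

struct record_get_length {
    std::size_t operator()(const record& r) const {
        return r.author.size() + 1 + r.title.size();
    }
};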

> If you can define a strict weak ordering that is usable in std::sort, you
> can define a synthetic radix as I've just given an example of.

You can invent all kinds of workarounds and invasive additions to account
for additional use cases. That doesn’t mean it would be a good idea.
There is no single refinement of spreadsort that will make it work for
*all* the use cases that every comparison sort handles out of the box.
I think it would be better to appreciate the elegance of the algorithm
for the use cases it easily supports than to try to approach the
generality of a comparison sort.

Apart from that, at this point it seems clear to me that you agree that
spreadsort (and radix sorting in general) does not work for product types
with variable-length subkeys *out of the box*, unlike (pure)
comparison sorts. I must emphasise that this is still just one example
that I could easily come up with; you should not assume that it is the
only exception. Since spreadsort does not work out of the box for all use
cases, your documentation should not claim that it does, either.

>> Case-agnostic string sorting is another example in which radix sorts
>> don’t excel (even though you can make it work if you extend the interface
>> considerably). This has been mentioned before.
>>
>
> The functor example works efficiently in this case, as mentioned before.
> It's significantly faster than std::sort using
> boost::algorithm::ilexicographical_compare for case-insensitive sorting:
> 1.21s for 40MB of strings vs 2.56s for std::sort.
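
(The functor in question, as I understand it, simply folds case on the
fly; roughly like this sketch, where the (string, offset) -> unsigned
char signature is my assumption.)

#include <cctype>
#include <cstddef>
#include <string>

// Case folding in the functor itself, so no uppercased copy of each key
// is needed; the signature here is assumed, not taken from the library.
struct case_insensitive_get_char {
    unsigned char operator()(const std::string& s, std::size_t i) const {
        return static_cast<unsigned char>(
            std::toupper(static_cast<unsigned char>(s[i])));
    }
};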

In that case I retract that example. As I said, I only quickly scanned
the previous discussion, so I hereby apologise for this oversight and any
other oversights I might have made.

-Julian

