Boost logo

Boost :

Subject: Re: [boost] [review] Formal review period for Sort library continuing through November 19
From: Julian Gonggrijp (j.gonggrijp_at_[hidden])
Date: 2014-11-19 17:45:50


Steven Ross wrote:

> You can bit-shift in Get_char; the main difference is the subtraction
> (which requires a scan through all data for the minimum and maximum).
> string_sort uses a fixed-size window based on the maximum possible range of
> the unsigned character Get_char returns, plus a bin for empties. It is
> explicitly designed and optimized for datatypes like the string, where
> there is a sequence of small data types and subtraction will provide little
> benefit, and where all characters at the same index are sometimes equal.
> People can bit-shift inside Get_char to extract chunks of data, but the
> subtraction optimization doesn't make sense when you're just sorting bytes,
> and only have 259 buckets (which easily fit in L1 cache).

Thank you for your explanation.

>>> 3) If there is a character that is not allowed in the substrings (ideally
>>> null character), then you can just return 0 when you hit the end of a
>>> substring, and if the illegal character is not null, then shift any
>> values
>>> below that character up by 1 in the Get_char functor to fit this change
>>> in. Embedded nulls usually are not a design requirement, in which case
>> you
>>> can just use null to signal the end of a substring.
>>> 4) [...]
>>
>> You can invent all kinds of workarounds and invasive additions to account
>> for additional use cases. [...]
>> I think it would be better to appreciate the elegance of the algorithm
>> for the use cases that it easily supports, than to try to approach the
>> generality of a comparison sort.
>>
>
> I think there is some form of miscommunication. string_sort, out of the
> box, can sort ANY data with a strict weak ordering that std::sort can,
> depending on how the user of the library defines their Get_char functor,
> just like they have to define a compare functor in these more complex use
> cases for both comparison-based algorithms and string_sort. You posed a
> problem, and I provided 2 solutions by changing the user's Get_char
> functor, one efficient one for the standard case where embedded nulls
> aren't an issue, and another that supports embedded nulls with some more
> effort. This technique can be extended to any std::sortable data type with
> a strict weak ordering, merely by reproducing the actual bytes the
> comparison function compares as returns from the Get_char function at
> sequential indices. There is no loss of generality with hybrid-radix
> sorting.
> What I will agree is that it becomes more complicated for the user to
> define a Get_char functor in addition to the compare functor for some use
> cases, and in some cases it won't be worth the bother.

Yes, perhaps there is a miscommunication. Regarding the author/title
example, I understand the part where a library user can refine the
Get_char function to return a special value for the end of a subkey.
However, it is unclear to me how that could be enough to ensure that all
books are first sorted by author and then by title. Author names may be
as short as 10 characters or as long as 50 characters. How is Get_char
supposed to ensure all by itself that all first characters of all book
titles are considered in the same radix pass, i.e. the one after the last
character of the longest author name?

I naturally assumed that the sorting algorithm itself would have to help
here, which would be invasive for your implementation of spreadsort. I
realise that you never confirmed that, though.

By contrast, it is very clear how refining the comparison function would
be sufficient for a comparison sort. Just compare by author; in case of a
tie, compare by title.

To the best of my knowledge, it is common wisdom that radix sorts are not
as general as comparison sorts. I can be wrong. I may have missed an
important (new?) insight. If that is the case, please explain to me how
refining the Get_char function is sufficient to deal with any strict weak
ordering (just repeating the claim does not help me understand).

-Julian


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk