|
Boost : |
Subject: Re: [boost] [review] Formal review period for Sort library continuing through November 19
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-11-18 21:26:52
259->257 buckets. Sorry for the typo.
On Tue Nov 18 2014 at 9:25:12 PM Steven Ross <spreadsort_at_[hidden]> wrote:
> On Tue Nov 18 2014 at 3:34:28 PM Julian Gonggrijp <j.gonggrijp_at_[hidden]>
> wrote:
>
>> > integer_sort and float_sort assume they are operating on something that
>> can
>> > efficiently be treated as a single word. They use bitshifting and
>> > subtraction to find the difference between the most significant bits of
>> two
>> > elements. The bitshifting paradigm is what makes those algorithms
>> > difficult to generalize to very large keys.
>>
>> Why is the bitshifting hard-coded into a separate version of the
>> algorithm, rather than wrapping it in the Get_char parameter function?
>
> >> Could you clarify the above remark? Why does the fully generic
>
> >> string_sort not work well for keys that fit into a machine word?
>> >
>> > It's not quite as efficient; it works 8 bits at a time instead of their
>> 11,
>> > and doesn't use subtraction to eliminate bits that are identical between
>> > the min and max.
>>
>> If I understand correctly, the substraction is a trick to reduce the
>> number of buckets. Wouldnât that be useful for strings as well? And why
>> wouldnât the number of bits per pass be parametric (with sensible
>> defaults of course)?
>>
>
> You can bit-shift in Get_char; the main difference is the subtraction
> (which requires a scan through all data for the minimum and maximum).
> string_sort uses a fixed-size window based on the maximum possible range of
> the unsigned character Get_char returns, plus a bin for empties. It is
> explicitly designed and optimized for datatypes like the string, where
> there is a sequence of small data types and subtraction will provide little
> benefit, and where all characters at the same index are sometimes equal.
> People can bit-shift inside Get_char to extract chunks of data, but the
> subtraction optimization doesn't make sense when you're just sorting bytes,
> and only have 259 buckets (which easily fit in L1 cache).
>
>
>> > 3) If there is a character that is not allowed in the substrings
>> (ideally
>> > null character), then you can just return 0 when you hit the end of a
>> > substring, and if the illegal character is not null, then shift any
>> values
>> > below that character up by 1 in the Get_char functor to fit this change
>> > in. Embedded nulls usually are not a design requirement, in which case
>> you
>> > can just use null to signal the end of a substring.
>> > 4) If all your possible character values are accounted for, then you can
>> > add an extra character every step to signal whether the substring is
>> over.
>> > Example returns from Get_char with this approach on a string with 2
>> > substrings:
>> > """a"-> 0, 1, 'a', 0
>> > "bc""d" -> 1, 'b', 1, 'c', 0, '1', 'd', '0'
>>
> > If you can define a strict weak ordering that is usable in std::sort, you
>> > can define a synthetic radix as I've just given an example of.
>>
>> You can invent all kinds of workarounds and invasive additions to account
>> for additional use cases. That doesnât mean it would be a good idea.
>> There is no single sophistication to spreadsort that will make it work
>> for *all* use cases that every comparison sort works out of the box for.
>> I think it would be better to appreciate the elegance of the algorithm
>> for the use cases that it easily supports, than to try to approach the
>> generality of a comparison sort.
>>
>
> I think there is some form of miscommunication. string_sort, out of the
> box, can sort ANY data with a strict weak ordering that std::sort can,
> depending on how the user of the library defines their Get_char functor,
> just like they have to define a compare functor in these more complex use
> cases for both comparison-based algorithms and string_sort. You posed a
> problem, and I provided 2 solutions by changing the user's Get_char
> functor, one efficient one for the standard case where embedded nulls
> aren't an issue, and another that supports embedded nulls with some more
> effort. This technique can be extended to any std::sortable data type with
> a strict weak ordering, merely by reproducing the actual bytes the
> comparison function compares as returns from the Get_char function at
> sequential indices. There is no loss of generality with hybrid-radix
> sorting.
> What I will agree is that it becomes more complicated for the user to
> define a Get_char functor in addition to the compare functor for some use
> cases, and in some cases it won't be worth the bother.
>
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk