Boost logo

Boost :

Subject: Re: [boost] [review] Formal review period for Sort library continuing through November 19
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-11-18 21:25:12


On Tue Nov 18 2014 at 3:34:28 PM Julian Gonggrijp <j.gonggrijp_at_[hidden]>
wrote:

> > integer_sort and float_sort assume they are operating on something that
> can
> > efficiently be treated as a single word. They use bitshifting and
> > subtraction to find the difference between the most significant bits of
> two
> > elements. The bitshifting paradigm is what makes those algorithms
> > difficult to generalize to very large keys.
>
> Why is the bitshifting hard-coded into a separate version of the
> algorithm, rather than wrapping it in the Get_char parameter function?

>> Could you clarify the above remark? Why does the fully generic

>> string_sort not work well for keys that fit into a machine word?
> >
> > It's not quite as efficient; it works 8 bits at a time instead of their
> 11,
> > and doesn't use subtraction to eliminate bits that are identical between
> > the min and max.
>
> If I understand correctly, the substraction is a trick to reduce the
> number of buckets. Wouldn’t that be useful for strings as well? And why
> wouldn’t the number of bits per pass be parametric (with sensible
> defaults of course)?
>

You can bit-shift in Get_char; the main difference is the subtraction
(which requires a scan through all data for the minimum and maximum).
string_sort uses a fixed-size window based on the maximum possible range of
the unsigned character Get_char returns, plus a bin for empties. It is
explicitly designed and optimized for datatypes like the string, where
there is a sequence of small data types and subtraction will provide little
benefit, and where all characters at the same index are sometimes equal.
People can bit-shift inside Get_char to extract chunks of data, but the
subtraction optimization doesn't make sense when you're just sorting bytes,
and only have 259 buckets (which easily fit in L1 cache).

> > 3) If there is a character that is not allowed in the substrings (ideally
> > null character), then you can just return 0 when you hit the end of a
> > substring, and if the illegal character is not null, then shift any
> values
> > below that character up by 1 in the Get_char functor to fit this change
> > in. Embedded nulls usually are not a design requirement, in which case
> you
> > can just use null to signal the end of a substring.
> > 4) If all your possible character values are accounted for, then you can
> > add an extra character every step to signal whether the substring is
> over.
> > Example returns from Get_char with this approach on a string with 2
> > substrings:
> > """a"-> 0, 1, 'a', 0
> > "bc""d" -> 1, 'b', 1, 'c', 0, '1', 'd', '0'
> > If you can define a strict weak ordering that is usable in std::sort, you
> > can define a synthetic radix as I've just given an example of.
>
> You can invent all kinds of workarounds and invasive additions to account
> for additional use cases. That doesn’t mean it would be a good idea.
> There is no single sophistication to spreadsort that will make it work
> for *all* use cases that every comparison sort works out of the box for.
> I think it would be better to appreciate the elegance of the algorithm
> for the use cases that it easily supports, than to try to approach the
> generality of a comparison sort.
>

I think there is some form of miscommunication. string_sort, out of the
box, can sort ANY data with a strict weak ordering that std::sort can,
depending on how the user of the library defines their Get_char functor,
just like they have to define a compare functor in these more complex use
cases for both comparison-based algorithms and string_sort. You posed a
problem, and I provided 2 solutions by changing the user's Get_char
functor, one efficient one for the standard case where embedded nulls
aren't an issue, and another that supports embedded nulls with some more
effort. This technique can be extended to any std::sortable data type with
a strict weak ordering, merely by reproducing the actual bytes the
comparison function compares as returns from the Get_char function at
sequential indices. There is no loss of generality with hybrid-radix
sorting.
What I will agree is that it becomes more complicated for the user to
define a Get_char functor in addition to the compare functor for some use
cases, and in some cases it won't be worth the bother.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk