|
Boost : |
Subject: Re: [boost] [review] Formal review period for Sort library continuing through November 19
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-11-19 18:46:26
Julian,
On Wed Nov 19 2014 at 5:46:01 PM Julian Gonggrijp <j.gonggrijp_at_[hidden]>
wrote:
> >>> 3) If there is a character that is not allowed in the substrings
> (ideally
> >>> null character), then you can just return 0 when you hit the end of a
> >>> substring, and if the illegal character is not null, then shift any
> >> values
> >>> below that character up by 1 in the Get_char functor to fit this change
> >>> in. Embedded nulls usually are not a design requirement, in which case
> >> you
> >>> can just use null to signal the end of a substring.
> >>> 4) [...]
> >>
> >> You can invent all kinds of workarounds and invasive additions to
> account
> >> for additional use cases. [...]
> >> I think it would be better to appreciate the elegance of the algorithm
> >> for the use cases that it easily supports, than to try to approach the
> >> generality of a comparison sort.
> >>
> >
> > I think there is some form of miscommunication. string_sort, out of the
> > box, can sort ANY data with a strict weak ordering that std::sort can,
> > depending on how the user of the library defines their Get_char functor,
> > just like they have to define a compare functor in these more complex use
> > cases for both comparison-based algorithms and string_sort. You posed a
> > problem, and I provided 2 solutions by changing the user's Get_char
> > functor, one efficient one for the standard case where embedded nulls
> > aren't an issue, and another that supports embedded nulls with some more
> > effort. This technique can be extended to any std::sortable data type
> with
> > a strict weak ordering, merely by reproducing the actual bytes the
> > comparison function compares as returns from the Get_char function at
> > sequential indices. There is no loss of generality with hybrid-radix
> > sorting.
> > What I will agree is that it becomes more complicated for the user to
> > define a Get_char functor in addition to the compare functor for some use
> > cases, and in some cases it won't be worth the bother.
>
> Yes, perhaps there is a miscommunication. Regarding the author/title
> example, I understand the part where a library user can refine the
> Get_char function to return a special value for the end of a subkey.
> However, it is unclear to me how that could be enough to ensure that all
> books are first sorted by author and then by title. Author names may be
> as short as 10 characters or as long as 50 characters. How is Get_char
> supposed to ensure all by itself that all first characters of all book
> titles are considered in the same radix pass, i.e. the one after the last
> character of the longest author name?
>
By alternating between returning payload radix characters and a "does this
field continue" byte. A 0 will specify an end of field, and sort lower in
the order than a 1, thus making all shorter names sort lower than names
that are equivalent to the same length but longer.
Thus, Author: Jon Title: bat will have GetChar output these values:
1,'J',1,'o',1,'n',0 ,1,'b',1,'a',1,'t',0
Author: Jonathan Title:(empty), will output these values:
1,'J',1,'o',1,'n',1,'a',1,'t',1,'h',1,'a',1,'n',0, 0
The 0 vs 1 will cause Jon to appear before Jonathan in order, because it is
shorter. Actually, the 0s and 1s aren't necessary for the last subkey (the
title), but I included them to show this approach.
There is also a simple GetLength functor that lets string_sort know when
all characters are exhausted.
>
> By contrast, it is very clear how refining the comparison function would
> be sufficient for a comparison sort. Just compare by author; in case of a
> tie, compare by title.
>
That's exactly what the Get_char method I explained is doing, but it's less
intuitive. The radix equivalent requires encoding the same information the
comparison function uses, but sometimes in a different byte format to
include factors such as sizes and signs. As I said, doable, but may not be
worth the effort.
> To the best of my knowledge, it is common wisdom that radix sorts are not
> as general as comparison sorts. I can be wrong. I may have missed an
> important (new?) insight. If that is the case, please explain to me how
> refining the Get_char function is sufficient to deal with any strict weak
> ordering (just repeating the claim does not help me understand).
>
> I agree with your statement that this is conventional wisdom.
2 major factors are included in this conventional wisdom:
1) Most Radix sorts perform poorly relative to comparison sorting for small
N, large K, or variable-length data. Hybrid-radix sorting resolves this
problem.
2) The mapping from a sorting requirement to a comparison function is more
intuitive and generally simpler than the mapping to a Get_char-type radix
implementation. With #1 limiting performance, and comparison-based sorts
being fast enough for most problems, I think most people just didn't put in
the effort to examine the possibility of generalized radix sorting.
string_sort provides this capability. It won't be optimal for many types
of problems, but it can be used for them and provides worst-case
performance guarantees.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk