Boost logo

Boost :

Subject: Re: [boost] [sort] Re: [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19
From: Vladimir Prus (vladimir_at_[hidden])
Date: 2014-11-10 08:19:15


Steven,

On 11/10/2014 02:46 PM, Steven Ross wrote:
>>> - What is your evaluation of the potential usefulness of the
>>> library?
>>
>>
>> I am not sure the library is useful. For very few users, sorting 100kb
>> arrays of integers is a performance
>> bottleneck. And if it's truly a performance bottleneck, one would probably
>> ponder cache architecture, or use of GPU, or
>> a zillion other tricks. Like this paper from 2010 appears to do, for
>> example:
>>
>> http://dl.acm.org/citation.cfm?doid=1807167.1807207
>
> With regards to practicality: I applied the string_sort algorithm to
> bzip and saw a 40% reduction in total compression time. Would you
> like to see the source for that?

It's interesting, and probably would make a good addition to documentation.
Though bzip appears to be phased out these days, replaced by xz, which does not
use Burrows–Wheeler transformation.

> There are problems where people sort billions of strings, integers,
> floats, or things that can be sorted like them, such as processing
> integrated circuit designs, where time is money.

Integrated circuit design appears to meet the "few users" criteria.
I doubt that companies who are in this business (such as my employer),
would benefit much from a Boost library.

> One quote from that article you cite: "Merge sort performs better than
> radix sort for sorting keys of large sizes - such keys will be
> required to accommodate the increasing cardinality of future
> databases"
> Merge sort is fundamentally less efficient than a smart hybrid-radix
> approach when CPU sorting long keys because it has to compare the
> entire length of the key up to the point of difference multiple times,
> where the string_sort in this library only passes through each
> character in the key once (or twice depending on how you count) during
> the radix sorting portion, and then only compares what hasn't already
> been passed through in the final comparison-sorting stage once the
> problem has been cut down to less than a specific constant size.

The problem is that it does not seem to me that Boost is a forum to
discuss algorithm implementation. You propose a library that purports
to do efficient sorting, based on 2002 paper. Were there no papers in
subsequent 12 years with better algorithms?

> Do you happen to have a copy of the full article? I'm happy to apply
> any helpful concept to improve the library.

The "full text" link at the top of that page works for me.

>>> And finally, every review should attempt to answer this question:
>>>
>>> - Do you think the library should be accepted as a Boost library?
>>
>>
>> I think the library should not be accepted. There's no reason to give Boost
>> approval to implementation of algorithm that
>> is neither accepted classic nor an obvious breakthrough.
>
> With regards to classic algorithms, this library provides an
> implementation similar to Adaptive Left Radix for integer_sort, which
> came after it but was more widely available, with the addition of
> better worst-case performance. It also provides an adaptation of
> American Flag Sort (a classic algorithm) to C++ strings and with
> better worst-case performance guarantees.

I'm worried about "similar" and "adaptation" above. If I see that function whatever_sort
implements an algorithm from paper N, with 200 citations, then I can assume enough
people reviewed the algorithm. If that function implements an adaptation, then
I'm not really sure what complexity I can expect.

> Did you read the overview section of the documentation that came with
> the library?

Yes, but I'm not sure it addresses my concerns.

-- 
Vladimir Prus
CodeSourcery / Mentor Embedded
http://vladimirprus.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk