|
Boost : |
Subject: Re: [boost] sorting library proposal (Was: Review Wizard Status Report for June 2009)o
From: Steven Ross (spreadsort_at_[hidden])
Date: 2009-06-02 10:53:13
The documentation is in the algorithm_sorting.zip archive:
libs/algorithm/sorting/index.html
I'm not sure how you found the graphs but no documentation.
Tests were run with the sample applications, built with bjam, with random
numbers generated by the randomgen sample.
I just uploaded spreadsort_paper.pdf to the boost vault, for those who don't
have access to the paper through their library.
I test it with the libs/algorithm/sorting/tune.pl script, with the
-test_only option. To make this work you have to copy
boost/algorithm/sorting and libs/algorithm/sorting from the zip archive to
their respective locations in the boost release tree, and then tune.pl will
work. You're welcome to look at how the tuning/testing script works and
make suggested improvements; it takes a weighted average of runtime, giving
more weight to more random distributions.
Performance comparisons are done relative to std::sort. If you can
recommend a better algorithm to compare against, I'd be happy to oblige.
I compare the resulting sorted files to std::sort to make sure that the
results are identical in tune.pl, on a set of large randomly-generated data
sets.
Normal results for integer_sort are around 70%-2X speed improvement. The
high end mentioned is when the data allows for bucketsorting, on some
systems. I have a mostly-sorted data test (1% of the data is swapped out of
place) in my samples, as that appears to be one of the the
worst-relative-cases vs. std::sort, a slight slowdown on some systems, and a
speedup on others.
If you're interested in a similar, also-published algorithm (published a few
months later by a different author), there is adaptive left radix. It
doesn't have the worst-case handling and uses insertion_sort, but follows
the same basic concept. http://www.nik.no/2002/Maus.pdf
string_sort is highly similar to the American National Flag algorithm, which
was established as the best in-place high-speed string sorting algorithm a
while ago. The main differences with string_sort are:
1) Using std::sort instead of insertion_sort for small data sets, allowing
it to use 256 as the fallback size instead of 16, cutting back on worst-case
memory and runtime overhead.
2) Because of the increased fallback size, it was efficient to eliminate
finding the maximum and minimum values, saving an O(n) run through the data
for every iteration.
3) An optimization for long identical substrings, passing over the data to
check if every element is identical, and if so, checking the next character,
etc., so as to avoid wasting time with unnecessary swap loop tests and
recursion. This is a common issue with real strings. This was added at the
request of Phil Endecott, and performs well on real strings.
4) In the fallback sort, only comparing characters that haven't been sorted
yet, making the comparison-based fallback more efficient.
float_sort uses a reasonably well-known trick that solves a major
performance problem with x86 processors, but hasn't been widely accepted. I
think it would be good to make this more readily available.
The concept here is to provide a group of templated hybrid algorithms to
solve common sorting problems faster than purely comparison-based algorithms
can.
With regards to the question of whether Boost is interested in innovative
algorithms, that's for the Boost community to decide; I don't know.
On Mon, Jun 1, 2009 at 9:54 PM, Vladimir Prus <vladimir_at_[hidden]>wrote:
> Ronald Garcia wrote:
>
>
> > Sorting
> > -------
> > :Author: Steven Ross
> >
> > :Review Manager: Needed
> >
> > :Download: `Boost Vault
> > :<
> http://www.boostpro.com/vault/index.php?action=downloadfile&filename=algorithm_sorting.zip
> > >`__
> >
> > :Description:
> > A grouping of 3 templated hybrid radix/comparison-based sorting
> > algorithms that provide superior worst-case and average-case
> > performance to std::sort: integer_sort, which sorts fixed-size data
> > types that support a rightshift (default of >>) and a comparison
> > (default of <) operator. float_sort, which sorts standard
> > floating-point numbers by safely casting them to integers.
> > string_sort, which sorts variable-length data types, and is optimized
> > for 8-bit character strings.
> >
> > All 3 algorithms have O(n(k/s + s)) runtime where k is the number of
> > bits in the data type and s is a constant, and limited memory
> > overhead
> > (in the kB for realistic inputs). In testing, integer_sort varies
> > from 35% faster to 8X as fast as std::sort, depending on processor,
> > compiler optimizations, and data distribution. float_sort is roughly
> > 7X as fast as std::sort on x86 processors. string_sort is roughly 2X
> > as fast as std::sort.
>
> The archive pointed at above does not seem to have any documentation at
> all. I though that having
> docs is a requirement for having a review at all? It is also unclear how
> the performance
> measurements were made and the graphs obtained -- what test program should
> be run, with what
> compiler and compiler options and on what system. At the very minimum, this
> should be clearly
> documented so that everybody can obtain the same graphs, if necessary.
>
> However, I have a bigger concern. How do we know these algorithms are
> indeed correct,
> and how they stand in relation to other sorting algorithms? Steven has
> provided a link
> to his article:
>
> http://portal.acm.org/citation.cfm?id=691537
>
> However, it does not seem that full text is available for this article, so
> it's not
> possible to know what other algorithms spreadsort was compared to. I could
> not
> find references to this article from newer ones either, so no comparison
> results
> from there. Is Boost supposed to be the place for reviewing new algorithms,
> as opposed
> to reviewing new implementations of otherwise established algorithms?
>
> - Volodya
>
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk