Subject: Re: [boost] [review] Formal review period for Sort library
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-11-11 20:25:20
Thanks for your feedback, Phil.
> Is a "faster" serial sort algorithm useful?
> In 2008, Luke Simonson said:
> anyone interested in runtime of an application that
> spends a lot of time sorting large data volumes is
> going to do it on an up-to-date machine, which means
> 4, 8, 16 or 32 cores
> I think std::sort is a straw man to compare yourself
> to, not because it isn't good at what it does, but
> because the target use case in question of very large
> data volumes is better suited to a parallel algorithm.
> At that time I didn't entirely agree as I had some embedded applications
> that could benefit from faster sorting of perhaps tens of thousands of
> items. But in the intervening five years, even phones have gained
> multiple CPUs. Today, if I needed faster sorting my first thought would
> be to parallelise it.
Sure, you can gain speedup by parallelizing, but if you can use half as
many cores to get the same speedup by using a faster subsort, then isn't
that a win?
On phones, the biggest issue is usually battery usage, which is why many of
them rarely use more than 1 CPU core at a time. On servers, frameworks
like MapReduce or other unshared memory techniques are commonly used to
distribute the problem in many smaller pieces across separate processes.
This library provides a faster subsort that will work with every standard
way to parallelize a CPU-based sorting problem except LSD radix sorting.
Parallel sorting can be valuable, but because of the range of potential
applications, the variations in memory, runtime, and CPU usage
constraints, and the differing levels of shared-memory parallelization,
the best solution will vary widely from application to application,
complicating any library that attempts to meet all of these needs. I'm
not fundamentally opposed to adding library calls for the specific case
of shared-memory parallel sorting, but it would add substantial
complexity and still not cover all parallel-sorting needs.
> Code Quality
> Back in 2008, the code had issues like calls to malloc() and free()
> and a lack of exception safety. No doubt it has improved, but perhaps
> any keen reviewers could look for any remaining issues.
There is now just a vector::resize of the bin_cache in the develop branch.
I looked at eliminating that by using (more) stack memory, but it slowed
down the library.
> Skipping common prefixes
> One undeniable complexity advantage of this algorithm compared to
> std::sort<string> is that it doesn't need to re-compare common
> string prefixes once it knows that they are equal. Intuitively,
> one would expect a substantial speedup from this change if the
> input contains many strings with common prefixes. But intuition
> is not always correct. I did some experiments using the pathnames
> of all the files on my computer as the test data; these certainly
> have many common prefixes e.g. /usr/local/include/boost/. I
> wrote a modified quicksort that would skip common prefixes (see
> http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh )
> and concluded that actually the intuition is wrong; comparing
> bytes can be done 8-at-a-time on a 64-bit system, and the cost of
> constantly re-comparing strings like /usr/local/include/boost/
> must be small. So the intuition that Steven's algorithm would
> be useful in applications where strings have long common prefixes
> should not be followed without first measuring it.
Highly compressible data is one such application, and it works well there
in bzip, but the lengths of the common prefixes were as much as hundreds
of bytes.
With more conventional data, you are correct, it isn't much of an
advantage. I added that feature more to minimize the disadvantage created
by otherwise doing a full radix sort on each byte, when nothing is
different, which is a common issue with radix sorts on strings (if there
are better solutions, I'm open to them). The end result was consistently
faster than comparison-based sorting when fed real data, including the
hierarchical strings I tested with, but had no relative advantage on short
hierarchies like the ones you describe compared to normal strings.
> Does anyone sort a vector<int> ?
> I don't think I've ever needed to sort a vector<int>; there is normally
> some associated data, e.g. I'm sorting a vector<struct> or vector<struct*>
> where one field of the struct is an int key. Or the key has multiple
> fields like an (x,y) pair, or there is some extra feature of the
> comparison, such as being case-insensitive.
> The proposed API does provide a reasonable way to sort these more complex
> containers, i.e. by providing additional functors that (for string_sort)
> return the nth character of the key string. But what we don't know is
> whether its performance benefits relative to introsort also apply in these
> cases: as far as I've seen, the performance results are all for simple
> vector<int>-like cases. One could easily imagine that the time taken for
> copying would dominate, diluting the algorithmic benefits.
If you run the tune.pl script that is in the base directory of the library,
you'll see the speedup on key+value sorts. They actually do fewer swaps
than a comparison-based algorithm because there are fewer iterations, so
large chunks of data sort faster relative to std::sort with this approach
than small chunks do. Some excerpts from my latest tune.pl run (with no
args) on Ubuntu:
std::sort time: 165.384112
std::sort time: 182.711132
std::sort time: 250.349607
Verifying integer_sort with separate key and data
std::sort time: 991.96294
Verifying string_sort with all functors
std::sort time: 311.68992
Verifying boost::sort on its custom-built worst-case distribution
std::sort time: 17.606037
I encourage all reviewers to run this script themselves to see how fast
this library is on their system (the README describes how to do this, or
you can just call ./tune.pl -small [-windows] from the libs/sort directory
and wait a minute).
A standard trick with a large data payload is to use a pointer to the data
to minimize the amount of memory that must be swapped, so sorting with a
really large data payload should be fairly unusual.