DESCRIPTION:
Spreadsort is a fast, general-case, in-place hybrid radix/comparison sorting algorithm; in my testing it tends to be 30-50% faster than std::sort.  Unlike many radix-based algorithms, it is designed around worst-case performance, and it performs better on chunky data (data that is clustered rather than widely distributed), so on real data it can perform substantially better than on random data.  Conceptually, Spreadsort can sort any data for which an absolute ordering can be determined.  Practically, Spreadsort as currently implemented can sort integers of any size (signed or unsigned) and floats cast to an integer.  To change the datatype being sorted, just change DATATYPE.
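
To illustrate what "floats cast to an integer" can look like, here is a minimal sketch of one well-known bit trick for turning a 32-bit IEEE-754 float into an integer key that sorts in the same order.  FloatToSortKey is a hypothetical helper, not part of Spreadsort, and the cast Spreadsort itself expects may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Spreadsort): map a 32-bit IEEE-754 float
// to an unsigned key whose integer order matches the float order.
inline std::uint32_t FloatToSortKey(float value) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  assert(value == value);  // NaN has no place in an absolute ordering
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof bits);  // well-defined copy of the raw bits
  // Negative floats: flip every bit, so larger magnitudes get smaller keys.
  // Non-negative floats: set the top bit, so they sort after all negatives.
  return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```

With this mapping -0.0 sorts just before +0.0; the keys can then be sorted as plain unsigned integers.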

REQUIREMENTS:
g++ is required to compile the source
perl is required to run tune.pl

TESTING:
spreadsort sorts an input file made up of binary int values.
firstdim sorts an input file of binary int pairs based upon one integer, then the other.
Modify Spreadsort.h to change the sorted data types.

TUNING:
If you can afford to let it run for a while, kick off this perl command:
./tune.pl -skip_verify
This will work to identify the ideal Constant.h settings for your system, testing on various distributions in a 40 million element (160MB) file.  If you're extremely performance-conscious and can afford to let it run for a long time, test with the file size you're most concerned with.
Otherwise, just use the options it comes with; they're decent.
With default settings, the script will take a few hours.
The tuning script will always verify that sorting is correct with its final settings, but -skip_verify means that it will skip this check during the tuning process, cutting runtime roughly in half.
On some systems (where ">" does not overwrite), it may be necessary to uncomment this line:
#system("rm -f $time_file");
which is the third line of CheckTime.
The -real option is for those who want to tune Spreadsort for its toughest situations only (and not some more common distributions), and/or want to use only real time values without weighting.  The -real option runs faster, but I recommend against using it for most purposes.  The default settings are probably better for most data.

PERFORMANCE:
Spreadsort is fast because it reduces the number of iterations relative to a comparison-based algorithm, thus reducing both the number of swaps and the number of comparisons.  A single iteration of Spreadsort is significantly slower than an iteration of introsort or quicksort, but since it cuts the data up into many more chunks, Spreadsort needs far fewer iterations and leaves much smaller pieces for a comparison-based algorithm to sort.  Spreadsort is in-place, so its memory usage is only slightly above what the input data already uses.
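
The idea of splitting into many bins and then handing small pieces to a comparison sort can be sketched as follows.  This is a deliberately simplified illustration, not Spreadsort itself: it does a single radix pass, uses a temporary buffer rather than sorting in place, and does not recurse.  HybridSortSketch and its max_splits parameter are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified illustration only: one radix pass on the high bits, then
// std::sort inside each bin.  NOT in-place and NOT recursive.
inline void HybridSortSketch(std::vector<std::uint32_t>& data, unsigned max_splits = 9) {
  assert(max_splits >= 1 && max_splits <= 16);
  if (data.size() < 2) return;
  const auto extremes = std::minmax_element(data.begin(), data.end());
  const std::uint32_t min_value = *extremes.first;
  const std::uint32_t range = *extremes.second - min_value;
  // Choose a shift so that (value - min_value) >> shift fits in max_splits bits.
  unsigned shift = 0;
  while ((range >> shift) >= (1u << max_splits)) ++shift;
  const std::size_t bin_count = static_cast<std::size_t>(range >> shift) + 1;
  // Count the elements in each bin, then turn the counts into start offsets.
  std::vector<std::size_t> bin_start(bin_count + 1, 0);
  for (std::uint32_t v : data) ++bin_start[((v - min_value) >> shift) + 1];
  for (std::size_t b = 0; b < bin_count; ++b) bin_start[b + 1] += bin_start[b];
  // Scatter every element into its bin in a temporary buffer.
  std::vector<std::uint32_t> sorted(data.size());
  std::vector<std::size_t> next(bin_start.begin(), bin_start.end() - 1);
  for (std::uint32_t v : data) sorted[next[(v - min_value) >> shift]++] = v;
  // Each bin covers a contiguous value range; finish it with a comparison sort.
  for (std::size_t b = 0; b < bin_count; ++b)
    std::sort(sorted.begin() + bin_start[b], sorted.begin() + bin_start[b + 1]);
  data.swap(sorted);
}
```
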
In recognition that not all OSes, compilers, and processors are the same, Spreadsort has tuning constants.  I recommend optimizing MAX_SPLITS for your page size (the default is 9 = log2(512), where 512 = 4096-byte page / 8-byte bin); the other tuning constants are less important, and are left at decent values.  Getting the right value of MAX_SPLITS can easily make a 10% performance difference.  The other tuning variables are mostly there to improve worst-case performance.
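
As a rough aid for picking a starting MAX_SPLITS, the page-size arithmetic above can be written out as a small helper.  SuggestedMaxSplits is a hypothetical name, and the page and bin sizes are inputs you would need to supply for your own system; tune.pl remains the way to find the value that actually performs best.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: floor(log2(page_bytes / bin_bytes)), i.e. the number
// of split bits whose bins fit in one page.
inline unsigned SuggestedMaxSplits(std::size_t page_bytes, std::size_t bin_bytes) {
  assert(bin_bytes > 0 && page_bytes >= bin_bytes);
  unsigned splits = 0;
  for (std::size_t bins = page_bytes / bin_bytes; bins > 1; bins >>= 1) ++splits;
  return splits;
}
```

For a 4096-byte page and 8-byte bins this returns 9, the default mentioned above.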

COMPILATION:
Spreadsort is written so that it can compile as a .c file in g++.  My testing finds that this makes a 10-20% performance difference in the resulting executable.  My best guess is that the C compiler in g++ is better at optimizing code because it's simpler.  My version of gcc can't compile a .c file that contains a std::sort call (or references), so this won't compile with gcc.
Just type "make spreadsort" to make a sample sorting application.  Also included is the ability to compare runtime (and results) vs. std::sort to verify correctness and speedup.

MULTIKEY SORTING:
Also included is a call "SpreadSortFirstDimension" which shows an example of how to sort multi-dimensional data using Spreadsort.  You will need to change DATASTRUCT and KEY_TYPE to be the data types you wish to use.  Type "make firstdim" to make a sample application, and modify GetFirst to get the keys with whatever flexibility or simplicity you find necessary.  For more complex sorting applications, you may need to modify XDATASTRUCT, YDATATSTRUCT and turn on that optimization (see comments in SpreadSort.c).
SpreadSortLastDimension can be used combined with SpreadSortFirstDimension, or used on its own to sort data based upon just one key.
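
When adapting the multikey code to your own data, it can help to have a slow but obviously correct reference ordering to compare against.  The sketch below sorts pairs by the first integer and breaks ties with the second, using std::sort.  IntPair, PairLess, SortPairsReference and AssertPairsSorted are hypothetical names; IntPair merely stands in for whatever you set DATASTRUCT to.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record type standing in for DATASTRUCT.
struct IntPair {
  int first;
  int second;
};

// Order by the first key; break ties with the second key.
inline bool PairLess(const IntPair& a, const IntPair& b) {
  if (a.first != b.first) return a.first < b.first;
  return a.second < b.second;
}

// Reference sort built on std::sort, for checking a multikey result.
inline void SortPairsReference(std::vector<IntPair>& pairs) {
  std::sort(pairs.begin(), pairs.end(), PairLess);
}

// Aborts (in debug builds) if the pairs are not in first-then-second order.
inline void AssertPairsSorted(const std::vector<IntPair>& pairs) {
  assert(std::is_sorted(pairs.begin(), pairs.end(), PairLess));
}
```

Calling AssertPairsSorted on the output of your multikey sort is a cheap check that both keys were honored.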

MAKING A LIBRARY:
Just run ./makeLib to make libSpreadSort.a

POSSIBLE IMPROVEMENTS:
A string version is possible, though it would need to be optimized and checked for worst-case a little differently.
An algorithm with some similarities to Spreadsort is extremely fast at external (hard drive) sorting.  I have not decided to open-source it at this time.
Spreadsort can be implemented to run efficiently in parallel, as it is a radix-based algorithm.
A static data structure could use the Bin data structure created during Spreadsort's sorting process to do much faster lookups into the array than a binary search.  This requires keeping the Bin structure around, which will increase memory overhead relative to the pure sorting algorithm.  It should get near-hash performance with much less memory overhead.  An efficient dynamic version of this data structure would look more like a trie or Judy Array.
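
The bin-based lookup idea might be sketched like this: keep one level of bin offsets over a sorted array, jump straight to the bin a key belongs to, and binary search only inside that bin.  This is an assumption about how such a structure could look, not existing Spreadsort code; BinIndex, BuildBinIndex and Contains are hypothetical names, and the index is rebuilt here from the sorted array rather than reused from the sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical static index over an already-sorted array.
struct BinIndex {
  std::uint32_t min_value = 0;
  unsigned shift = 0;
  std::vector<std::size_t> bin_start;  // bin b covers [bin_start[b], bin_start[b + 1])
};

inline BinIndex BuildBinIndex(const std::vector<std::uint32_t>& sorted, unsigned max_splits = 9) {
  assert(!sorted.empty() && std::is_sorted(sorted.begin(), sorted.end()));
  assert(max_splits >= 1 && max_splits <= 16);
  BinIndex index;
  index.min_value = sorted.front();
  const std::uint32_t range = sorted.back() - index.min_value;
  // Pick a shift so that the bin number fits in max_splits bits.
  while ((range >> index.shift) >= (1u << max_splits)) ++index.shift;
  const std::size_t bin_count = static_cast<std::size_t>(range >> index.shift) + 1;
  index.bin_start.assign(bin_count + 1, 0);
  // Count per bin, then prefix-sum the counts into start offsets.
  for (std::uint32_t v : sorted) ++index.bin_start[((v - index.min_value) >> index.shift) + 1];
  for (std::size_t b = 0; b < bin_count; ++b) index.bin_start[b + 1] += index.bin_start[b];
  return index;
}

inline bool Contains(const std::vector<std::uint32_t>& sorted, const BinIndex& index,
                     std::uint32_t key) {
  if (key < index.min_value) return false;  // below the minimum
  const std::size_t bin = (key - index.min_value) >> index.shift;
  if (bin + 1 >= index.bin_start.size()) return false;  // above the last bin
  // Binary search only within the one bin that can hold the key.
  return std::binary_search(sorted.begin() + index.bin_start[bin],
                            sorted.begin() + index.bin_start[bin + 1], key);
}
```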

OPTIMIZATIONS NOT DONE:
There are various further ways to optimize Spreadsort.  It is possible to keep only half the Bin data while recursing, saving a little memory, but I found this to make performance significantly worse on some systems.  It is also possible to iterate over the bins, but I found that a two-level loop doesn't optimize nearly as well.  It is also possible to use a pointer hack to avoid the divMin subtraction in the critical swap loop, but I highly recommend against this because it isn't general.
Finally, you can put the bins on the stack, as opposed to the heap, or reuse bin memory (making sure to set it to zero each time it is used) for recursive calls.  This could save memory allocation overhead, but would make the algorithm a little more complicated.  My testing shows no significant runtime overhead due to memory allocation, so I haven't done any of these.  If your malloc is inefficient, you may need to try some of these changes.
If you know the maximum and minimum before Spreadsort is called, you could skip the FindExtremes in the first iteration, but not any later iterations.

Improvements, suggestions, and bug reports are welcome at: spreadsort@gmail.com