
Boost : 
Subject: [boost] [move][sort] Asymptotically optimal stable and inplace merging and sorting with no auxiliary memory
From: Ion GaztaÃ±aga (igaztanaga_at_[hidden])
Date: 20160323 18:51:31
[Disclaimer, long email, thank you for your attention]
Hi to all,
Some months ago I tried to gain interest from sorting experts about
implementing stable and inplace merge [O(N)] or sort [O(NlongN)]
algorithms that don't need additional memory.
My main use case for these algorithms was lowering the complexity
guarantees of range insertion functions offered by flat associative
containers without any temporary allocation (the programmer does not
expect allocations when capacity() is big enough to hold the new
elements). On the other hand, I wanted those algorithms to be adaptive
so as to take advantage of the extra capacity already allocated by flat
containers.
Of course, if a good implementation is available in Boost, it could be
used as existing practice to change the worstcase complexities of some
STL algorithms when no additional memory (e.g. get_temporary_buffer())
is available.
After reviewing several algorithms from papers and some open source
implementations, I decided to implement one of those algorithms. The
implemented algorithm is based on the GrailSort algorithm developed by
Andrey Astrelin.
The main idea of the my adaptive_sort algorithm is explained by Andrey
in an article published by the russian collaborative blog Habrahabr
(http://habrahabr.ru/post/205290/). The algorithm is based on ideas
explained by Huang and Langston in their paper "Fast Stable Merging and
Sorting in Constant Extra Space (19891992"
(http://comjnl.oxfordjournals.org/content/35/6/643.full.pdf).
Since the implementation from (https://github.com/Mrrl/GrailSort) is
macrobased, does not support nonmemcopyable types and has a MIT
license (which seems to be problematic for Boost), I implemented the
algorithm from scratch following the article and taking ideas from
Andrey's implementation. I added some new ideas and optimizations in
order to reduce comparisons and elements moves.
I must say that the implementation was really hard for someone like me
without experience in sorting algorithms, the complexity is huge when
compared with a classic mergesort algorithm. The implementation is
already in develop and master branches but the develop branch contains
more tests and a refactored and more documented code. So please check
the code from the develop branch if interested. The implementation is
still experimental and not production ready.
What about benchmarks? I've only experimented in the oldest Visual
compilers to make sure it correctly supports Boost.Move emulation and
complexity guarantees were correct. Note that the STL algorithms tested
in the benchmark might not be as fast as current implementations.
In general, the new algorithms need much more elements moves than
algorithms that allocate O(n) memory, but they are still quite fast when
data movement is cheap (e.g. when a move constructor just copies a
couple of pointers).
These are the numbers for A VERY LIGHTWEIGHT object (a struct holding
two integers, one for the key, another one to test the stability, don't
take times very seriously, just pay attention to the number of
comparisons/moves).
//////////////////////////
//////////////////////////
STABLE MERGING
//////////////////////////
//////////////////////////
 N: number of elements
 NK: number of keys (0 means all keys different)
 InplaceMerge: std::inplace_merge
 Sqrt2AdaptMerge: adaptive_merge with an auxiliary buffer of 2*sqrt(N)
elements
 SqrtAdaptMerge: adaptive_merge with an auxiliary buffer of sqrt(N)
elements
 AdaptMerge: adaptive_merge with no external buffer
 Cmp: number of comparisons / N
 Cpy: number of move constructor/assignments / N
 (ratio): time of execution / time needed by InplaceMerge

Numbers produced by the bench_merge.cpp test in Boost.Move:

  N: 100001, NK: 511  
InplaceMerge Cmp: 0.9989 Cpy: 1.5000 633.65us ( 1.00)
Sqrt2AdaptMerge Cmp: 1.2429 Cpy: 4.4209 910.16us ( 1.44)
SqrtAdaptMerge Cmp: 4.2708 Cpy: 5.6712 1.63ms ( 2.57)
AdaptMerge Cmp: 5.6014 Cpy: 8.7573 3.11ms ( 4.91)
  N: 100001, NK: 2047  
InplaceMerge Cmp: 0.9998 Cpy: 1.5000 604.77us ( 1.00)
Sqrt2AdaptMerge Cmp: 1.2449 Cpy: 4.4483 934.83us ( 1.55)
SqrtAdaptMerge Cmp: 2.0131 Cpy: 4.7609 1.12ms ( 1.85)
AdaptMerge Cmp: 2.7608 Cpy: 11.1419 2.83ms ( 4.67)
  N: 100001, NK: 8191  
InplaceMerge Cmp: 0.9999 Cpy: 1.5000 653.51us ( 1.00)
Sqrt2AdaptMerge Cmp: 1.2443 Cpy: 4.3179 943.86us ( 1.44)
SqrtAdaptMerge Cmp: 1.4661 Cpy: 4.4348 998.62us ( 1.53)
AdaptMerge Cmp: 1.7692 Cpy: 10.6292 2.54ms ( 3.89)
  N: 100001, NK: 32767  
InplaceMerge Cmp: 1.0000 Cpy: 1.5000 772.06us ( 1.00)
Sqrt2AdaptMerge Cmp: 1.2417 Cpy: 4.4317 1.03ms ( 1.33)
SqrtAdaptMerge Cmp: 1.3320 Cpy: 4.4758 1.06ms ( 1.38)
AdaptMerge Cmp: 1.5295 Cpy: 10.1233 2.44ms ( 3.16)
  N: 100001, NK: 0  
InplaceMerge Cmp: 1.0000 Cpy: 1.5000 881.88us ( 1.00)
Sqrt2AdaptMerge Cmp: 1.2387 Cpy: 4.4124 1.16ms ( 1.31)
SqrtAdaptMerge Cmp: 1.2933 Cpy: 4.4412 1.20ms ( 1.36)
AdaptMerge Cmp: 1.4110 Cpy: 7.5794 1.93ms ( 2.19)
In general, with no memory and enough keys, the algorithm needs 7,5*N
data moves and 1,5*N comparisons. Data locality is quite high and with
cheapmoving types the overall time can be just twothree times slower.
With fewer keys, the algorithm might need upt to 5xN comparisons and
11xN moves. Number are improved if an auxiliary buffer of sqrt(N)
elements is available.
The number of comparisons and moves are even better than those shown in
experimental results from several papers, but I still find them quite
high. The algorithm needs some initial steps (e.g. unique key
extraction) that are not amortized during the merge (adaptive_sort
amortizes the initial steps much better ). I've derived the merge
algorithm from the sorting algorithm so maybe there is room for
improvement (I've invested much more time in the sorting algorithm).
//////////////////////////
//////////////////////////
STABLE SORTING
//////////////////////////
//////////////////////////
Numbers when sorting:
 N: number of elements
 NK: number of keys (0 means all keys different)
 MergeSort: A halfcopying mergesort (it only needs N/2 auxiliary
memory) implemented in boost/move/algo/detail/merge_sort.hpp. Based on
the paper "Fast mergesort implementation based on halfcopying merge
algorithm" by Cezary Juszczak
(http://kicia.ift.uni.wroc.pl/algorytmy/mergesortpaper.pdf). Usually
faster than std implementations with less auxiliary memory.
 StableSort: std::stable_sort
 HeapSort: std::heap_sort
 QuartAdaptSort: adaptive_sort with N/4 auxiliary memory
 Sqrt2AdaptSort: adaptive_sort with 2*sqrt(N) auxiliary memory
 SqrtAdaptSort: adaptive_sort with sqrt(N) auxiliary memory
 SqrtHAdaptSor: adaptive_sort with sqrt(N)/2 auxiliary memory
 AdaptSort: adaptive_sort with no auxiliary memory
 Cmp: number of comparisons / N
 Cpy: number of move constructor/assignments / N
 (ratio): time of execution / time needed by MergeSort
[numbers produced by the bench_sort.cpp test in Boost.Move]
  N: 100001, NK: 511  
MergeSort Cmp: 16.379 Cpy: 16.707 6.49ms ( 1.00)
StableSort Cmp: 20.594 Cpy: 22.978 10.85ms ( 1.67)
HeapSort Cmp: 16.992 Cpy: 22.098 10.98ms ( 1.69)
QuartAdaptSort Cmp: 17.659 Cpy: 26.314 8.34ms ( 1.29)
Sqrt2AdaptSort Cmp: 17.829 Cpy: 35.340 9.90ms ( 1.53)
SqrtAdaptSort Cmp: 27.805 Cpy: 56.059 19.48ms ( 3.00)
SqrtHAdaptSort Cmp: 27.356 Cpy: 60.375 20.35ms ( 3.14)
AdaptSort Cmp: 27.352 Cpy: 67.889 21.83ms ( 3.36)
  N: 100001, NK: 8191  
MergeSort Cmp: 16.394 Cpy: 16.725 7.47ms ( 1.00)
StableSort Cmp: 20.578 Cpy: 22.978 11.92ms ( 1.60)
HeapSort Cmp: 16.992 Cpy: 22.099 11.91ms ( 1.59)
QuartAdaptSort Cmp: 17.676 Cpy: 26.362 9.26ms ( 1.24)
Sqrt2AdaptSort Cmp: 17.847 Cpy: 35.747 10.93ms ( 1.46)
SqrtAdaptSort Cmp: 18.977 Cpy: 41.641 12.19ms ( 1.63)
SqrtHAdaptSort Cmp: 17.076 Cpy: 57.799 15.47ms ( 2.07)
AdaptSort Cmp: 17.090 Cpy: 67.179 17.37ms ( 2.33)
  N: 100001, NK: 32767  
MergeSort Cmp: 16.395 Cpy: 16.727 7.66ms ( 1.00)
StableSort Cmp: 20.581 Cpy: 22.986 12.16ms ( 1.59)
HeapSort Cmp: 16.994 Cpy: 22.099 11.96ms ( 1.56)
QuartAdaptSort Cmp: 17.674 Cpy: 26.319 9.47ms ( 1.24)
Sqrt2AdaptSort Cmp: 17.833 Cpy: 35.792 11.16ms ( 1.46)
SqrtAdaptSort Cmp: 18.967 Cpy: 41.604 12.42ms ( 1.62)
SqrtHAdaptSort Cmp: 17.065 Cpy: 57.756 15.71ms ( 2.05)
AdaptSort Cmp: 17.081 Cpy: 67.131 18.32ms ( 2.39)
  N: 100001, NK: 0  
MergeSort Cmp: 16.399 Cpy: 16.731 7.76ms ( 1.00)
StableSort Cmp: 20.521 Cpy: 22.920 12.23ms ( 1.58)
HeapSort Cmp: 16.991 Cpy: 22.094 11.85ms ( 1.53)
QuartAdaptSort Cmp: 17.671 Cpy: 26.380 9.50ms ( 1.22)
Sqrt2AdaptSort Cmp: 17.825 Cpy: 35.653 11.17ms ( 1.44)
SqrtAdaptSort Cmp: 18.954 Cpy: 41.545 12.42ms ( 1.60)
SqrtHAdaptSort Cmp: 17.058 Cpy: 57.694 15.66ms ( 2.02)
AdaptSort Cmp: 17.072 Cpy: 67.072 17.57ms ( 2.26)
In general, with no memory and enough keys, the algorithm needs nearly
optimal comparisons and significantly more (4x) data movements. Data
locality is quite high and with cheapmoving types the overall time can
be just twothree times slower. With fewer keys, the algorithm performs
worse but it's quite competitive.
When extra memory (e.g. sqrt(N)) is available, the algorithm is quite
fast with cheapmoving types.
////////////////////////////////////////////
Final words:
The effort to implement the algorithms was big and I couldn't fix many
open bugs in my libraries, so in the following months I don't expect to
invest much time on this. At least I wanted to publicly share the
committed implementation so that it can be improved by sorting experts.
I don't claim these algorithms are groundbreaking or even useful for
many programmers, but they might be a good starting point. I'm not happy
with the implementation as the algorithm is complicated and I' haven't
found a way to simplify it further without noticeable slowdowns.
For those interested in the implementation and the algorithm: please
contribute correct benchmarks and testing, as there are very few open
source implementations of similar algorithms. Any optimization, patch or
bug report is welcome.
Best,
Ion
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk