# Boost :

Subject: Re: [boost] [SORT] Parallel Algorithms
From: Francisco JosÃ© Tapia (fjtapia_at_[hidden])
Date: 2014-12-22 16:56:19

Hi Steven,

I had been working about the memory used by the different algorithms,
specially the stable algorithms. I developed a low memory algorithm
circular_stable_sort, which use only a 10% more of memory, and the speed
decrease around 15% with small objects and with big objects is similar to
the merge_sort

I was looking too, the TimSort algorithm. I found a C++ implementation in
https://github.com/gfx/cpp-TimSort .

I inserted this algorithm in the benchmark programs. With small objects,
the speed is slower than the stable sort and merge_sort, and have a good
performance with big objects, greater than 64 bits. It's logic the decision
of Java and Python, because they must order objects, usually non contiguous
and bigger than the naked number.

For the measure of the memory used by the program I use the command
/usr/bin/time -v program_to_execute. I take the benchmark_algorithm_04
(100000000 uint64_t numbers), and comment all the invocations to the
algorithms , except one, compile, and run in a terminal with the command
/usr/bin/time -v benchmark_sort_04.

I repeat this process with all the algorithms for to know the memory used
by each algorithm.

Algorithm

Memory used

std::sort

1 565 456

countertree::intro_sort

1 565 280

GCC parallel::sort

2 346 868

tbb::parallel_sort

1 565 696

countertree::parallel_sort

1 565 468

std::stable_sort

1 957 232

countertree::merge_sort

1 955 908

countertree::circular_merge_sort

1 696 316

sfx::timsort

1 996 364

parallel::stable_sort

2 742 384

countertree::parallel_merge_sort

1 956 636

I checked the time with benchmark_sort_05, with copy all the structure in
the operation = and in the copy constructor. The number of elements in each
sort is 800000000/size of the element in bytes

8

bytes

16

bytes

32

bytes

48

bytes

64

bytes

128

bytes

256

bytes

std::stable_sort

16.99

10.36

9.75

9.62

8.85

8.89

10.25

countertree::merge_sort

16.96

10.38

8.13

7.44

7.10

6.67

9.01

countertree::circular_merge_sort

19.77

12.80

8.81

7.61

7.01

6.48

8.62

timsort

22.89

13.43

10.12

8.33

7.70

7.00

8.62

parallel::stable_sort

9.68

8.91

8.90

8.90

8.23

8.21

7.72

countertree::parallel_merge_sort

7.04

5.96

5.74

5.70

5.64

5.54

5.94

About the stable sort of tbb, I see, but I didn't have time to examine and
check in deep. But be quiet. The benchmark_sort indirect shows

Size of element 48 Number of elements :16666666

parallel_sort :2.53964 sec

indirect_parallel_sort :5.70979 sec

parallel_stable_sort :5.73186 sec

indirect_parallel_stable_sort :5.59269 sec

Size of element 64 Number of elements :12500000

parallel_sort :2.59812 sec

indirect_parallel_sort :4.1553 sec

parallel_stable_sort :5.63771 sec

indirect_parallel_stable_sort :3.95266 sec

Size of element 128 Number of elements :6250000

parallel_sort :2.28648 sec

indirect_parallel_sort :2.34792 sec

parallel_stable_sort :5.62879 sec

indirect_parallel_stable_sort :2.30188 sec

Size of element 256 Number of elements :3125000

parallel_sort :2.05017 sec

indirect_parallel_sort :1.50661 sec

parallel_stable_sort :5.85409 sec

indirect_parallel_stable_sort :1.48561 sec

As you can see with a 256 bytes of size the indirect parallel_stable_sort
is 3 times faster. And the indirect_sort is a 25% faster.

The final solution, if at end is taken, will be a hybrid algorithm
depending of the size of the elements

I propose you the next. Let me 2 or 3 weeks and I will prepare a big
benchmark with all the algorithms, including the indirect version of the
stable and unstable sort. We will run the benchmarks over different
machines, and according the results, take a decision about if we propose
something to Boost, and which algorithm must be used in the hybrid version.

If, at end, we decide don't propose nothing, I only can say this had been
a nice and exciting experience

Mi intention is not to impose my code, I want provide to the final users
the best solution, if exist. With my code or with the code of other. I am
the first to clap the winner.

We are doing this because this parallel implementation don't need any
other library, only the standard. There are compilers without OpenMP or TBB
support, or simply, the user don't want to load them for to make a sort.

I have in mind, several improvements of the algorithms. I will codify and
test, and if run well, I include in the algorithms. I want, too, simplify
the interface of indirect_sort, for to be used in a easy way by any sort
algorithm. If you are interested, I will prepare too a brief document with
the explanation and the internal algorithm.

At end, only say Happy Christmas, and New Year to the Boost community.

Yours

Francisco Tapia

fjtapia_at_[hidden]