Boost logo

Boost :

Subject: Re: [boost] [SORT] Parallel Algorithms
From: Francisco José Tapia (fjtapia_at_[hidden])
Date: 2014-12-22 17:04:08


 I forget to mention the Windows version of the benchmark. I supouse, they
results will be similar to the obtained in Linux. I will prepare, too

With all the information, we can decide the best solution if exist.

Thanks

Francisco Tapia

2014-12-22 22:56 GMT+01:00 Francisco José Tapia <fjtapia_at_[hidden]>:

> Hi Steven,
>
> I had been working about the memory used by the different algorithms,
> specially the stable algorithms. I developed a low memory algorithm
> circular_stable_sort, which use only a 10% more of memory, and the speed
> decrease around 15% with small objects and with big objects is similar to
> the merge_sort
>
> I was looking too, the TimSort algorithm. I found a C++ implementation
> in https://github.com/gfx/cpp-TimSort .
>
> I inserted this algorithm in the benchmark programs. With small objects,
> the speed is slower than the stable sort and merge_sort, and have a good
> performance with big objects, greater than 64 bits. It's logic the decision
> of Java and Python, because they must order objects, usually non contiguous
> and bigger than the naked number.
>
> For the measure of the memory used by the program I use the command
> /usr/bin/time -v program_to_execute. I take the benchmark_algorithm_04
> (100000000 uint64_t numbers), and comment all the invocations to the
> algorithms , except one, compile, and run in a terminal with the command
> /usr/bin/time -v benchmark_sort_04.
>
> I repeat this process with all the algorithms for to know the memory
> used by each algorithm.
>
>
> Algorithm
>
> Memory used
>
> std::sort
>
> 1 565 456
>
> countertree::intro_sort
>
> 1 565 280
>
> GCC parallel::sort
>
> 2 346 868
>
> tbb::parallel_sort
>
> 1 565 696
>
> countertree::parallel_sort
>
> 1 565 468
>
> std::stable_sort
>
> 1 957 232
>
> countertree::merge_sort
>
> 1 955 908
>
> countertree::circular_merge_sort
>
> 1 696 316
>
> sfx::timsort
>
> 1 996 364
>
> parallel::stable_sort
>
> 2 742 384
>
> countertree::parallel_merge_sort
>
> 1 956 636
>
>
>
> I checked the time with benchmark_sort_05, with copy all the structure
> in the operation = and in the copy constructor. The number of elements in
> each sort is 800000000/size of the element in bytes
>
>
>
>
> 8
>
> bytes
>
> 16
>
> bytes
>
> 32
>
> bytes
>
> 48
>
> bytes
>
> 64
>
> bytes
>
> 128
>
> bytes
>
> 256
>
> bytes
>
> std::stable_sort
>
> 16.99
>
> 10.36
>
> 9.75
>
> 9.62
>
> 8.85
>
> 8.89
>
> 10.25
>
> countertree::merge_sort
>
> 16.96
>
> 10.38
>
> 8.13
>
> 7.44
>
> 7.10
>
> 6.67
>
> 9.01
>
> countertree::circular_merge_sort
>
> 19.77
>
> 12.80
>
> 8.81
>
> 7.61
>
> 7.01
>
> 6.48
>
> 8.62
>
> timsort
>
> 22.89
>
> 13.43
>
> 10.12
>
> 8.33
>
> 7.70
>
> 7.00
>
> 8.62
>
> parallel::stable_sort
>
> 9.68
>
> 8.91
>
> 8.90
>
> 8.90
>
> 8.23
>
> 8.21
>
> 7.72
>
> countertree::parallel_merge_sort
>
> 7.04
>
> 5.96
>
> 5.74
>
> 5.70
>
> 5.64
>
> 5.54
>
> 5.94
>
>
> About the stable sort of tbb, I see, but I didn't have time to examine
> and check in deep. But be quiet. The benchmark_sort indirect shows
>
> Size of element 48 Number of elements :16666666
>
> parallel_sort :2.53964 sec
>
> indirect_parallel_sort :5.70979 sec
>
> parallel_stable_sort :5.73186 sec
>
> indirect_parallel_stable_sort :5.59269 sec
>
> Size of element 64 Number of elements :12500000
>
> parallel_sort :2.59812 sec
>
> indirect_parallel_sort :4.1553 sec
>
> parallel_stable_sort :5.63771 sec
>
> indirect_parallel_stable_sort :3.95266 sec
>
> Size of element 128 Number of elements :6250000
>
> parallel_sort :2.28648 sec
>
> indirect_parallel_sort :2.34792 sec
>
> parallel_stable_sort :5.62879 sec
>
> indirect_parallel_stable_sort :2.30188 sec
>
> Size of element 256 Number of elements :3125000
>
> parallel_sort :2.05017 sec
>
> indirect_parallel_sort :1.50661 sec
>
> parallel_stable_sort :5.85409 sec
>
> indirect_parallel_stable_sort :1.48561 sec
>
> As you can see with a 256 bytes of size the indirect
> parallel_stable_sort is 3 times faster. And the indirect_sort is a 25%
> faster.
>
> The final solution, if at end is taken, will be a hybrid algorithm
> depending of the size of the elements
>
>
> I propose you the next. Let me 2 or 3 weeks and I will prepare a big
> benchmark with all the algorithms, including the indirect version of the
> stable and unstable sort. We will run the benchmarks over different
> machines, and according the results, take a decision about if we propose
> something to Boost, and which algorithm must be used in the hybrid version.
>
> If, at end, we decide don't propose nothing, I only can say this had
> been a nice and exciting experience
>
> Mi intention is not to impose my code, I want provide to the final users
> the best solution, if exist. With my code or with the code of other. I am
> the first to clap the winner.
>
> We are doing this because this parallel implementation don't need any
> other library, only the standard. There are compilers without OpenMP or TBB
> support, or simply, the user don't want to load them for to make a sort.
>
>
> I have in mind, several improvements of the algorithms. I will codify
> and test, and if run well, I include in the algorithms. I want, too,
> simplify the interface of indirect_sort, for to be used in a easy way by
> any sort algorithm. If you are interested, I will prepare too a brief
> document with the explanation and the internal algorithm.
>
> At end, only say Happy Christmas, and New Year to the Boost community.
>
> Yours
>
>
> Francisco Tapia
>
> fjtapia_at_[hidden]
>


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk