Subject: Re: [boost] [SORT] Parallel Algorithms
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-12-18 22:14:41
On Tue Dec 16 2014 at 1:42:59 PM Francisco José Tapia <fjtapia_at_[hidden]> wrote:
> Hi Steven,
> I thought the whole program had been checked, but obviously not. Now in
> the folder you have a shell script with the command to compile and link
> with optimization (for example make_benchmark_sort_03.sh), and a
> make_all.sh which compiles everything. Open a terminal in the folder,
> then compile and run.
Great! Thanks. I ran everything, and the only failure was with
Size of element: 8  Number of elements: 100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)
> As you suggested, I changed the number generator and now use a Mersenne
> Twister from the standard library.
> Stable sort is used less often than non-stable sort, but it is sometimes
> important. I developed it because my project has a proxy which receives
> the operations on elements identified by a key. The proxy sorts the
> operations and sends them to the corresponding thread, and it is not the
> same to read the value of a key and then delete that key as to delete
> the key and then try to read the value. I can't add a time mark to know
> which arrived first.
Fair enough.
> As I said in the previous message, the stable sort needs 2.7 secs with
> 1 thread and 8.49 secs with 4 HW threads, around 3 times the time of
> 1 thread. And countertree::merge_sort needs 2.7 secs to sort with 1
> thread and 4.89 secs with 4 HW threads (see the [Original
> test_stress_sort.cpp]). This shows countertree::merge_sort is good for
> use in parallel software, and GCC std::stable_sort is not.
The numbers I see are not as extreme, but your merge_sort looks good in
comparison, especially with large elements. I will investigate
alternatives to see if this is a truly speed-competitive offering.
Have you considered Timsort (standard in Java and Python)? Timsort is a
stable sort that is particularly good with sorted, mostly-sorted, and
reverse-sorted data. It might be worth offering as an alternative stable
sort for those concerned about mostly-sorted data.
Your intro_sort looks reasonable overall, but it varies from comparable
speed to 14% slower when run in parallel on random data, and is an order of
magnitude slower on already-sorted data. As tbb is free, I just don't see
a compelling case for your introsort implementation; tbb is a very hard
competitor to beat.
> The GCC std::sort and countertree::intro_sort have similar performance
> for all element sizes. But in the parallel process, when the size of
> the elements grows, the poor scalability appears. The next lines are
> from the original benchmark_sort_05.cpp on my computer:
> With 128 bytes elements
> Random elements, few repeated -----------------------
> STL std::sort :2.9188 secs
> countertree::intro_sort :2.87864 secs
> parallel::sort :2.46113 secs
> tbb::parallel_sort :1.86458 secs
> countertree::parallel_sort:1.89263 secs
> With 256 bytes elements
> Random elements, few repeated ( rand() )-----------------------
> STL std::sort :3.03143 secs
> countertree::intro_sort :3.02733 secs
> parallel::sort :2.23231 secs
> tbb::parallel_sort :1.76186 secs
> countertree::parallel_sort:1.56509 secs
> As we can see, the parallel process is ineffective with GCC
> parallel::sort on big objects.
Those don't match up with my results for 256-byte elements:
STL std::sort :1.04718 secs
countertree::intro_sort :1.14214 secs
parallel::sort :0.679178 secs
tbb::parallel_sort :0.600001 secs
And this is what I see for size 64:
STL std::sort :1.6577 secs
countertree::intro_sort :1.73023 secs
parallel::sort :0.963199 secs
tbb::parallel_sort :0.873805 secs
I'm not sure the benefits you are seeing are portable.
> About the merge_sort algorithm: it uses additional memory equal to half
> of that used by the data. But if you don't use the additional memory,
> the speed drops. I didn't examine in depth the algorithm used in GCC; I
> only coded my own algorithm. I will examine it, and I will try to reduce
> the memory used without losing speed and scalability.
The standard algorithm is expected to lose speed when it reduces the
memory used; lower memory use would be a nice feature to have, but if we
can repeatably see 2X relative speedups without that feature, I think
some people may find it useful.
> About a server for testing: I don't have modern servers at my school;
> the budget cuts eliminated the new servers. I have an i7 Xeon, but it is
> a 3-year-old machine with 4 HW threads, and in the benchmarks it is
> slower than the i3 and i5 machines assembled this year. If someone can
> run the tests and provide us the results, I will be grateful. GCC 4.9.1
> has better optimization than 4.8.2. If you are using Ubuntu, version
> 14.10 ships 4.9.1 as the default.
A Boost library should still be useful for people at least one compiler
version back; I feel fully justified testing on 4.8.2. Have you tested on
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk