Boost logo

Boost :

Subject: Re: [boost] [SORT] Parallel Algorithms
From: Francisco José Tapia (fjtapia_at_[hidden])
Date: 2014-12-16 13:42:45


Hi Steven,

 I thought all the program was checked, but obviously not. Now in the
folder you have a shell script with the command for to compile and link
with optimization ( by example make_benchmark_sort_03.sh), and a
make_all.sh which compile all. Open a terminal in the folder, and compile
and run

 All the programs use the HW thread and use all available. All the programs
had been done for to run in a Linux machine with 4 G of RAM. The number of
element involved in the sort is defined at the beginning of the program
with the #define NELEM. You can change this value, and compile again.

 As suggested me, change the number generator and now use a Mersenne of the
standard library.

 The stable sort is less used than the non stable sort, but sometimes is
important. I developed because in my project have a proxy , which receives
the operations over the elements defined by a key. The proxy sort, and send
the operations to the corresponding thread, and it is not the same, read
the value of a key and after delete that key, than delete the key and after
try to read the value. I can't add a time mark for to know which arrived
before

  In order to clarify the information obtained from the benchmarks :

 In the folder Modified the GCC parallel:sort and parallel:merge_sort, and
the countertree::parallel_sort and countertree::parallel_merge_sort use the
std::sort and the std::merge_sort. The utility of this folder is the
comparison with the same programs of the folder Original, the result show
the problems described in this message.

 In the folder Original the GCC parallel:sort and parallel:merge_sort, use
the std::sort and the std::merge_sort., and the countertree::parallel_sort
use countertree::intro_sort and countertree::parallel_merge_sort use
countertree::merge_sort.

 SCALABILITY : Some algorithms don't scale well when the number of threads
grow. Increasing the number of threads don't decrease the execution time.
This lack of scalability is emphasized when the size of the elements to
sort grows

 As say in the previous message, the stable sort, with 1 thread need 2.7,
and with 4 HW threads need 8.49, around 3 times the time of 1 thread. And
the countertree::merge_sort need 2.7 for to sort with 1 thread and with 4
HW threads need 4.89, secs ( See the [Original test_stress_sort.cpp ] )

 This show countertree::merge_sort is a good for to use in a parallel SW
and GCC std::stable_sort is bad.

 If you see the results of the countertree:parallel_merge_sort in the
benchmarks in the folder Modified, are really bad compared with the same
results in the folder Original. The difference is the internal algorithm.
In the folder Modified use GCC std::stable_sort, and in the Original use
countertree::merge_sort

 The GCC std::sort and countertree::intro_sort have similar performance
with all the size of elements. But in the parallel process, when the size
of the elements grow, the poor scalability appear. The next lines are are
from Original benchmark_sort_05.cpp on my computer

 With 128 bytes elements

Random elements, few repeated -----------------------

STL std::sort :2.9188 secs

countertree::intro_sort :2.87864 secs

parallel::sort :2.46113 secs

tbb::parallel_sort :1.86458 secs

countertree::parallel_sort:1.89263 secs

 With 256 bytes elements

Random elements, few repeated ( rand() )-----------------------

STL std::sort :3.03143 secs

countertree::intro_sort :3.02733 secs

parallel::sort :2.23231 secs

tbb::parallel_sort :1.76186 secs

countertree::parallel_sort:1.56509 secs

  As we can see the parallel process is ineffective with GCC parallel::sort
with big objects.

 About the merge_sort algorithm. It use additionally half of the memory
used by the data. But if you don't use additional memory, the speed drops.
I didn't examine in deep the Algorithm used in GCC, I only codified my own
algorithm. I will examine, and I will try to reduce the memory used,
without lost of speed and scalability.

 About the server for to test. I don't have modern servers in my school.
The budget cut eliminate the new servers. I have an I7 Xeon , but is a 3
years old machine with 4 HW threads, and in the benchmarks is slower than
the I3 and I5 machined mounted this year. If someone can check and provide
us the result, I will be grateful. The version 4.9.1 of GCC have better
optimization than the 4.8.2. If you are using Ubuntu, the version 14.10
have the 4.9.1 version as predefined.

 Yours

 Francisco Tapia




Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk