Boost :

Subject: Re: [boost] [SORT] Parallel Algorithms
Date: 2015-03-25 15:57:22

Hi Steven,

thanks by you fast response. I want clarify several questions of your
message. Please, excuse me my English. It's not my mother tongue, and I
have errors and imprecisions.

TEST OBJECTS

The object used in the test ( Reburcio is a word of an exercise from my
first pascal book, and don't have any sense) In the construction , copy
construction and operator =, copy all the elements with a loop. The only
one operation which use only the first element is the comparison.

I had done a new comparison, and compare the sum of all the elements of the
object. The time are similar, you must add 5% in the small objects to 15%
in the big objects.

TIMSORT

The Tim Sort code is not mine. Tim Sort is used in Java and Android, and is
specially efficient with big objects. ( You can see in
benchmark_single_stable_sort_objects.txt). I use it for to speed compare. I
didn't explore many about Tim Sort, because I saw the methods based on
merge are faster, and I discarded the algorithm.

If you think , it deserve a second opportunity, we can look for new
implementations and check the speed in order to take a decision. I didn't
measure the auxiliary memory used by Tim Sort. Wikipedia talk about an
auxiliary memory O(N), but perhaps to say it's equal to the memory used by
the data is not correct.

In the next benchmark programs, as say Steven, the data are loaded from a
file, and the program execute only 1 algorithm. With this configuration
check the memory used is simple. Now it's very difficult because the
program execute all the algorithms, and it's not possible to know how many
memory use each algorithm.

DESIGN PHASE

Now , we are in the design phase. Nothing is fixed. We can comment and
change anything.

The specification is simple, provide sort algorithms, stable and not
stable, for to run with 1 thread or with any number of threads, without any
other external library ( as Open MP , TBB, CilkPlus..). Only with the C++
standard.

opinions. Any help is welcome. The final decision about if the code is
included or not in the boost sort library, depend of Steven, the creator of
the library.

After the design phase, with the decisions taken and the interfaces fixed.
We prepare the final version, redesign a test system, test in deep and
create the documentation.

This SW provide of other project for the Boost library ( Contertree,
concurrent red-black trees, with access by position , like the vectors),
pending the final revision since near two years ago.

When began, the implementation of parallel non stable sort, have similar
performance than the GCC version, based on OpenMP, and the TBB version.

The stable sort with 1 thread was slower than the GCC version, and the
parallel stable sort was faster than the GCC version based on OpenMP.

When someone mentioned the TBB sample sort implementation, even when is
experimental, don't destroy the object in the auxiliary memory, and it's
not exception safe. We began a crisis. That version was 20-30% faster than
our version. I want tho show my admiration by TBB and their team. I think
they are the best doing concurrent SW.

I tried to improve, I would like to have more time, more equipment, and
better documentation on order to find the best options, but I don't have,
and I do this in my free time after my work and my family.

I designed 6 stable sort algorithms, for to select the fast (internally
named smart_merge_sort, about 10% faster than the GCC version). I prepared
a version of Sample Sort, only with threads and atomic variables. In the
benchmarks have similar performance than the TBB version, use one half of
the auxiliary memory used by TBB and is exception safe.

In my opinion, the parallel stable sort is reasonably good for to be
included in the final version. The important case, which shows the quality
of the algorithm, is when the data are unsorted. I will try to implement
the detection of the sorted elements , like in TBB, but I am not sure about
the results. Th internal loops of the algorithm are delicate, and a small
change , can produce big changes in the performance, as happened in other
algorithms

We are in the design phase. We must test with others machines, big and
small. Perhaps we must change or redesign things. But we need help. If
someone test the programs, please send me the file result and the machine
description ( Processor , number of cores,speed, memory , cache ) in order
to obtain conclusions, fix the code and pass to the next phase.