Boost logo

Boost :

Subject: Re: [boost] [SORT] Parallel Algorithms
From: Steven Ross (spreadsort_at_[hidden])
Date: 2015-03-23 06:46:19


On Fri, Mar 20, 2015 at 9:12 AM Francisco José Tapia <fjtapia_at_[hidden]>
wrote:
>
> I have the final version of all the algorithms. I wrote 6 algorithms of
> stable sort, and selected the fastest. On my computer is about 15% faster
> than the GCC stable sort.
>

Could you send them please? I'd like to verify the speedup you're seeing,
and compare vs TBB. I'm ok if it just is comparable in speed to TBB and it
has additional advantages of some type.

>
> I had problems too with an error in an algorithm mixed with an error in
> the GCC dynamic memory system, Under certain sequence of request, the
> system provide a non valid address and due this a segmentation fault. The
> debug for to detect and correct had been a nightmare.
>

One thing to watch out for is undefined operations, such as type
conversions between int and float types; those aren't compiler bugs:
they're undefined (this problem hit me with float_sort, before someone
showed me how to convert safely). I hope you've fixed the bug.

> The algorithms provided are :
>
> 1.- sort : The algorithm is intro_sort. Don't need additional memory.
> Performance similar to the GCC version and Windows version. Non stable

> 2.- parallel_sort : Decompose the global range in small ranges for to be
> sorted with intro_sort. Don't need additional memory. Similar performance
> than the TBB version and Microsoft version. Non Stable
>

Does it provide any advantage over those implementations, aside from being
open-source?

>
> 3.- stable_sort : The algorithm is called smart_merge_sort, and is about
> 10-15% faster than the GCC stable sort version, The additional memory is an
> half of the memory used by the data.( The GCC stable sort use the same
> memory than the data)
>

GCC uses no additional memory, or 100% additional memory?
Did you compare against the tbb stable sort I pointed you to?

>
> 4.- sample_sort : is a parallel stable sort. Have the same performance than
> the TBB implementations, and much more faster than the GCC parallel
> stable_sort. The additional memory is the same than the data
>
> 5.- parallel_stable_sort : This algorithm is based on the sample sort.
> Increase the sample_sort time about a 5%, but the additional memory used is
> an half of the memory used by the data.
>

That's interesting, so it's better from a memory perspective than the
competing algorithms, and not bad in speed? That sounds like it's worth
serious consideration as an addition.

> The functions are in the namespàce boost::sort All the algorithms are
> exception safe, and use indirect sort when the size of the objects is big.
> According to my benchmarks, the size limit for to change a indirect sort is
> 128 bytes.
>
> The function format of the parallel algorithms are
>
> algorithm ( first, last, compare , number_threads).
>
> The number of threads is passed with an struct NThreads which receives an
> unsigned, and return an unsigned. The default construct of this struct is
> filled with the HW threads
>
> At today I am testing on Windows with Visual Studio 2013 CTP 4, and
> changing some things of the code due to the incompatibility with C++11
> features
>
> In a few days I will document all the functions of the code for generate a
> doxygen documentation, and can send you a version of the code.
>
> Then we must talk about how to design the benchmark. I think the starting
> point can be your experience with your code
>

Please look at this code, and refactor/reuse what test code you can:
https://github.com/boostorg/sort
Specifically, look in the test directory, the example directory, and look
in tune.pl. Are there any parameters you would want to tune for specific
systems (besides the number of threads)?
The basic idea is that randomgen writes out a randomized distribution to
disk, and then we run whatever algorithms we want to compare (writing their
results to disk), verify that the results are correct, then compare speed.
I'd prefer not to skip the disk intermediate step because having the files
on disk makes it easier to debug when something gives incorrect results.
We should compare with variable-length strings and ints, in addition to
more complicated data structures like what you've been testing with.
It's worth noting that it's quite common to use vectors for large
quantities of data in a class/struct, in which case the sorting would
already be indirect.

>
> The C++ committee propose a format for to call the parallel functions
> described in
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
>
> it would be a good idea to examine, and decide if follow the proposal, if
> we don't follow or if we provide the two interfaces for the parallel
> algorithms. According with my first impression it's not very complex to
> implement.
>

With your implementations, it looks like people have these choices:
1) how many threads to use
2) stable or not
3) memory or speed efficient
4) direct or indirect

Note that #4 may be influenced by #3, and that you may be able to determine
that automatically, but providing people a simple way to make those choices
is a good idea. You could use the approach suggested by the committee if
it fits well, or another one if you have a better approach to this specific
problem.

Steve


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk