
Subject: Re: [boost] [sort] Parallel sorting sub-library mini-review. Performance tests.
From: Christophe Henry (christophe.j.henry_at_[hidden])
Date: 2016-11-22 15:23:19


>>
>> OTOH the single-threaded ones are interesting, especially the stable_sort
>> and intro_sort for cases where usage of spreadsort is not possible.
>>

>I challenge you to find a sorting problem where it isn't possible to use
>spreadsort. string_sort can sort anything that std::sort can;
> you just need to convert the sort key comparison sequence into a sequence
> of bytes and feed string_sort the correct functors. That said, it may not
> be efficient to sort some problems with string_sort, and it'll probably be
> too much work to be worth the effort for many problems.

This is what I meant. Sorry for the lack of precision.
I'll add it to my performance tests, it might be an interesting variant as
the algorithm is really fast.
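To illustrate the key-to-bytes idea mentioned above, here is a minimal, hypothetical sketch (not part of Boost.Sort): a helper that maps a signed 64-bit key to bytes whose lexicographic order matches numeric order, which is the kind of transformation a string_sort functor would perform.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Boost.Sort): map a signed
// 64-bit key to 8 bytes whose lexicographic, most-significant-
// byte-first order matches the numeric order of the key.
// Flipping the sign bit turns the signed range into an unsigned
// one; storing big-endian makes byte-wise comparison agree with
// numeric comparison, which is what string_sort operates on.
std::array<unsigned char, 8> key_to_bytes(std::int64_t key) {
    std::uint64_t u =
        static_cast<std::uint64_t>(key) ^ (std::uint64_t{1} << 63);
    std::array<unsigned char, 8> out{};
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<unsigned char>(u >> (56 - 8 * i));
    return out;
}
```

With functors returning such bytes, string_sort can order any type whose comparison reduces to a fixed-size key.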

>>
>> A short summary:
>>
>> 100000000 uint64_t elements already sorted:
>> Boost parallel sort : 0.064965 secs
>>
>> Asynchronous parallel sort : 0.0167134 secs (same with other
>> algorithms)
>>
>> Asynchronous provides a special optimization for this case.
>>

>If this optimization is practical to incorporate in Boost parallel sort,
>it is a common case and your solution seems noticeably faster,
>so I would like Francisco to consider it. That said, .06s isn't bad.

Sure.

>>
>> I added this one:
>> 100000000 uint64_t elements reverse sorted
>>
>> Boost parallel sort : 0.581381 secs
>>
>> Asynchronous parallel sort : 0.0478701 secs
>>
>> Asynchronous provides a special optimization for this case.
>> I think the library should do it too. This case is pretty common and a
>> popular DoS attack.
>>

> This optimization would be nice to have, if it's cheap.

A parallel reverse runs in O(n), so it's cheap compared to a sort.
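The two cheap pre-checks discussed above (already sorted, reverse sorted) can be sketched sequentially; `handle_trivial_cases` is a hypothetical name, and the reverse would be parallelized in the real implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical O(n) pre-pass before the real parallel sort.
// Already sorted: nothing to do. Reverse sorted: one reverse
// (easily parallelized by splitting the range) fixes it.
template <class It>
bool handle_trivial_cases(It first, It last) {
    if (std::is_sorted(first, last))
        return true;  // already sorted, done
    if (std::is_sorted(first, last,
                       [](auto const& a, auto const& b) { return b < a; })) {
        std::reverse(first, last);  // reverse sorted: O(n) fix
        return true;
    }
    return false;  // fall through to the full parallel sort
}
```

Returning false tells the caller to run the ordinary sort; both checks stop at the first out-of-order pair, so the pre-pass costs little on random input.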

>>
>> 100000000 uint64_t elements randomly filled
>>
>> Asynchronous parallel quickspreadsort: 0.587432 secs
>> Asynchronous quick_intro_sort : 0.728393 secs
>>
>> This mix of a quicksort degrading into a spreadsort works best here.

>I find this algorithm interesting;

Frankly, there is nothing special. It's simply a parallel partition, which
degrades into a "normal" parallel merge sort (in this case spreadsort)
after a while.
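The degrading scheme described above can be sketched single-threaded, with std::sort standing in for spreadsort/intro_sort and the partition done sequentially (the real version partitions in parallel); `quick_degrade_sort` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of a quicksort that partitions for a bounded number of
// levels and then hands the remaining range to a fallback sort.
// std::sort stands in for spreadsort / intro_sort here.
template <class It>
void quick_degrade_sort(It first, It last, int max_depth) {
    if (last - first < 2) return;
    if (max_depth == 0) {        // stop partitioning, degrade
        std::sort(first, last);  // fallback algorithm
        return;
    }
    auto pivot = *(first + (last - first) / 2);  // copy the pivot value
    // three-way partition: [< pivot) [== pivot) [> pivot)
    It mid1 = std::partition(first, last,
                             [&](auto const& v) { return v < pivot; });
    It mid2 = std::partition(mid1, last,
                             [&](auto const& v) { return !(pivot < v); });
    quick_degrade_sort(first, mid1, max_depth - 1);
    quick_degrade_sort(mid2, last, max_depth - 1);
}
```

In the parallel variant each partition step splits the work across threads, and the fallback sorts run concurrently on disjoint subranges.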

>have you considered using MSD radix-based sorting to break up the data
> for the threads in parallel?
>For example, take T threads, have each thread take 1/T of the data
> and bucket it into N MSD-based buckets (just like Spreadsort),
>where T < N <= ~2^11, then merge the equivalent buckets,
>and as you merge them hand them off to another thread to sort.
> This will probably divide better if you find the max and min before
> bucketing, using a parallel min/max search.

I admit I didn't. But it sounds really interesting and I think I'll have a
go at it.
I'm not a sort expert though. I approached this from the parallel
perspective.
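For reference, the suggested MSD bucketing can be sketched single-threaded (in the parallel version each bucket would go to its own thread, and the min/max search would be parallel too); N = 256 buckets here, and `msd_bucket_sort` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Single-threaded sketch of MSD-based bucketing: find min/max,
// split the value range into N buckets by the most significant
// bits, sort each bucket independently (a thread per bucket in
// the parallel version), then concatenate.
std::vector<std::uint64_t> msd_bucket_sort(std::vector<std::uint64_t> v) {
    if (v.size() < 2) return v;
    auto mm = std::minmax_element(v.begin(), v.end());
    std::uint64_t mn = *mm.first, mx = *mm.second;
    if (mn == mx) return v;  // all elements equal
    const std::size_t N = 256;
    std::uint64_t span = mx - mn;
    // choose a shift so that (span >> shift) < N, i.e. bucket by
    // the most significant bits of the normalized value
    int shift = 0;
    while ((span >> shift) >= N) ++shift;
    std::vector<std::vector<std::uint64_t>> buckets(N);
    for (std::uint64_t x : v)
        buckets[static_cast<std::size_t>((x - mn) >> shift)].push_back(x);
    std::vector<std::uint64_t> out;
    out.reserve(v.size());
    for (auto& bk : buckets) {
        std::sort(bk.begin(), bk.end());  // independent per-bucket sorts
        out.insert(out.end(), bk.begin(), bk.end());
    }
    return out;
}
```

Because bucketing preserves order between buckets (higher bits decide the bucket), concatenating the sorted buckets yields a fully sorted sequence with no merge step.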

>> 10000000 strings randomly filled
>>
>> OMP parallel sort : 1.05803 secs
>> Boost parallel sort : 0.933055 secs
>>
>> OMP parallel stable sort : 1.12269 secs
>> Boost sample sort : 0.889216 secs
>> Boost parallel stable sort : 1.56564 secs
>>
>> Asynchronous parallel quickspreadsort: 0.788856 secs
>> Asynchronous quick_intro_sort : 0.893652 secs
>>
>> Asynchronous parallel stable sort : 1.23495 secs
>> Asynchronous boost::stable sort : 1.21817 secs
>>
>> Similar results.
>>
>>
>> Let's move on to big objects:
>>
>> 1562500 elements of size 512 randomly filled
>> H E A V Y C O M P A R I S O N
>>
>> OMP parallel sort : 0.308923 secs
>> Boost parallel sort : 0.342992 secs
>>
>> Asynchronous parallel_quicksort : 0.246709 secs
>> Asynchronous quick_intro_sort : 0.269666 secs
>>
>> Both quicksorts are best, with a slight advantage to intro_sort.
>>
>> The light comparison is similar.
>>

>Do we have your permission to copy and/or adapt your quick_intro_sort
>algorithm into the Boost.Sort library?

Please feel free to copy whatever might be useful. The library has a Boost
license as it's intended for inclusion in Boost. I also took the liberty of
adding the single-threaded versions of Francisco's algorithms as a basis for
my parallel versions.
The algorithm is quite simple. It's the same as quick_spreadsort, a
quicksort which degrades into Francisco's intro_sort after some iterations.
I just found it interesting to add a wrapper to Francisco's algorithm for
the mini-review.
It was just a quick test (no time for cleverer algorithms) but I'm pretty
satisfied with the result and I'll keep these versions.

>How is the worst-case performance relative to Francisco's library?

The stable sort versions are O(n log n), like any merge sort. The quicksorts
range from O(n log n) to O(n^2), as expected from a quicksort.

Cheers,
Christophe


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk