Subject: Re: [boost] [parallel_sort] Proposal
From: Edouard A. (edouard_at_[hidden])
Date: 2009-02-02 05:08:05


On Mon, 02 Feb 2009 09:57:40 +0000, "Phil Endecott"
<spam_from_boost_dev_at_[hidden]> wrote:

>> Current benchmarks on my Q6600 vs2008-64-bit default allocator:
>>
>> 2 threads : 160 % faster
>>
>> 4 threads : 260 % faster
>
> So you're getting a super-linear speedup going from 1 to 2 processors?
> That seems odd.
>
> What size input does the benchmark relate to? I'm curious to know
> whether you still get a worthwhile speedup for relatively small inputs.

No, sorry, that was poorly worded: I get speedup factors of 1.6 and 2.6,
i.e. 160 % of the single-threaded speed, not 160 % faster.

For small inputs it's very simple: anything under 1000 elements is not
parallelized, because of the thread start-up overhead.

Finding the right threshold value would require some precise benchmarking.
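A minimal sketch of that cut-off, assuming random-access iterators; the
function name and the structure are illustrative only (the 1000-element
threshold is the value quoted above, not a measured optimum):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: below the threshold, spawning threads costs more
// than it saves, so fall back to the sequential sorter.
template <typename Iter>
void sort_small_aware(Iter first, Iter last) {
    static const std::size_t kMinParallel = 1000;  // tuning value from the post
    if (static_cast<std::size_t>(last - first) < kMinParallel) {
        std::sort(first, last);  // sequential fallback, no thread overhead
        return;
    }
    // ...a real implementation would split the range and sort the halves
    // in parallel here; we just sort sequentially as a placeholder.
    std::sort(first, last);
}
```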

>> It can be heavily customized. I already offer the possibility to choose
>> between quick sorting and merge sorting, and for quick sorting I offer
> two
>> pivot algorithms : fast or secure. Then you can of course specify a
>> predicate and a fallback sorting algorithm for when you run out of
> threads.
>
> How about run-time selection between quicksort and mergesort, in the
> style of introsort? This seems hard to beat in uniprocessor
applications.

Not really needed for merge_sort: it has no pathological behaviour the way
quicksort does. The reason you would want to use quicksort instead is if
the extra memory allocation is an issue.

>
>> Nevertheless, I think it's time to open up the code and start getting
>> feedback.
>
> Yes please. You haven't said anything about how it actually works...

The number of threads is specified as a template parameter, so it is fixed
at compile time. If you wish to have a multithreaded version that uses 2
threads you write

parallel_sort<2>(v.begin(), v.end());

4 threads?

parallel_sort<4>(v.begin(), v.end());

What happens then is a recursive template call that generates the right
amount of threading at compile time. That means that, apart from thread
creation and termination, there is little overhead.

For example, for 4 threads you have :

parallel_sort<4> -> parallel_sort<2> -> parallel_sort<1>
                                     -> parallel_sort<1>
                 -> parallel_sort<2> -> parallel_sort<1>
                                     -> parallel_sort<1>

This code is generated at compile time.

The one-thread version uses the fallback sorter, which is std::sort by
default.

The default multi-thread version merges the results of the underlying calls
(there is also a quicksort variant).
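The recursion described above can be sketched roughly as follows. This is
only an illustration of the idea, not the actual library code: the class
name `parallel_sorter` is invented, `std::thread` stands in for whatever
threading facility the library actually uses, and the merge step uses
std::inplace_merge for brevity:

```cpp
#include <algorithm>
#include <thread>

// Hypothetical sketch: parallel_sort<N> splits the range in half, sorts
// each half with parallel_sort<N/2> (one half on a new thread), then
// merges. The recursion unrolls at compile time, exactly like the
// parallel_sort<4> -> 2x parallel_sort<2> -> 4x parallel_sort<1> tree.
template <unsigned Threads>
struct parallel_sorter {
    template <typename Iter>
    static void sort(Iter first, Iter last) {
        Iter mid = first + (last - first) / 2;
        // Left half on a separate thread, right half on this one.
        std::thread left([=] { parallel_sorter<Threads / 2>::sort(first, mid); });
        parallel_sorter<Threads / 2>::sort(mid, last);
        left.join();
        std::inplace_merge(first, mid, last);  // combine the two sorted halves
    }
};

// Base case: one thread left, use the fallback sorter (std::sort here).
template <>
struct parallel_sorter<1> {
    template <typename Iter>
    static void sort(Iter first, Iter last) {
        std::sort(first, last);
    }
};

template <unsigned Threads, typename Iter>
void parallel_sort(Iter first, Iter last) {
    parallel_sorter<Threads>::sort(first, last);
}
```

Used as in the examples above: parallel_sort<4>(v.begin(), v.end());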

Cheers.

-- 
EA

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk