Boost logo

Boost :

Subject: Re: [boost] [parallel_sort] Proposal
From: Edouard A. (edouard_at_[hidden])
Date: 2009-02-02 13:21:37


On Mon, 02 Feb 2009 17:41:58 +0000, "Phil Endecott"
<spam_from_boost_dev_at_[hidden]> wrote:

> So basically it's something like this:
>
> thread t( sort(begin,mid) );
> sort(mid,end);
> t.join();
> merge(begin,mid,end); // [*]

Exactly, with some differences as I do a memory copy before the thread and
I run two threads.

 Working on different buffers allows to prevent hidden system locks when
you reach the middle.

so it's more like

buffer = new [];
copy(...)
thread t1(sort(buffer,...);

buffer2 = new [];
copy(...)
thread t2(sort(buffer2, ...);

t2.join();
t1.join();

merge(buffer, buffer2, result);

> The trouble with this is that one of the threads will terminate before
> the other, e.g. because its data was easier to sort or was in
> more-local memory, and it will then do nothing until the other thread
> (and all its sub-threads) has also finished. It would be better to
> keep all threads occupied for as much of the time as possible, e.g. by
> dividing the problem up into chunks and giving the chunks to threads as
> and when they are ready to handle them.

You're absolutely right and this is my primary concern.

That's what the TBB does, I wanted to try something more simple.

Doing what you describe requires writing a scheduler and a dispatcher for
you threads. This is costly. If you create it for each parallel_sort call
then you get a huge performance hit. You could also decide to create it
only for the first parallel_sort call which means that the following calls
won't get the hit (or create it at runtime).
 
> You mentioned the Intel implementation before. Have you benchmarked
> that in comparison with your algorithm?

I'm in the process of doing so. If the TBB does really better then I will
see how I could use their approach, or maybe an hybrid approach.

> I guess that you may have benchmarked with random input data. This
> will tend to favour your method as the input is uniformly difficult to
> sort. You should also benchmark with input that is less homogeneous,
> e.g. data that's mostly sorted but with local permutations, etc. Is
> anyone aware of synthetic or real benchmark data sets for sorting?

I think there is.

I test on random and sorted data.
I will verify again tonight and give you the results.

> [*] Of course a fundamental issue for this algorithm is that even if
> the two sub-sorts are equally time-consuming, the final merge is still
> a sequential operation. But I just wondered whether it would be
> possible to have one thread start doing
>
> in_place_merge(begin,mid,end);
>
> while the other thread starts at the other end, going backwards:
>
> in_place_merge(rbegin,reverse_iterator(mid),rend);
>
> Of course they will crash into each other in the middle. But maybe
> it's possible to do this in some sort of safe lock-free manner?

It's maybe more efficient to search in parallel in the two arrays, build
some sort of index
and then merge.

But you are right I was lazy to use std::merge instead of writing my own
parallel_merge.

-- 
EA

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk