|
Boost : |
Subject: Re: [boost] [parallel_sort] Proposal
From: Edouard A. (edouard_at_[hidden])
Date: 2009-02-02 13:21:37
On Mon, 02 Feb 2009 17:41:58 +0000, "Phil Endecott"
<spam_from_boost_dev_at_[hidden]> wrote:
> So basically it's something like this:
>
> thread t( sort(begin,mid) );
> sort(mid,end);
> t.join();
> merge(begin,mid,end); // [*]
Exactly, with some differences as I do a memory copy before the thread and
I run two threads.
Working on different buffers allows to prevent hidden system locks when
you reach the middle.
so it's more like
buffer = new [];
copy(...)
thread t1(sort(buffer,...);
buffer2 = new [];
copy(...)
thread t2(sort(buffer2, ...);
t2.join();
t1.join();
merge(buffer, buffer2, result);
> The trouble with this is that one of the threads will terminate before
> the other, e.g. because its data was easier to sort or was in
> more-local memory, and it will then do nothing until the other thread
> (and all its sub-threads) has also finished. It would be better to
> keep all threads occupied for as much of the time as possible, e.g. by
> dividing the problem up into chunks and giving the chunks to threads as
> and when they are ready to handle them.
You're absolutely right and this is my primary concern.
That's what the TBB does, I wanted to try something more simple.
Doing what you describe requires writing a scheduler and a dispatcher for
you threads. This is costly. If you create it for each parallel_sort call
then you get a huge performance hit. You could also decide to create it
only for the first parallel_sort call which means that the following calls
won't get the hit (or create it at runtime).
> You mentioned the Intel implementation before. Have you benchmarked
> that in comparison with your algorithm?
I'm in the process of doing so. If the TBB does really better then I will
see how I could use their approach, or maybe an hybrid approach.
> I guess that you may have benchmarked with random input data. This
> will tend to favour your method as the input is uniformly difficult to
> sort. You should also benchmark with input that is less homogeneous,
> e.g. data that's mostly sorted but with local permutations, etc. Is
> anyone aware of synthetic or real benchmark data sets for sorting?
I think there is.
I test on random and sorted data.
I will verify again tonight and give you the results.
> [*] Of course a fundamental issue for this algorithm is that even if
> the two sub-sorts are equally time-consuming, the final merge is still
> a sequential operation. But I just wondered whether it would be
> possible to have one thread start doing
>
> in_place_merge(begin,mid,end);
>
> while the other thread starts at the other end, going backwards:
>
> in_place_merge(rbegin,reverse_iterator(mid),rend);
>
> Of course they will crash into each other in the middle. But maybe
> it's possible to do this in some sort of safe lock-free manner?
It's maybe more efficient to search in parallel in the two arrays, build
some sort of index
and then merge.
But you are right I was lazy to use std::merge instead of writing my own
parallel_merge.
-- EA
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk