Boost :

Date view	Thread view	Subject view	Author view

From: Steven Ross (spreadsort_at_[hidden])
Date: 2008-07-05 17:05:30

Next message: David Abrahams: "Re: [boost] Proposed templated integer_sort"
Previous message: JOAQUIN M. LOPEZ MUÑOZ: "Re: [boost] [flyweight] rw_locking policy patch for review"
In reply to: Phil Endecott: "Re: [boost] Proposed templated integer_sort"
Next in thread: David Abrahams: "Re: [boost] Proposed templated integer_sort"
Reply: David Abrahams: "Re: [boost] Proposed templated integer_sort"

On Sat, Jul 5, 2008 at 4:46 AM, Phil Endecott <
spam_from_boost_dev_at_[hidden]> wrote:

> Steven Ross wrote:
>
>> I've attached a testcase.
>>
>
> Running this code, I see a 36% improvement.
>

Good; that confirms my results, and that Spreadsort can get a significant
improvement with the right constants on your data.

> It also casts the floats to ints because I haven't
>> written float_sort yet, and I don't have your fixed-point code to work
>> with.
>>
>
> The benchmarks I was doing before were simply multiplying the float by 1e6
> and rounding to an int. I was also only storing and sorting the actual
> number, not the associated data; I see that your code sorts everything. And
> I was measuring elapsed time rather than CPU time. I don't know how much
> difference any of that makes.

I've found CPU time to give slightly more stable results, particularly when
there are processes running in the background. That said, I wouldn't worry
about the difference.

> When I simply plug the new Constants.h into my existing test I see a
> further slowdown; it's actually slower than std::sort.
>

There is a good reason for that:
MAX_SPLITS is picked as the maximum number of bins that won't cause a cache
miss. When that value is exceeded, lookups to find which bin an item
belongs in will be performance limited by memory latency. For this reason,
my testing on sorting just keys (no data) finds that runtime increases
50-100% when MAX_SPLITS exceeds what can fit on a memory page.

These differences clearly show that constant-picking is important, and that
I should probably do some more work to try to gain some more performance.

At a high level, there are three main types of operation that take
significant runtime in Spreadsort (outside of calls to std::sort):
Iterating over the data vector -> this is done three times per iteration.
One iteration is for finding the max and min, the second is for finding how
many items are in each bin, and the third is part of the swap loop.
Swapping data when it is out of order in the core sorting loop.
Random accesses into the bin array array to find the correct bin. This is
done twice per element: once to determine how many items belong in a bin,
and a second time to decide which bin to swap an element into.

As long as the bins are small enough in size to avoid bin lookup cache
misses, the only cache misses are the swap operations, so there is an
average of 1 cache miss per element (assuming the data doesn't start in the
right bins).
When the number of bins increase beyond that point, there is an average of 3
cache misses per element, substantially increasing total runtime. On the
other hand, less iterations are necessary the greater the number of bins, so
if the bin count becomes large enough, it's theoretically possible for it to
run faster.
If the bin count becomes too close to n, overhead in processing bins can
also impact performance. In my testing, as long as LOG_MEAN_BIN_SIZE >= 3,
then this issue is not significant.
When the key is the only data involved, then swaps don't take much time,
because they involve moving only a small amount of data. When data is added
in addition to the key, swap time increases, eventually to the point where
it's faster to have less iterations even at the expense of more cache misses
on bin lookups. That explains why the cache-safe constants I sent you
initially are faster when sorting just keys, but the higher bin counts are
faster with key+data.

As Spreadsort generally becomes faster relative to std::sort as the amount
of non-key data increases, I normally optimize for the worst relative case,
key-only data, and for that, a cache-friendly bin count is clearly
superior. With that said, I may want to revisit the way I allocate the
number of bins, and use a different approach for larger data types.

There are 2 decisions currently made based upon tuning constants in
Spreadsort:
Recurse or call std::sort? This is pretty well handled by calling std::sort
if the number of items is less than a modest constant (usually 256), and
also checking for relative worst-case performance between the two, and
picking the one with the better worst-case performance. It might be useful
to call insertionsort for really small n if std::sort isn't well-optimized
for tiny n (say 16), but that's about the only improvement I can think of.
How many bins should be used? Changing this can make significant
differences in performance, as we just saw. I experimented a little with
the constants for the example I sent you (see attached), and found that
there is a slightly modified set of constants that shows a 38% speed
improvement on my system with my testcase. Most notable was that the ideal
MAX_SPLITS was 15, which is roughly half of the log of the range. That
suggests to me that it is fastest to split up the range half at a time in
even-sized chunks for this data set, when the DATATYPE is large. The
current approach works like this (in pseudocode):
logbincount = log(n) - LOG_MEAN_BIN_SIZE (normally 3)
if(logbincount > MAX_SPLITS)
logbincount = MAX_SPLITS

I'm thinking that if sizeof(T) > sizeof(Bin), then I should apply a
different approach:
logbincount = log(n) - LOG_MEAN_BIN_SIZE (normally 3)
if(logbincount > log(range))
  logbincount = log(range) // this will complete sorting in just one
iteration
else if(logbincount > MAX_SPLITS) {
  if(logbincount >= log(range)/2)
    logbincount = (log(range) + 1)/2; //+1 so that odd numbers round up, not
down
  else
    logbincount = MAX_SPLITS; //otherwise just minimize cache misses
}

This is basically special-casing two situations:
1) The range can be divided in a single iteration, in which case we should
just finish the job.
2) The range can be divided in two iterations, and it may often be faster to
do in two even steps.
In cases where there is more work to do, it won't try to take a shortcut,
because that invites lots of unnecessary cache misses.
I'm open to suggestions of better ways to handle this.

> These mailing list attachment archive links still don't work. Who is the
> mailman expert?
> They're accessible via gmane though.
>

Is there a better way to exchange testcases, given that I don't have a
personal website?

Steve

text/plain attachment: Constants.h

Next message: David Abrahams: "Re: [boost] Proposed templated integer_sort"
Previous message: JOAQUIN M. LOPEZ MUÑOZ: "Re: [boost] [flyweight] rw_locking policy patch for review"
In reply to: Phil Endecott: "Re: [boost] Proposed templated integer_sort"
Next in thread: David Abrahams: "Re: [boost] Proposed templated integer_sort"
Reply: David Abrahams: "Re: [boost] Proposed templated integer_sort"

Date view	Thread view	Subject view	Author view

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk