Subject: Re: [boost] sorting floats by casting to integer
From: Steven Ross (spreadsort_at_[hidden])
Date: 2009-07-04 19:35:51
I'm replying to everybody at once; I appreciate all your input.
On Sat, Jul 4, 2009 at 1:52 AM, Edouard A. <edouard_at_[hidden]> wrote:
> > I've been investigating why float_sort is so much faster than std::sort on
> > x86 processors, and why sorting integers is so much faster than float_sort,
> > and came to the conclusion that floating-point comparisons are extremely
> > slow on x86.
>
> Hey Steven,
>
> Is it true even with SSE3 optimizations turned on and data aligned to
> 16-bytes boundaries?
>
> Which compiler were you using?
>
I am using MSVC 2005 standard edition with bjam's default release settings
for it. I have not tried other compiler settings, besides 64-bit
compilation, which has comparable performance.
This is on a Core 2 Duo purchased earlier this year for Boost development.
I saw the same behavior and the same magnitude of speedup for float_sort vs.
std::sort on floats on a 6 year old refurbished Dell Windows system using
gcc in Cygwin.
With two different compilers on processors manufactured at least 6 years
apart, I suspect it's an issue with the architecture.
There may be Intel processors that are faster for this purpose, but they
certainly aren't the two mass-market Intel processors I've tried, with
default optimizations.
If anybody would like to look at my testcase, it is in the boost vault,
inside algorithm_sorting.zip. Inside there, it is in
libs/algorithm/sorting/samples/float_as_integer.cpp. Even without using
integer_sort, and just using std::sort on integers with the code I posted
initially, the speedup is on the same order (80X). I tested using randomgen
7 16 1000000 (randomgen is compiled from my randomgen.cpp). 7 is the bit
range among the "high" 16 bits, the second number (16) is the bit range
among the "low" 16 bits, and 1000000 is the size of the vector to sort. 23
bits are used to exclude NaNs, which I assume to be uncommon in real data,
and which are sorted arbitrarily, meaning that two correct sorted results
containing NaNs will often not be identical. With mostly NaNs the speedup
with this approach is about 2X.
From the libs/algorithm/sorting directory (once copied into the boost tree),
you can test this way:
bjam --tune randomgen
randomgen 7 16 1000000
bjam --tune float_as_integer
float_as_integer (uses the approach I described)
float_as_integer -std (uses std::sort on floats)
(I see a huge difference in runtime on x86, but not on PowerPC)
> If you think about it, if casting to integer to compare floating point
> values is faster, why doesn't the processor do that itself?
> I have been told that some processors do.
PowerPC processors appear to do so, as floating-point sorting for them is
only slightly slower than integer sorting, much faster than the x86
processors I've tried.
I don't know why the x86 processors I've tested don't do it that way. My
best guess is that they do, but that their cast operation (switching between
floating-point and integer registers) is slow. Casting on every comparison
certainly was extremely slow for me, relative to making an integer copy once
and comparing that.
> Some floating point values that don't correspond to real numbers
> (like INF and NaN) have a very complicated behavior with respect to
> comparison. At least I remember that std::sort used to crash when
> trying to sort data containing NaNs. I think it has something to do
> with "a < b" being somehow related to "(a-b) < 0" or "0 < (b-a)"
> and a minus operation involving a NaN must yield a NaN.
You are correct.
a < NaN and NaN < a both return false, so a comparison-based sort sees NaN
as equivalent to every value. Other odd values have similar behavior.
As this places such invalid values in arbitrary places depending upon the
comparison-based algorithm you are using and the input order, I cannot see
how it is in any way superior to what integer sorting will do with it, which
is put NaNs in a specific, non-arbitrary place, based upon their bit value.
> Possibilities that occur to me include:
> - Unnecessary transfers between integer and floating point registers.
I suspect this is the cause of the delay, but these transfers may be
necessary.
> - Memory alignment issues.
Why? I see the same issue with doubles vs. 64-bit integers.
> - Lack of optimisation.
I'm compiling with bjam in release mode (msvc-8.0); with gcc I compiled with
-O3. All show similar results.
> - Something related to NaNs, de-normals etc (do you have these in your
> test data?).
I've tested with NaNs. As long as there aren't many NaNs, the runtime isn't
impacted much by them.
What I see is that std::sort is sped up by NaNs, as they are considered
identical to everything and thus less effort is required to "sort" them; I
don't consider 90% NaNs a realistic performance test. My integer-based
approach just cares about bits, so it puts the NaNs in a specific place and
takes roughly the same time as it does with normal floating-point values.
> Also, I don't see why you actually need to copy all the data to achieve
> what you're doing. Can't you cast it "in place" and then deal with the
> negative numbers (etc) as an in-place pre- or post-processing step?
With regard to casting to an integer: in further testing I found that if I
leave the values in their original vector of floats and cast each one to an
integer at comparison time, the speedup is roughly 2X, not 120X (slower than
my current "float_sort" algorithm, which gets 7X). The casting operation
must also be slow on x86 (it isn't on PowerPC), even though it's just this:
int result;
std::memcpy(&result, &(*floatiter), sizeof(float));
Note that type-punning a float to an integer through a pointer cast (rather
than memcpy) violates the strict aliasing rules and can cause undefined
behavior; memcpy is the portable way to copy the bits.
I'm thinking that the more general approach for key plus data sorting with
casting to an integer is this:
User provides first and last iterators, along with a method to cast values
to an integer.
A vector of (integer key, element) pairs is created.
This vector is then sorted on the integer key, carrying the element along as
data.
Once the sorting is complete, the elements are written back to the
corresponding positions in the range passed in. Memory overhead is one
integer plus one full element copy per element.
There may be more memory-efficient algorithms, but this should be simple and
fast. I've considered using an iterator as an index instead, but putting the
data into the correct order in O(n) time without substantial additional
memory is then tricky; the best such algorithm I could think of uses one
integer plus three iterators of memory overhead per element.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk