Subject: Re: [boost] sorting floats by casting to integer
From: Steven Ross (spreadsort_at_[hidden])
Date: 2009-07-06 11:20:37
On Sun, Jul 5, 2009 at 9:17 AM, Phil Endecott <
> If the problem were simply that "x86 FP is very slow", then this problem
> would be evident with std::sort as well as your algorithm. I've therefore
> done a simple test with a program like the following:
Agreed Phil. I do see it with std::sort. Also Vladimir also saw a large
runtime difference with 23-bit random data, though not quite as large.
> Note that I haven't actually checked if I get the right answers from this,
> so please let me know if you can see any mistakes! Also note that I have
> not done any fixup for negative floats, so some small additional time would
> be needed for that.
> I've run variations of this code with the following results:
> vector<int>, std::sort(begin,end) : 3.71 s
> vector<float>, std::sort(begin,end) : 9.12 s
> vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.19 s
My results, with 32-bit random data, which is what you appear to be using
(you did not provide your random_fill function; I used randomgen 16 16
10000000 to get these results):
vector<float>, sorting as integers (my approach) : 1.417s and data is
vector<float>, std::sort(begin, end): 2.793s and data is correctly sorted
vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.291s and data
is not correctly sorted (because negatives aren't flipped)
I set NaNs (I get 3975 of them) to 0.0 in my testing so as to have a unique
correct sorted output.
I'm using a Core 2 Duo.
I think I've figured out what's going on with my 23-bit random data: floats
on x86 have some form of optimization to look at the exponent first, and
then the coefficient afterwards. On x86 systems, the comparison is
extremely slow if the exponents are the same. With 23-bit random data, all
the exponents are the same. With 32-bit random data, there will be 512
different values for the exponents, splitting the identical-exponent
problems much smaller, to the point where they are close to the size of the
cache (in-cache sorting is extremely fast).
With realistic data, there will be multiple exponents, but not as many as
there are data elements. More likely, there will be a small range of
exponents, and for each of them, many elements with the same exponent.
Also, a denial-of-service attack could easily be made against a system that
sort floats by just feeding it floats with the same exponent, but different
When I test with 26-bit random data, where there are 8 different exponent
values (a more realistic case), here are my results:
sorting floats as integers (my approach): 1.473s
(I used randomgen 10 16 10000000 to generate the input data)
So not 120X, but still a substantial difference.
Also worth noting is that when the element count increases, the speed
difference increases, even on 32-bit random data.
randomgen 16 16 200000000:
sorting floats as integer: 21.967
Which concurs with the splitting-on-exponent argument.
I'll test with SSE and other compiler settings next. Is there an easy way
to add these with bjam? Otherwise I'll just use MSVC directly.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk