
Boost : 
Subject: [boost] [Multiprecision] Benchmarking
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 20120614 11:40:33
Dear All,
I have attempted to evaluate the proposed multiprecision library by
applying it to the "Dealaunay Flip" function for 32bit integer
coordinates, as this is one of the few things that I've ever
implemented that needed more than 64bit integers:
inline bool delaunay_test(int32_t ax, int32_t ay, int32_t bx, int32_t by,
int32_t cx, int32_t cy, int32_t dx, int32_t dy)
{
// Test whether the quadrilateral ABCD's diagonal AC should be
flipped to BD.
// This is the Cline & Renka method.
// Flip if the sum of the angles ABC and CDA is greater than 180 degrees.
// Equivalently, flip if sin(ABC + CDA) < 0.
// Trig identity: cos(ABC) * sin(CDA) + sin(ABC) * cos(CDA) < 0
// We can use scalar and vector products to find sin and cos, and simplify
// to the following code.
// Numerical robustness is important. This code addresses it by performing
// exact calculations with large integer types.
i64 ax64 = ax, ay64 = ay, bx64 = bx, by64 = by,
cx64 = cx, cy64 = cy, dx64 = dx, dy64 = dy;
i64 cos_abc = (ax64bx64)*(cx64bx64) + (ay64by64)*(cy64by64);
i64 cos_cda = (cx64dx64)*(ax64dx64) + (cy64dy64)*(ay64dy64);
if (cos_abc >= 0 && cos_cda >= 0) return false;
if (cos_abc < 0 && cos_cda < 0) return true;
i64 sin_abc = (ax64bx64)*(cy64by64)  (cx64bx64)*(ay64by64);
i64 sin_cda = (cx64dx64)*(ay64dy64)  (ax64dx64)*(cy64dy64);
i128 sin_sum = static_cast<i128>(sin_abc)*cos_cda
+ static_cast<i128>(cos_abc)*sin_cda;
return sin_sum < 0;
}
As you can see, with 32bit inputs, the first set of computations yield
64bit results and about half of calls terminate at this stage.
Otherwise a calculation yielding a 128bit result is carried out. In
practice, the results of the subtractions are typically small and the
128bit value rarely has more than 64 significant bits.
I have compared the following types used for i64 and i128 in the code above:
(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison)
(2) int64_t and my own "quick and dirty" int128_t.
(3) cpp_int.
(4) int64_t and cpp_int.
(5) cpp_int without expression templates.
(6) int64_t and cpp_int without expression templates.
(7) int64_t and mp_int128_t.
(8) mpz_int.
(9) int64_t and mpz_int.
Mean times per function call are as follows:
(1) 39.3 ns
(2) 61.9 ns
(3) 513 ns
(4) 103 ns
(5) 446 ns
(6) 102 ns
(7) 102 ns
(8) 2500 ns
(9) 680 ns
(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
Note that:
 GMP is slowest, despite oftrepeated claims about its good performance.
 Expression templates make the code slower.
 Even the fastest Boost.Multiprecision type, mp_int128_t, is
significantly slower than my own quick hack.
This is all a bit disappointing. Perhaps the problem is that I'm
dealing only with 128 bits and the intended application is for much
larger values?
I am reminded of my experience with xint last year. See e.g. this thread:
http://thread.gmane.org/gmane.comp.lib.boost.devel/215565
The current proposal seems much better  xint was hundreds of times
slower than my adhoc code  but I still find it discouraging. I
suspect that the performance difference may be attributable to the use
of signmagnitude representation; as a result, the implementation is
full of signtesting like this:
template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>&
result, const signed_limb_type& o)
{
if(o)
{
if(o < 0)
eval_add(result, static_cast<limb_type>(o));
else
eval_subtract(result, static_cast<limb_type>(o));
}
}
I think I understand the motivation for signmagnitude for
variablelength representation, but I don't see any benefit in the case
of e.g. mp_int128_t.
I also wonder about the work involved in e.g. a 64 x 64 > 128bit
multiplication. Here is my version, based on 32x32>64 multiplies:
inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
// Make life simple by dealing only with positive numbers:
bool neg = false;
if (a<0) { neg = !neg; a = a; }
if (b<0) { neg = !neg; b = b; }
// Divide input into 32bit halves:
uint32_t ah = a >> 32;
uint32_t al = a & 0xffffffff;
uint32_t bh = b >> 32;
uint32_t bl = b & 0xffffffff;
// Long multiplication, with 64bit temporaries:
// ah al
// * bh bl
// 
// al*bl (t1)
// + ah*bl (t2)
// + al*bh (t3)
// + ah*bh (t4)
// 
uint64_t t1 = static_cast<uint64_t>(al)*bl;
uint64_t t2 = static_cast<uint64_t>(ah)*bl;
uint64_t t3 = static_cast<uint64_t>(al)*bh;
uint64_t t4 = static_cast<uint64_t>(ah)*bh;
int128_t r(t1);
r.high = t4;
r += int128_t(t2) << 32;
r += int128_t(t3) << 32;
if (neg) r = r;
return r;
}
So I need 4 32x32>64 multiplies to perform 1 64x64>128 multiply.
But we can't write this:
int64_t a, b ...;
int128_t r = a * b;
because here a*b is a 64x64>64 multiply; instead we have to write
something like
r = static_cast<int128_t>(a)*b;
so that our operator* overload is found; but this operator* doesn't
know that its int128_t argument is actually a promoted int64_t, and it
has to do an 128x64>128 or 128x128>128 multiply, requiring 8 or 16
32x32>64 multiplies. The only chance of avoiding the extra work is if
the compiler can do constantpropagation, but I have my doubts about that.
Any thoughts?
My test code is here: http://chezphil.org/tmp/boost_mp_review.tbz2
$ make
$ cat nn100000.dat  benchmark
(Note that this needs a fix to the resize() issue reported earlier in
order to get accurate answers, but that doesn't much alter the
execution time.)
Regards, Phil.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk