Boost logo

Boost :

Subject: Re: [boost] [Multiprecision] Benchmarking
From: Jeffrey Lee Hellrung, Jr. (jeffrey.hellrung_at_[hidden])
Date: 2012-06-14 13:12:12


On Thu, Jun 14, 2012 at 8:40 AM, Phil Endecott <
spam_from_boost_dev_at_[hidden]> wrote:

> Dear All,
>
> I have attempted to evaluate the proposed multiprecision library by
> applying it to the "Dealaunay Flip" function for 32-bit integer
> coordinates, as this is one of the few things that I've ever implemented
> that needed more than 64-bit integers:
>
> inline bool delaunay_test(int32_t ax, int32_t ay, int32_t bx, int32_t by,
> int32_t cx, int32_t cy, int32_t dx, int32_t dy)
> {
> // Test whether the quadrilateral ABCD's diagonal AC should be flipped to
> BD.
> // This is the Cline & Renka method.
> // Flip if the sum of the angles ABC and CDA is greater than 180 degrees.
> // Equivalently, flip if sin(ABC + CDA) < 0.
> // Trig identity: cos(ABC) * sin(CDA) + sin(ABC) * cos(CDA) < 0
> // We can use scalar and vector products to find sin and cos, and simplify
> // to the following code.
> // Numerical robustness is important. This code addresses it by
> performing
> // exact calculations with large integer types.
>
> i64 ax64 = ax, ay64 = ay, bx64 = bx, by64 = by,
> cx64 = cx, cy64 = cy, dx64 = dx, dy64 = dy;
>
> i64 cos_abc = (ax64-bx64)*(cx64-bx64) + (ay64-by64)*(cy64-by64);
> i64 cos_cda = (cx64-dx64)*(ax64-dx64) + (cy64-dy64)*(ay64-dy64);
> if (cos_abc >= 0 && cos_cda >= 0) return false;
> if (cos_abc < 0 && cos_cda < 0) return true;
>
> i64 sin_abc = (ax64-bx64)*(cy64-by64) - (cx64-bx64)*(ay64-by64);
> i64 sin_cda = (cx64-dx64)*(ay64-dy64) - (ax64-dx64)*(cy64-dy64);
>
> i128 sin_sum = static_cast<i128>(sin_abc)***cos_cda
> + static_cast<i128>(cos_abc)***sin_cda;
>
> return sin_sum < 0;
> }
>
>
> As you can see, with 32-bit inputs, the first set of computations yield
> 64-bit results and about half of calls terminate at this stage. Otherwise
> a calculation yielding a 128-bit result is carried out. In practice, the
> results of the subtractions are typically small and the 128-bit value
> rarely has more than 64 significant bits.
>
> I have compared the following types used for i64 and i128 in the code
> above:
>
> (1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for
> comparison)
> (2) int64_t and my own "quick and dirty" int128_t.
> (3) cpp_int.
> (4) int64_t and cpp_int.
> (5) cpp_int without expression templates.
> (6) int64_t and cpp_int without expression templates.
> (7) int64_t and mp_int128_t.
> (8) mpz_int.
> (9) int64_t and mpz_int.
>
> Mean times per function call are as follows:
>
> (1) 39.3 ns
> (2) 61.9 ns
> (3) 513 ns
> (4) 103 ns
> (5) 446 ns
> (6) 102 ns
> (7) 102 ns
> (8) 2500 ns
> (9) 680 ns
>
> (AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
>
> Note that:
> - GMP is slowest, despite oft-repeated claims about its good performance.
>

I think Luke made a good point about attempting to minimize allocations,
but nonetheless I suppose this might not be too surprising; I've been under
the impression that GMP is for *big* ints.

> - Expression templates make the code slower.
>

Hmmm...John: comments?

> - Even the fastest Boost.Multiprecision type, mp_int128_t, is
> significantly slower than my own quick hack.
>
> This is all a bit disappointing. Perhaps the problem is that I'm dealing
> only with 128 bits and the intended application is for much larger values?
>

I think so, and you probably identified a weak point of the library below.

> I am reminded of my experience with xint last year. See e.g. this thread:
>
> http://thread.gmane.org/gmane.**comp.lib.boost.devel/215565>
>
> The current proposal seems much better - xint was hundreds of times slower
> than my ad-hoc code - but I still find it discouraging. I suspect that the
> performance difference may be attributable to the use of sign-magnitude
> representation; as a result, the implementation is full of sign-testing
> like this:
>
> template <unsigned MinBits, bool Signed, class Allocator>
> inline void eval_subtract(cpp_int_backend<**MinBits, Signed, Allocator>&
> result, const signed_limb_type& o)
> {
> if(o)
> {
> if(o < 0)
> eval_add(result, static_cast<limb_type>(-o));
> else
> eval_subtract(result, static_cast<limb_type>(o));
> }
> }
>
> I think I understand the motivation for sign-magnitude for variable-length
> representation, but I don't see any benefit in the case of e.g. mp_int128_t.
>

John, perhaps this is something that can be configurable?

> I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit
> multiplication. Here is my version, based on 32x32->64 multiplies:
>
> inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
> {
> // Make life simple by dealing only with positive numbers:
> bool neg = false;
> if (a<0) { neg = !neg; a = -a; }
> if (b<0) { neg = !neg; b = -b; }
>
> // Divide input into 32-bit halves:
> uint32_t ah = a >> 32;
> uint32_t al = a & 0xffffffff;
> uint32_t bh = b >> 32;
> uint32_t bl = b & 0xffffffff;
>
> // Long multiplication, with 64-bit temporaries:
>
> // ah al
> // * bh bl
> // ----------------
> // al*bl (t1)
> // + ah*bl (t2)
> // + al*bh (t3)
> // + ah*bh (t4)
> // ----------------
>
> uint64_t t1 = static_cast<uint64_t>(al)*bl;
> uint64_t t2 = static_cast<uint64_t>(ah)*bl;
> uint64_t t3 = static_cast<uint64_t>(al)*bh;
> uint64_t t4 = static_cast<uint64_t>(ah)*bh;
>
> int128_t r(t1);
> r.high = t4;
> r += int128_t(t2) << 32;
> r += int128_t(t3) << 32;
>
> if (neg) r = -r;
>
> return r;
> }
>
>
> So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.
>
> But we can't write this:
>
> int64_t a, b ...;
> int128_t r = a * b;
>
> because here a*b is a 64x64->64 multiply; instead we have to write
> something like
>
> r = static_cast<int128_t>(a)*b;
>
> so that our operator* overload is found; but this operator* doesn't know
> that its int128_t argument is actually a promoted int64_t, and it has to do
> an 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64
> multiplies. The only chance of avoiding the extra work is if the compiler
> can do constant-propagation, but I have my doubts about that.
>

The above situation seems to be unavoidable given the present inability to
mix types, which, as John has discussed previously, is an issue he's
(understandably) completely ducking at the moment.

I wonder also if there is a non-negligible penalty for abstracting all
backend operations as, essentially, op-assign operations, as in "x = y +
z", it seems one must loop twice, once over, say, y and x to effect "x =
y", then once over x and z to effect "x += z".

> Any thoughts?
>
> My test code is here:
http://chezphil.org/tmp/boost_**mp_review.tbz2>
>
> $ make
> $ cat nn100000.dat | benchmark
>
> (Note that this needs a fix to the resize() issue reported earlier in
> order to get accurate answers, but that doesn't much alter the execution
> time.)
>
>
> Regards, Phil.
>

Thanks, Phil, for investigating the viability of using Multiprecision for
your application!

- Jeff


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk