
Boost : 
Subject: Re: [boost] [Multiprecision] Benchmarking
From: John Maddock (boost.regex_at_[hidden])
Date: 20120614 13:01:12
> I have compared the following types used for i64 and i128 in the code
> above:
>
> (1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for
> comparison)
> (2) int64_t and my own "quick and dirty" int128_t.
> (3) cpp_int.
> (4) int64_t and cpp_int.
> (5) cpp_int without expression templates.
> (6) int64_t and cpp_int without expression templates.
> (7) int64_t and mp_int128_t.
> (8) mpz_int.
> (9) int64_t and mpz_int.
>
> Mean times per function call are as follows:
>
> (1) 39.3 ns
> (2) 61.9 ns
> (3) 513 ns
> (4) 103 ns
> (5) 446 ns
> (6) 102 ns
> (7) 102 ns
> (8) 2500 ns
> (9) 680 ns
>
> (AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
>
> Note that:
>  GMP is slowest, despite oftrepeated claims about its good performance.
GMP's performance only kicks in for large values  for tests like this
memory allocation dominates.
>  Expression templates make the code slower.
For sure  expression templates are about eliminating temporaries  if the
temporaries are cheap (cpp_int containing small values and no dynamic memory
allocation) then there's no benefit and probably a hit from bringing in all
that machinary. That's why expression templates are disabled for the fixed
size cpp_int typedefs.
On the other hand, if you tried mpz_int without expression templates that
should be even slower than the code with them  in that case eliminating
temporaries does bring a benefit from fewer memory allocations.
>  Even the fastest Boost.Multiprecision type, mp_int128_t, is
> significantly slower than my own quick hack.
>
> This is all a bit disappointing. Perhaps the problem is that I'm dealing
> only with 128 bits and the intended application is for much larger values?
No the problem is that it's a general purpose solution, in almost any
special purpose case it will be slower.
Here's the thing  once the compiler's optimiser gets involved your
multiplication routine is unlikely to ever be beaten by more than a trivial
amount (a few persent at best), it's also sufficiently simple, that even the
costs of a few "if" statements would likely acount for the extra time.
So believe it or not, I'm actually quite pleased it's no more than 2x
slower!
> I am reminded of my experience with xint last year. See e.g. this thread:
>
> http://thread.gmane.org/gmane.comp.lib.boost.devel/215565
>
> The current proposal seems much better  xint was hundreds of times slower
> than my adhoc code  but I still find it discouraging. I suspect that
> the performance difference may be attributable to the use of
> signmagnitude representation; as a result, the implementation is full of
> signtesting like this:
>
> template <unsigned MinBits, bool Signed, class Allocator>
> inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>&
> result, const signed_limb_type& o)
> {
> if(o)
> {
> if(o < 0)
> eval_add(result, static_cast<limb_type>(o));
> else
> eval_subtract(result, static_cast<limb_type>(o));
> }
> }
>
> I think I understand the motivation for signmagnitude for variablelength
> representation, but I don't see any benefit in the case of e.g.
> mp_int128_t.
The sign testing above is because you're adding a signed int, it has nothing
to do with signmagnitude representation, if your code deals only with
positive values, and had you used unsigned integers throughout, then those
tests would disappear.
BTW the code originally used a 2's complement representation for fixed
precision ints  that code is (very slightly) faster if you really want to
do an NxN bit operation, but slower if most of the integers aren't using all
the internal limbs.
> I also wonder about the work involved in e.g. a 64 x 64 > 128bit
> multiplication. Here is my version, based on 32x32>64 multiplies:
>
> inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
> {
> // Make life simple by dealing only with positive numbers:
> bool neg = false;
> if (a<0) { neg = !neg; a = a; }
> if (b<0) { neg = !neg; b = b; }
>
> // Divide input into 32bit halves:
> uint32_t ah = a >> 32;
> uint32_t al = a & 0xffffffff;
> uint32_t bh = b >> 32;
> uint32_t bl = b & 0xffffffff;
>
> // Long multiplication, with 64bit temporaries:
>
> // ah al
> // * bh bl
> // 
> // al*bl (t1)
> // + ah*bl (t2)
> // + al*bh (t3)
> // + ah*bh (t4)
> // 
>
> uint64_t t1 = static_cast<uint64_t>(al)*bl;
> uint64_t t2 = static_cast<uint64_t>(ah)*bl;
> uint64_t t3 = static_cast<uint64_t>(al)*bh;
> uint64_t t4 = static_cast<uint64_t>(ah)*bh;
>
> int128_t r(t1);
> r.high = t4;
> r += int128_t(t2) << 32;
> r += int128_t(t3) << 32;
>
> if (neg) r = r;
>
> return r;
> }
>
>
> So I need 4 32x32>64 multiplies to perform 1 64x64>128 multiply.
>
> But we can't write this:
>
> int64_t a, b ...;
> int128_t r = a * b;
>
> because here a*b is a 64x64>64 multiply; instead we have to write
> something like
>
> r = static_cast<int128_t>(a)*b;
>
> so that our operator* overload is found; but this operator* doesn't know
> that its int128_t argument is actually a promoted int64_t, and it has to
> do an 128x64>128 or 128x128>128 multiply, requiring 8 or 16 32x32>64
> multiplies. The only chance of avoiding the extra work is if the compiler
> can do constantpropagation, but I have my doubts about that.
No: the fixed precision cpp_int's know how many limbs are actually used: so
the same number of multiplies will still be used as in your code. The
problem is partly the "abstraction penalty", and partly all the special case
handling inside cpp_int.
I will look at your test case, but I'd be very pleasantly surprised if
there's any more lowhanging fruit performance wise.
Thanks for the comments, John.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk