|
Boost : |
Subject: Re: [boost] [Multiprecision] Benchmarking
From: John Maddock (boost.regex_at_[hidden])
Date: 2012-06-14 13:01:12
> I have compared the following types used for i64 and i128 in the code
> above:
>
> (1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for
> comparison)
> (2) int64_t and my own "quick and dirty" int128_t.
> (3) cpp_int.
> (4) int64_t and cpp_int.
> (5) cpp_int without expression templates.
> (6) int64_t and cpp_int without expression templates.
> (7) int64_t and mp_int128_t.
> (8) mpz_int.
> (9) int64_t and mpz_int.
>
> Mean times per function call are as follows:
>
> (1) 39.3 ns
> (2) 61.9 ns
> (3) 513 ns
> (4) 103 ns
> (5) 446 ns
> (6) 102 ns
> (7) 102 ns
> (8) 2500 ns
> (9) 680 ns
>
> (AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
>
> Note that:
> - GMP is slowest, despite oft-repeated claims about its good performance.
GMP's performance only kicks in for large values - for tests like this
memory allocation dominates.
> - Expression templates make the code slower.
For sure - expression templates are about eliminating temporaries - if the
temporaries are cheap (cpp_int containing small values and no dynamic memory
allocation) then there's no benefit and probably a hit from bringing in all
that machinary. That's why expression templates are disabled for the fixed
size cpp_int typedefs.
On the other hand, if you tried mpz_int without expression templates that
should be even slower than the code with them - in that case eliminating
temporaries does bring a benefit from fewer memory allocations.
> - Even the fastest Boost.Multiprecision type, mp_int128_t, is
> significantly slower than my own quick hack.
>
> This is all a bit disappointing. Perhaps the problem is that I'm dealing
> only with 128 bits and the intended application is for much larger values?
No the problem is that it's a general purpose solution, in almost any
special purpose case it will be slower.
Here's the thing - once the compiler's optimiser gets involved your
multiplication routine is unlikely to ever be beaten by more than a trivial
amount (a few persent at best), it's also sufficiently simple, that even the
costs of a few "if" statements would likely acount for the extra time.
So believe it or not, I'm actually quite pleased it's no more than 2x
slower!
> I am reminded of my experience with xint last year. See e.g. this thread:
>
> http://thread.gmane.org/gmane.comp.lib.boost.devel/215565
>
> The current proposal seems much better - xint was hundreds of times slower
> than my ad-hoc code - but I still find it discouraging. I suspect that
> the performance difference may be attributable to the use of
> sign-magnitude representation; as a result, the implementation is full of
> sign-testing like this:
>
> template <unsigned MinBits, bool Signed, class Allocator>
> inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>&
> result, const signed_limb_type& o)
> {
> if(o)
> {
> if(o < 0)
> eval_add(result, static_cast<limb_type>(-o));
> else
> eval_subtract(result, static_cast<limb_type>(o));
> }
> }
>
> I think I understand the motivation for sign-magnitude for variable-length
> representation, but I don't see any benefit in the case of e.g.
> mp_int128_t.
The sign testing above is because you're adding a signed int, it has nothing
to do with sign-magnitude representation, if your code deals only with
positive values, and had you used unsigned integers throughout, then those
tests would disappear.
BTW the code originally used a 2's complement representation for fixed
precision ints - that code is (very slightly) faster if you really want to
do an NxN bit operation, but slower if most of the integers aren't using all
the internal limbs.
> I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit
> multiplication. Here is my version, based on 32x32->64 multiplies:
>
> inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
> {
> // Make life simple by dealing only with positive numbers:
> bool neg = false;
> if (a<0) { neg = !neg; a = -a; }
> if (b<0) { neg = !neg; b = -b; }
>
> // Divide input into 32-bit halves:
> uint32_t ah = a >> 32;
> uint32_t al = a & 0xffffffff;
> uint32_t bh = b >> 32;
> uint32_t bl = b & 0xffffffff;
>
> // Long multiplication, with 64-bit temporaries:
>
> // ah al
> // * bh bl
> // ----------------
> // al*bl (t1)
> // + ah*bl (t2)
> // + al*bh (t3)
> // + ah*bh (t4)
> // ----------------
>
> uint64_t t1 = static_cast<uint64_t>(al)*bl;
> uint64_t t2 = static_cast<uint64_t>(ah)*bl;
> uint64_t t3 = static_cast<uint64_t>(al)*bh;
> uint64_t t4 = static_cast<uint64_t>(ah)*bh;
>
> int128_t r(t1);
> r.high = t4;
> r += int128_t(t2) << 32;
> r += int128_t(t3) << 32;
>
> if (neg) r = -r;
>
> return r;
> }
>
>
> So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.
>
> But we can't write this:
>
> int64_t a, b ...;
> int128_t r = a * b;
>
> because here a*b is a 64x64->64 multiply; instead we have to write
> something like
>
> r = static_cast<int128_t>(a)*b;
>
> so that our operator* overload is found; but this operator* doesn't know
> that its int128_t argument is actually a promoted int64_t, and it has to
> do an 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64
> multiplies. The only chance of avoiding the extra work is if the compiler
> can do constant-propagation, but I have my doubts about that.
No: the fixed precision cpp_int's know how many limbs are actually used: so
the same number of multiplies will still be used as in your code. The
problem is partly the "abstraction penalty", and partly all the special case
handling inside cpp_int.
I will look at your test case, but I'd be very pleasantly surprised if
there's any more low-hanging fruit performance wise.
Thanks for the comments, John.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk