
Boost : 
Subject: [boost] [multiprecision] Some Performance gains
From: John Maddock (boost.regex_at_[hidden])
Date: 20120802 14:27:23
Following the review of the multiprecision lib, I've been experimenting with
some of the issues raised, currently focusing on performance improvements.
Anyhow, I now have some results for improved *front end* performance, I
haven't tried to seriously optimize the cpp_int backend yet (which was one
of the criticisms), but for more on that see below.
So... by way of measuring front end performance  the "abstraction penalty"
if you like I've used a special backend type "arithmetic_backend<A>" where A
can be any builtin arithmetic type. For tests I've used Boost.Math's
special function test data for floating point types, and Phil Endecott's
Delaunay code for integer performance.
Win32 VC10 first:
Delaunay test:
int64_t :
9.45241
mp_number<arithmetic_backend<int64_t>, false> : 9.66815
Bessel Functions Test:
Time for Bessel Functions  double = 0.015859 seconds
Time for Bessel Functions  arithmetic_backend<double>  no expression
templates = 0.0225565 seconds
Time for Bessel Functions  arithmetic_backend<double> = 0.0376449 seconds
Non Central T:
Time for Noncentral T  double = 0.456739 seconds
Time for Noncentral T  arithmetic_backend<double>  no expression
templates = 0.459957 seconds
Time for Noncentral T  arithmetic_backend<double> = 1.08256 seconds
Ubutunu Linux x64, GCC4.6.1:
Delaunay Test:
int64_t 2.3969
mp_number<arithmetic_backend<int64_t>, false> 2.38334
Bessel Functions Test:
Time for Bessel Functions  double = 0.0224617 seconds
Time for Bessel Functions  arithmetic_backend<double>  no expression
templates = 0.020384 seconds
Time for Bessel Functions  arithmetic_backend<double> = 0.0220476 seconds
Non central T:
Time for Noncentral T  double = 0.622985 seconds
Time for Noncentral T  arithmetic_backend<double>  no expression
templates = 0.554527 seconds
Time for Noncentral T  arithmetic_backend<double> = 0.800129 seconds
So to all intents and purposes there's now no abstraction penalty at all,
except when using expression templates (which should be reserved for
heavyweight backends anyway).
Notice how the mp_number code is sometimes slightly faster, and sometimes
slightly slower than the native type  I've observed that completely trivial
changes to the test/measuring code can change this around  I suspect, but
can't prove that the abstraction layer changes the decisions made by the
compiler's optimizer resulting in somewhat capricious differences between
results.
With regard to the cpp_int backend, I still have work to do, however things
do actually look pretty good on Linux x64 for the Delaunay test:
Running calculations for: int64_t, int128_t (this is Phil's 128bit
specialcase code)
Number of flips = 23892000
Total execution time = 3.28337
Running calculations for: int64_t, mp_int128_t
Number of flips = 23892000
Total execution time = 3.05434
Running calculations for: int64_t, mp_int128_t (ET)
Number of flips = 23892000
Time per calculation = 3.02904e08
Running calculations for: int64_t, cpp_int
Number of flips = 23892000
Total execution time = 5.41737
On Win32, I'm still a touch behind Phil's code, however, I haven't written a
dedicated NxN>2N multiplication routine (yet):
Running calculations for: int64_t, int128_t
Number of flips = 23892000
Total execution time = 14.2189
Running calculations for: int64_t, mp_int128_t
Number of flips = 23892000
Total execution time = 24.5114
Running calculations for: int64_t, mp_int128_t (ET)
Number of flips = 23892000
Total execution time = 19.741
Running calculations for: int64_t, cpp_int
Number of flips = 23892000
Total execution time = 33.5
All the new code is in the sandbox, but I haven't documented any of the
changes yet...
Regards, John.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk