Mathematical Constants

Introduction

These files are a collection of very accurate mathematical constants for C ++ (and C) programs.
These are like to be useful even by programs just computing areas of circles!

Mathematical constants (like Archimedes' pi) are exceedingly unlikely to change, whereas estimates of physical ‘constants‘ (like Planck's h) are certain to vary.
Physical constants usually have a known uncertainty, usually greater than even the accuracy of float representations, whereas mathematical constants can be calculated to an arbitrarily high precision.
Therefore to avoid confusion, mathematical and physical constants will never be stored in the same header file.

The mathematical constants collection is convenient in that it does not require a search of reference sources and is useful because it ensures that the values are as accurate as possible and as portable as possible between different systems.

A special feature of this collection is that the constants values are a little more accurate than can be represented by current and foreseeable floating-point hardware. Upper and lower smallest interval values also provided for many popular floating-point formats: these will be useful when using interval arithmetic, for example to estimate computational uncertainty. Interval limit values are exactly representable to aid portability. Small differences in constants can lead to puzzling differences between computations. The values given are slightly, but significantly, more accurate than is usually achieved by using the compiler to calculate constants.

The constants have all been calculated using software working with 300-bit precision giving about 100 decimal digits. (The precision can be chosen and is limited only by compute time). The precision selected (40 decimal digits) exceeds the accuracy of foreseeable floating-point hardware.

The objective is to achieve the full accuracy possible for all real-life computations. This has no extra cost to the user, but reduces irritating, and often confusing and hard-to-trace effects, caused by the intrinsically limited precision of floating-point calculations. At least these show as a spurious least-significant digit; at worst slightly inaccurate constants cause algorithms to fail because internal comparisons just fail.

File Formats

Different applications need constants in different formats. Six types of file are provided:

1. Files which contain C Macros which #define constants and matching #undefs. These are less likely to be useful for C++ programs, but may suit C programs or embedded systems where no inefficiency is using memory can be tolerated.

2. Files that contain float const constants. Useful if only float precision is required.

3. Files that contain double const constants. Most useful for most purposes.

4. Files that contain long double const constants. Useful where long double precision is needed.

5. Files that contain constants defined as functions, so called by pi(), to help compilers optimise for speed.

6. Files that contain class constant template function constants, but called by pi as well as pi(). This form is useful where more than one precision (for example, both float and double) is needed, and to facilitate optimisation for speed by compilers.

7. Files that contain exactly representable upper and lower limits of the smallest interval containing the constant. The limits differ for different floating-point formats depending on the radix or base, and number of significand bits. The limits are selected, by default, using std::numeric_limits<floating_point_type>::digits and std::numeric_limits<floating_point_type>::radix values, if available.

More than one header file may be used, but including two files in the same module will lead to confusing results. This might happen implicitly by including a header that in turn includes another constants header file. No protection against this is provided and may result in baffling compilation errors and/or unexpected selection of constant representation. You have been warned!

All files contain exactly the same constants and identical numeric values with 40 decimal digits (except interval values which have enough decimal digits to be exactly representable).

Which constants?

The choice of constants is necessarily arbitrary and inevitably contentious. The constants chosen are widely used in calculation of mathematical functions and include most of those used by several authors of mathematical function libraries or texts referenced below, for example Knuth, Cody, Morris, Hart and Moshier.

To avoid including an excessive number of unwanted constants, the constants are subdivided into several groups of files, whose exact distribution and content is yet to be finally agreed.

The first group contains the most fundamental constants that cannot be easily calculated, like pi and e.

The second group contains constants that are widely used. Although compilers may calculate them, some compilers may lose some accuracy for some constants.

Finally more obscure constants, and those that can probably be calculated without loss (for example, 2. * pi) but are commonly used and so are convenient. Note that upper and lower interval limits are especially at risk of being inaccurately calculated using limits of more fundamental constants, with multiplication and division by factors of the radix the only safe exception. Explicitly calculated exactly representable values are safer.

How to use these files?

The file define_constants.h consists of a list of C style MACROs that define a collection of mathematical constants very accurately, for example:

   #define BOOST_PI  3.14159265358979323846264338327950288L /* pi */

Individual MACRO value(s) can be used directly, for example:

  long double const PI = BOOST_PI;

Or with casting to other types, for example:

   double const PI = (long double)BOOST_PI;

(Note that to avoid any loss of precision for long double definitions, the MACRO constants are defined as long double by the suffix L, but this means that casting is needed or desirable to avoid errors or warnings for less precise types double or float.)

  double const pi = (double)BOOST_PI; /* C style cast */

Or C++ static cast down to double and/or float precisions.

long double const pi = static_cast<double>(BOOST_PI); // C++ static_cast
long float const pi_f = static_cast<float>(BOOST_PI); // C++ static_cast

To document helpfully, it is sensible to static_cast all definitions, for example:

  long double const pi_f = static_cast<long double>(BOOST_PI);

(Note one needs two different names, pi and pi_f, if different precisions must coexist (unless one uses constant class template functions described below).

Unlike most header files, the macros file can be used in many ways in C and C++ (and indeed in other languages).
(Note that is it deliberately a .h file and not the Boost convention of a .hpp file. This is to aid it to be used in C programs, as well as C++ programs).

Corresponding files that contain matching #undef statements called undefine*.h are also provided to reduce namespace pollution and the risk of name clashes or confusion.

2 Files (of type .hpp) containing constants as const float, const double and const long double values can be #included

or, dare one suggest, individual literal value(s) strings can be copied and pasted into programs, for example:

  float const pi = 3.14159265358979323846264338327950288F;

(Note that the suffix is F to avoid the need to static_cast long double to float).

long double const pi = 3.14159265358979323846264338327950288L;

(Note that the suffix is L because the literal should be a long double).

The header file constants.h can be included in a separate translation unit:

  #include “constants.h”

  extern long double const pi = static_cast<double>(BOOST_PI); // C++ static cast

  extern long double const e = static_cast<double>(BOOST_E); // C++ static cast

will make only pi and e visible to other translation units through the linker. The names can be chosen to avoid name clashes.

4 The simpler scheme defining constants within a translation unit

  long double const pi = static_cast<double>(BOOST_PI); // C++ static cast

may cause some name clashes, so a matching file undef_constants.h is provided to undefine ALL the macros defined by define_constants.h. This undefined file should be included immediately after the last reference to any BOOST macro constant.

5 When multiple precision constants are needed in the same program, (and namespace pollution is to be avoided) class constant template functions in files like function_constants.hpp can be used:

  // Select an entire set of constants with a single using directive:

  using namespace boost::math::double_constants;

  // Or float constants...

  // using namespace boost::math::float_constants;

  // So naive users (mea culpa) can just say "pi",

  // and get boost::math::double_constants::pi.

  cout << "pi = " << pi << endl;

  double r = 2.;

  cout << "area = " << pi * r * r << "\n";

  // Naughty users CAN'T write pi = 3.; !!! // - they get a cryptic error message:

  // C2678: binary '=' : no operator defined which takes a left-hand operand of type

  // 'const struct boost::math::constant<struct boost::math::pi_tag,double>'

  // Or they can explicitly access fully specified constants at any time:

  cout.precision(ceil(1 + std::numeric_limits<double>::digits*log10(2.)));

  // Ensure all significant digits for double are displayed, or

  cout << "double pi = " << boost::math::double_constants::pi << endl;

  cout.precision(std::numeric_limits<double>::digits10);

  // Ensure all guaranteed digits for double are displayed.

  cout << "double pi = " << boost::math::double_constants::pi << endl;

  cout.precision(ceil(1 +std::numeric_limits<long double>::digits) *log10(2.)) ;

  cout << "long double pi = " << boost::math::long_double_constants::pi << "\n";

  using namespace boost::math;

  cout.precision(ceil(1 + long_double_significand_digits * log10(2.));

  cout << "float pi = " << float_constants::pi << endl;

  cout.precision(ceil(1+std::numeric_limits<double>::digits10)*log10(2.));

  cout << "double pi = " << double_constants::pi << endl;

  // But CANNOT switch to float constants.

  // using namespace boost::math::float_constants;

  // cout << "pi = " << pi << endl; // error C2872: 'pi' : ambiguous symbol

  // This is probably a useful safeguard.

  // For types that are not specialised, one could write:

  const short pi_I = (short)boost::math::float_constants::pi;

  cout << "short pi = " << pi_I << endl;  // 3!

  // Access via the templated function call may allow better optimisation.

  float pi_float = boost::math::constant<pi_tag, float >();

  // If a template is not specialised, then it fails at link time.

Michael Kenniston devised this scheme (after much discussion on Boost.org) for the following reasons:

Only need to write one interface (not counting macros-for-C, which is really a separate library from the C++ stuff)

1. No macros: all names are in namespaces.

2. Average users can switch "pi" between float and double by changing one "using" directive.

3. No "double rounding" (i.e. no multiple conversions).

4. Platform implementers can use whatever evil incantations they desire for maximum accuracy.

5. Everything is implemented with inline functions to ease constant folding.

6. It is (relatively) easy to extend with new constants and representation types.

7. Purists can access all constants via templated functions.

8. No "static-initialization-order" issues, because the static variables contain no data to initialise.

9. No partial template specialization, so it works on MSVC.

10. No non-type template parameters, so it should work on Borland.

(The same basic techniques will probably work for converting read access to any statically allocated class object to an inline function call, including in particular physical constants with units/dimensions).

The disadvantage of this method is that some compilers in strict mode seem to insist on explicit definition of the empty class constant and this leads to much clutter and potential code-bloat because a constructor is instantiated for each and every of the many constants. Some compilers are able to optimise these empty and unused constructors away, but may not do so in debug mode. Other compilers (and some in non-strict mode) do not require the definition of the empty constant class. The C++ Standard ISO 14882:1998 does not make clear whether this is a language ambiguity or an excessive compiler requirement. However, the net result is that this method may not be ideally efficient or portable.

6 C++ allows other methods of avoiding name collisions, for example using named and unnamed or anonymous namespaces.

Naming Convention

Discussion about naming convention made very clear that it is not possible to please everyone!

In general, a Standard Library C++ style is used, favouring longer and easier to read names over ease of typing, and all lower case and underscore _ as separator. Of course, names cannot start with a digit, so words, like two and half, are used instead. C99 names are used where appropriate, but names are not in all capitals as used in many C programs using the rationale that these names are variable names and not macro or #define names. Some examples, and the rationale for them, should make this clearer than an attempt to write a formal specification. Of course there are always problems with more complex constants where brackets would be used.

pi // Prefer lower case only - use uppercase only for class_names.
ln_pi // clearer than lnpi
sqrt_2 // shorter than sqrt_two and clearer than sqrt2.
// sqrt is commonly used and shorter than square_root_2.
cbrt_2 // ugly, but is C99 name.
ten // can’t be 10 so use word instead.

third // shorter than one_div_three.
one_div_sqrt_2 // one_div rather than reciprocal – short but clear.
// _div rather than _on or _upon.
half_pi // rather than pi_div_2.
pi_sqr // pi_squared rather long.
two_pow_three_halves // two_pow_three_div_two is ambiguous.
minus_ln_ln_2 // can't start with - so use word.

Explicit typed names

pi_f // float

pi_d // double

pi_l // long double

pi_i // integer

Interval limit static const constant variable names

pi // nearest to pi (usually either upper and/or lower interval limit)

pi_l // pi interval lower limit

pi_u // pi interval upper limit

pi_f_l // pi float lower limit

pi_l_l // pi long double lower limit

pi_l_u // pi long double upper limit

Why not use the Compiler and/or Preprocessor?

The advantages of pre-computed constants are in accuracy and precision, reliability, and most important in portability.

It may be worth using single constants rather than using the preprocessor and/or compiler to calculate a derived value. For example, it may be a little better to use third_pi rather than pi /3. This is because the single constants are produced with a higher precision than the floating-point hardware and/or software that the compiler is also certain to use for its computation. (The differences are especially significant for trig and pow functions!)

The constants are entirely fixed at compile time, so there is no code to call functions, for example, log, nor the need to include header files which define those functions, for example <cmath>. This may speed compilation and reduce file accesses and dependency, and facilitate optimisation. When debugging code (perhaps with less optimisation) the constant is never calculated at run-time. Behaviour when debugging is independent of compiler type, version and compile options.

There is no evaluation of the function at program start time, nor wasted evaluation if the constant is never used, so programs will be a bit quicker.

Portability is improved because as there is no compile or run-time evaluation of constants, there can be no difference caused by different versions of functions, for example in different version of C math libraries, nor in different version of hardware floating point evaluation. Although in reality, this is rather unlikely to be a problem, the effort and time wasted proving that it is not a problem is likely to be significant.

The most potentially serious loss of accuracy occurs with transcendental functions: most seriously, the exponentiation function x raised to power y, pow(x, y) for x**y or x^y where y may not be an integer.

Although compilers should be able to calculate constant expressions, for example 1. / 3., they may do so with varying accuracy. To achieve the accuracy of the pre-computed constants, the compiler would need to use similar high-precision software rather than the natural hardware built-in types like double. Few if any compilers yet do this, and doing this would significantly slow compilation. Compilers may also only do these calculations when optimisation is selected, giving confusingly slightly different results for debug and release versions. If new compilers do calculate more accurate values, then results from recompiled code may differ slightly, but mysteriously.

If the Standard C float.h is available, the macro FLT_MANT_DIG may provide the number of digits (usually radix 2) in the mantissa (significand) that limits precision (but not the maximum or minimum values that are constrained mainly by the maximum exponent). (Macro digits and mueric_limits<FPType>::digits include any hidden or implicit significand bits as these contribute to precision, so it is not sufficient to count the bits in the floating point layout.)

  /* The number of base FLT_RADIX digits in the mantissa. */

  #define _DBL_RADIX      2    /* exponent radix */

  #define DBL_MANT_DIG    53   /* # of bits in mantissa */

  #define LDBL_MANT_DIG   64   /* # of bits in mantissa */

The same values are available using Standard C++ <limits>, for example as:

 int double_radix = std::numeric_limits< double>::radix; // usually 2 for IEEE FP

 int double_mantissa_digits = std::numeric_limits<double>::digits;

 // Usually 53 for IEEE 64-bit double.

 int double_digits10 = std::numeric_limits<long double>::digits10;

 // 15 decimal digits guaranteed output correctly for 64-bit double.

(But the 128-bit representation used by the AIX OS warns that that this value, defined as 106-bit for AIX long double, may be not tell the whole story – the actual number of bits of precision can vary. There are even worse difficulties in calculating interval values for non-standard floating point formats and values given may not be fully reliable.)

To ensure no loss of accuracy when reading in, all floating point constants should be specified with at least a few decimal digits MORE than the value of ISO Standard 14882:1998 C++ Library limits for std::numeric_limits<type>:: which are only the number guaranteed to be correct on output . For example std::numeric_limits<double>::digits10 (equivalent to Standard C Library float.h DBL_DIG) is 15 for double using IEEE 754/IEC559 standard 64-bit real, 11-bit exponent, 52+1 = 53 bit significand (mantissa) using Motorola 680X0 or Intel 8087 floating point coprocessor as incorporated in all Pentium chips. So double constants must be specified with at least 17 decimal digits. (Even many more decimal digits may be required to ensure that interval values are exactly representable – see below).

William Kahan in his report on the status of IEEE754 in 1996 (reference 23) gives two significant digits values for each format:

The lower value is at least
floor ((n - 1) * log10(2) significant digits, std::numeric_limits< >::digits10)
and the upper,
ceil(1 + n * log10(2)) significant digits,
where n is number of significand bits (std::numeric_limits< >::digits).

For float, floor (24 –1) * 0.301 = 6 &
ceil(1 + 24 * 0.301) = 9,

& for double 15 and 17, extended (long double?) 18 & 21, quadruple 33 & 36). For example, for float:

If a decimal string with at most the lower significant digits (6) is converted to float precision and the converted back to the same number of significant digits (6), then the string should match the original 6 digits. This is

int digits10 = int(floor(-1) std::numeric_limits<float>::digits * log10(2.))

that should be the same as

std::numeric_limits<float>::digits10

and macro value FLT_DIG.

If a float value is converted to a decimal string with at least the upper number of significant digits (9 for float), and then converted back to float, then the final number must match the original float number. For example, using Standard C++

#include <iostream>
#include <limits>

double const log10Two = 0.3010299956639811952137388947; // log10(2.)
pi = 3.1415926535897932384626433832795F;
int sig_digits10 =
int( ceil(1 + std::numeric_limits<float>::digits * 0.3010299956639811952137388947));
std::cout.precision(decimal_digits);
std::cout <<"float pi = " << pi << " using " <<
sig_digits10 << " decimal digits" << endl;

(But note that MACRO calculations only use integer arithmetic and cannot use floating-point functions ceil and floor. An approximation using only integer arithmetic is

2 + MANT_DIG * 30103 /100000

that gives the expected result for the significand digits listed below).

Note the asymmetry: the upper value is the number of digits that are significant on output and are required for accurate input, the lower value is the number guaranteed correct on output. The practically useful precision is given by bits/log10(2.)

Floating point Type	Often used for C/C++ type	Total bits	Significand bits (+ 1 if an implicit bit)	guaranteed digits10 decimal digits	significant decimal digits
IEEE single	float	32	23 + 1 = 24	6	9
VAX F	float	32	23 + 1 = 24	6	9
IBM G/390 short	float	32	24	6	9
IEEE single extended	?	>=43	>=32	7	11
IEEE double	double	64	52 + 1 = 53	15	17
VAX G	double	64	52 + 1 = 53	15	17
IBM G/390 long	double	64	56	15	17
VAX D	long double	64	56	15	17
IEEE double extended	long double	80	>=64	19	21
Sparc doubleextended (x86)	long double	80	64	18	21
AIX quad	long double	128	106	31	33
NTL quad	‘quad’	128	106	31	33
IBM G/390 extended	long double	128	112	33	35
VAX H	long double	128	112 + 1 = 113	34	36
IEEE quadruple	long double	128	112 + 1 = 113	34	36
Sparc double extended	long double	128	112 + 1 = 113	34	36
signed fractional	?	127	128	38	40
unsigned fractional	?	128	128	38	40
unsigned fractional	?	128	128 + 1 = 129	38	40

Semi-exhaustive tests using MSVC++.NET (compiler version 7.0) with C++ float and double types confirm that these numbers of significant digits can be output and re-input as expected.

For example, the following code does not fail the assert test for any normalised float value.

        float a = 1.F;

        float aa = 0;

        std::stringstream s;

        s.precision(9);

        s << scientific << a;

        s >> aa;

        assert (a != aa);

All the constants here are believed reliable to 40 decimal digits and so should meet all these floating point formats, including C++ long double extended precision if the compiler uses IEEE 80-bit real, 15-bit exponent, 64-bit significand (or mantissa) format for which constants will need 21 decimal digits, and will even meet the requirements for 128-bit real arithmetic providing significand (mantissa) accuracy of 113-bit used by the Sun Sharc processor, and the AIX OS 106-bit, and floating-point types which can be emulated in software using, for example, C++ NTL quad_float class by Victor Shoup (derived from the doubledouble class of Keith Briggs).

The Intel Itanium IA-64 128-bit quad_precision is planned to be implemented in hardware (with 112-bit significand, 15 exponent, and sign), slightly more precise than 36 decimal digit constants required by other formats above. See www.intel.com , ADAG.pdf for IA-64 Application Developers Architecture Guide.

If all 128 bits are used as a fraction or significand (perhaps without sign and with an implicit most significant bit), 40 decimal digits would still be sufficient.

Floating-point hardware is unlikely to provide higher precision for some time to come. (For applications that require a higher precision, several software solutions are available, for example NTL, MPFUM, GMP, LIDIA - see references below).

Sadly, but perhaps inevitably, the ISO C++ standard 14882:1998 [2.13.3.1] states:

" [...] If the scaled value is in the range of representable values for its type, the result is the scaled value if representable else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. [...] "

The phrase “implementation-defined manner” implies that the compiler is not required to round to the nearest internal representation, although all compilers used so far appear to round as expected.

The effect of this is that if the number is not exactly representable (for example, if radix = 2 then 10. is not, but 2. is) then different implementations may convert the same decimal digits to binary values that differ by one least significant bit (or digit if radix is not two, for example ten or sixteen). The constants interval upper and lower limits are these two values, differing by at most one least significant bit (but the identical if the number, for example 2, is exactly representable). To ensure that they are converted from decimal digits to the internal format, these interval limits must be exactly representable. Many decimal digits may be required to achieve this – nearly the number of bits in the significand.

So with sufficient decimal digits, compilers should convert constants from decimal digit characters to their internal floating-point format within 1 least significant bit (or digit).

Not all compilers achieve this and so some programs, for example the Cephes mathematical functions library (reference 17), have provided constants (especially polynomial coefficients) in hexadecimal format (also proposed for future version of C and C++). This ensures that the values are stored exactly, but also requires different hexadecimal values for each processor and floating-point format.

The method of providing decimal character digits is more portable but it does rely on compilers making the conversion from decimal character digits to internal floating-point binary format conforming to the C++ Standard. In both cases, the results of storage and computation depend on the floating-point representation and so none are fully portable, as explained below.

(The unusually large number of decimal digits might cause overflow of non-compliant compilers that should accept, but ignore, unlimited numbers of digits. The only workaround for constant values is to round the values by hand or to use hexadecimal constants for the intervals. This package does not attempt to cater for this erroneous behaviour.)

Hardware Built-in Constants

Intel X86 floating-point hardware, for example, includes instructions for providing floating-point constants pi and log base 2 (10). These are accurate to 64 significand bits, including guard and sticky bits, which is higher than for any constant read into a register or memory by the compiler, even from a more precise constant decimal digit string.

Internally, the Intel X86 floating-point hardware holds all numbers in extended precision real format (64 significand bits compared to external double of 53 significand bits). Extra bits are mainly to support rounding mode correctly. Calculations performed using floating point register to register operations will be performed with this 64-bit accuracy. But operations that use external memory will have lower accuracy. The compiler makes the choice between register and memory, so expecting completely portable accuracy is probably too optimistic. Using accurate literal constants probably gives the best chance of portability (but not necessarily the best possible accuracy).

Should calculations be exactly the same on all compilers and platforms?

Alas the complexity of floating point calculations is such that calculations cannot be absolutely guaranteed to give the same result, even if both use hardware that conforms to the IEEE 754/IEC559 floating-point specification. This is most likely to be a significant problem when iterative calculations at the limit of accuracy are involved, perhaps leading to failure to converge, or converging to a different solution. These differences are most troublesome when trying to prove the equivalence of difference systems for computer validation. Because one can predict that differences may arise, one should not pessimistically assume that different output means that either systems is faulty, nor assume that the systems are not equivalent.

One method of ensuring that the IEEE extended double precision floating-point (typically the Intel X86/Pentium) is not used is to alter the appropriate bit in the processor “control word”. This can only be done in assembler. Extended precision significand is 64-bit, double is only 53-bit, and memory values are always 53-bit (plus sign and exponent to total 64 bits). (See www.intel.com 27064004.pdf for 80C187 80-bit Math Coprocessor datasheet).

The default value of the code word bit is NOT extended for Microsoft Windows, but IS extended for gcc/Linux.

The third way to fix the problem is to 'force' all intermediate floating point results into memory. This is not an 'ideal' fix, since it is not fully equivalent to 53-bit precision (because of double rounding), but it is reported to work (although a full proof of correctness in this case is not available). See X86ControlWord.cpp for an example. C99 promises to provide a portable way of setting the precision.

NTL's quad_float code ensures use of 53-bit significands on all platforms by storing intermediate results in local variables declared to be 'volatile'. This is the solution to the problem that NTL uses if it detects the problem and can't fix it using the GNU 'asm' hack mentioned above. This solution should work on any platform that faithfully implements 'volatile' according to the ANSI C standard. See www.shoup.net for more details.

Note that MSVC++ compilers up to version 7 implement long double exactly as double, IEEE 64-bit floating-point format. (The compiler does NOT use IEEE 80-bit extended precision available within the Intel Pentium or X86 floating point co-processor processor family instruction set).

C/C++ Pre-processor should calculate any long integers using 36 bits (P J Plauger, The Standard C Library, p 78) but none of the constants in constants.h are currently of integer type. (Some are floating point versions of integer values, for example: unity, zero, two, ten…).

Validation

For most values, Victor Shoup's NTL has been used to compute (and/or cross-check) the values in this file with a significand of 300 bits (about 100 decimal digits) and values are output as 40 decimal digit strings (or more digits if exactly representable interval limit values). There is no typing or editing to introduce errors.

The numeric string values in the file constants.h are checked either one or both of two ways:
1 Checks that the values are consistent with values calculated by the compiler preprocessor.
2 Comparison with values in references tables: for example in D. E. Knuth, Appendix A (scanned in from the printed book and converted to a text file), or M. Abramowitz & Irene A Stegun.

Some values are be tested with straightforward checks against compiler preprocessor calculated values, for example comparing:

if (twoPi != 2. * pi) // may have a problem?

or C macro assert (which causes program termination if it fails - a more noticeable conclusion, but one that does not allow one to continue testing other constants).

assert(twoPi == 2. * pi)

but many naively formulated tests appear to fail to the extent that the two values differ by the machine least significant significand bit, so instead it is better to test the absolute relative difference between values thus:

if (fabs(twoPi - 2.L * pi)/ twoPi >= epsilon) // Do have a problem with accuracy!

See D E Knuth and Alberto Squassabia for more guidance on comparison of reals.

(Epsilon is the smallest floating-point representation of the least-significant significand bit. See below.)

The equivalent using the C macro assert is:

assert(fabs(sqrtTwo * sqrtTwo - 2.)/2. <= eps);

It is best to choose the constant to divide by to make the test relative to the constant value, rather than recomputing the check value.

Typical epsilon for type double for Intel X86 IEEE754/IEC559 hardware is 2.22e-016.

Minimum values specified by ISO 14882:1998 section 18.2.1 are defined as equivalent to the C Standard constants:

#define FLT_EPSILON <float rvalue <= 10^(-5)>
#define DBL_EPSILON <double rvalue <= 10^(-9)>
#define LDBL_EPSILON <long double rvalue <= 10^(-9)>

The macros yield the smallest X of type float, double or long double such that 1.0 + X != 1.0.

C++ versions of these are provided by the ISO 14882:1998 Standard numeric limits template, for example:

std::numeric_limits<float>::epsilon();
std::numeric_limits<double>::epsilon()
std::numeric_limits<long double>::epsilon();

The constants interval upper and lower limits should meet the following shown as an example for pi:

static const double pi_d = 3.1415926535897932384626433832; // closest.

static const double pi_d_l = 3.141592502593994140625; // lower &

static const double pi_d_u = 3.1415927410125732421875; // upper limit.

(pi_d_l == pi_d) || (pi_d_u == pi_d) // must equal upper or lower, or both

and

float ul_pi = pi_d_l + pi_d_u; // upper + lower

(ul_pi < 2.f * pi_d_u) && (ul_pi > 2.f * pi_d_l) // only one bit different

Version

This is shown in the header file comments, and also includes a timestamp and the version of NTL used.

Author

Paul A Bristow, pbristow@hetp.u-net.com, Oct 2002

Glossary

Floating-point – real type stored as sign, exponent and significand. Unlike integer types, most floating-point values cannot be stored exactly.

Exponent –power of 2 by which to multiply the significand (usually binary but strictly radix or base returned by std::numeric_limits< floating_point_type>::radix). Always signed, using one bit, and often offset or biased for storage convenience.

Mantissa – see significand.

radix – base for exponent, usually 2, (decimal and hexadecimal systems exist, but are not yet supported by this package).

Significand – (previously called mantissa) binary fraction where most significant bit represents 0.5, next 0.25, then 0.125 … The most significant bit (0.5) is usually not stored and is implicitly assumed to be set, so the significand always represents a fraction greater than 0.5, (but just less than 1). std::numeric_limits<floating_point_type>::digits returns this, including any implicit bit.

least significant bit – of significand – also called “unit in last place” ulp. (decimal or hexadecimal if radix or base is 10 or 16).

functions nextafter (float double and long double versions) changes the least significant bit but adding or subtracting 1 to the binary significand.

References & Sources

T. J. Dekker, Point Technique for Extending the Available Precision,
Numer. Math. 18 (1971), pp. 224-242.
S. Linnainmaa, Software for doubled-precision floating point computations
ACM TOMS 7, 172-283 (1981).
D. Priest, On properties of floating point arithmetics: numerical stability
and the cost of accurate computations. Ph.D. Diss, Berkeley 1992.
and more references in http: www.cs.wisc.edu/~shoup/ntl/quad_float.txt.
M. Abramowitz & Irene A Stegun, Handbook of Mathematical Functions,
Applied Math Series vol 55, National Bureau of Standards,
US Gov Printing Off, Washington DC, 1965 p 948 – 972.
Note that NIST project is in progress for a major revision of this. See http://dlmf.nist.gov .
Donald E. Knuth, The Art of Computer Programming,
Volume 1 Fundamental algorithms, Appendix A, P 619
ISBN 0 201 89683 4.
doubledouble implements doubled-double arithmetic (>30dp) on
IEEE 754 floating-point hardware in (not Standard) C++.
by Keith Briggs kmb28@cam.ac.uk. Definitive site for this code is:
http: www-epidem.plantsci.cam.ac.uk/~kbriggs/doubledouble.html
is used in NTL which is used to verify some of these constants.
Victor Shoup, NTL, www.shoup.net/ntl/ Numeric Template Library. Excellent arbitrary precision C++ function library. Used to calculate the constants in this file.
LiDIA, Darmstadt University of Technology
www.informatik.th-darmstadt.de/TI/LiDIA/
Plouffe's Tables of Constants, Nov 20 1999, by S Plouffe,
with rather more than ample decimal digits for this purpose!
www.lacim.uqam.ca/pi/table.html
P J Plauger, The Standard C Library, p 78 ISBN 0 13 131509 9 Prentice-Hall(1992).
David Goldberg,"What Every Computer Scientist Should Know About Floating Point Arithmetic", March 91, Computing Surveys.
http://www.acm.org/pubs/citations/journals/surveys/1991-23-1/p5-goldberg/
http://docs.sun.com/htmlcoll/coll.648.2/iso-8859-1/NUMCOMPGD/ncg_goldberg.html
www.intel.com 27064004.pdf for 80C187 80-bit Math Coprocessor datasheet).
AIX Version 4.3 General Programming Concepts, 128-bit long double float-point data type.
http://www.opengroup.org/onlinepubs/007908799/xsh/math.h.html
POSIX math.h provides some double(but NOT long double) precision constants.
ftp://ftp.ccs.neu.edu/pub/people/will/howtoread.ps
William D Clinger, In Proceedings of the 1990 ACM Conference on Principles of Programming Languages, pages 92-101.
How to read Floating-point accurately.
Abstract: Consider the problem of converting decimal scientific notation for a number into the best binary floating-point approximation to that number, for some fixed precision. This problem cannot be solved using arithmetic of any fixed precision. Hence the IEEE Standard for Binary Floating-Point Arithmetic does not require the result of such a conversion to be the best approximation.
This paper presents an efficient algorithm that always finds the best approximation. The algorithm uses a few extra bits of precision to compute an IEEE-conforming approximation while testing an intermediate result to determine whether the approximation could be other than the best. If the approximation might not be the best, then the best approximation is determined by a few simple operations on multiple-precision integers, where the precision is determined by the input. When using 64 bits of precision to compute IEEE double precision results, the algorithm avoids higher-precision arithmetic over 99% of the time.
John F. Hart, Computer Approximations, Kreiger (1978) ISBN 0 88275 642 7, table of constants pages 334 - 335.
Cephes Mathematical Library, Stephen L. B. Moshier, www.netlib.org/cephes/ and http://people.ne.mediaone.net/moshier/index.html - in C. A perl interface and a DOS calculator also available.
Methods and Programs for Mathematical Functions, Stephen L. B. Moshier, J Wiley, (1989) ISBN 0 7458-0289-3 & 0-470-21609 3 & 0 7458 0805 0.
class constant based on a post by Michael Kenniston: http://groups.yahoo.com/group/boost/message/14867
Software manual for the elementary functions. William J. Cody, Jr. and William Waite, Prentice-Hall series in computational mathematics, Englewood Cliffs, N.J., Prentice-Hall. 1980 ISBN: 0138220646.
http://http.cs.berkeley.edu/~wkahan/ home page of William Kahan, pioneer of IEEE754 specification.
http://http.cs.berkeley.edu/~wkahan/imporber.pdf, The Improbability of PROBABILISTIC ERROR ANALYSES for Numerical Computations.
http://http.cs.berkeley.edu/~wkahan/names.pdf gives useful definitions and explanations of radix, exponent, etc.
http://http.cs.berkley.edu/~wkahan/ieee754status/ieee754.ps page 4 gives significant digits for real formats.
Alberto Squassabia, C++ Report Archives, March 2000 http://www.joopmag/html/from_pages/crarticle.asp?ID=396 & 397
Knuth, Donald E., Art of Computer Programming,
Programming Languages - C++, International Standard ISO/IEC 14882-1998,
Programming Languages – C, ISO/IEC 9899:1990
Format of IBM S/390 processors, E. M. Schwarz & C. A. Krygowski, http://researchweb.watson.ibm.com/journal/rd/435/schwarz.html
IEEE_754 Floating Point formats, Christopher Vickery http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html
Physical constants and SI units http://physics.nist.gov/cuu/Units/bibliography.html
W. E. Brown, http://home.fnal.gov/~wb/SItempl8.pdf
Sun Numerical Computation Guide, http://docs.sun.com/source/806-3568/ncg_lib.html

Acknowledgements:

The following contributed to discussions on the BOOST.org forum:
K Hagan
Petr Kocmid
P J Plauger
Herve Bronnimann
John Maddock
Peter Schmittteckert
Jes Maurer
Peter Dimov
Beman Dawes
David Abrahams
Ed Bray
Matt Austern
Gabriel Dos Reis
Eric Ford
Michael Kenniston
John Hickin
Sylvain Pion
Guillaume Melquiond

Author: Paul A. Bristow, hetp Chromatography Quantitation,