# Boost :

From: rogeeff (rogeeff_at_[hidden])
Date: 2001-12-09 22:36:42

Hi, all

I have couple question to ask and couple statement to confirm:

1. There are only three sources of measurable floating point rounding
errors:
* Conversion from decimal presentation to binary
* Type promotion
* Arithmetic operations

There also non-measurable errors while using non-arithmetic
operations.

2. The value of rounding error does not exceed:

1/2 * ULP(Unit in Last Point)

ULP value is defined by binary format of floating point type. The
float, for example, has ULP = 2^-23, for precision of float normal
numbers consist of 24 bits (23-bit fraction combined with the
implicit leading significant bit). From definition of ULP follows
that ULP is a minimum value such that 1 + ULP != 1

The source of 1/2 is obscure to me. Could it be 1?

3. It seems that calculation using float numbers simetimes have
better precision (less relative error) than double. How could this be?

4. To calculate a relative error I am using following formula:

|v1 - v2| / |v1|. As you can see it involves 2 additional arithmetic
operations. So, should I add 2 to nmuber of arithmetic operation
involved when calculating tolerance value?

5. Finnaly lab work. Could you check an output from the following
program and explain why relative error exceed expected values:

#include <limits>
#include <stdio.h>

// 1 rounding error = 1/2 * ULP
template<typename FPT>
void
foo0( FPT dummy ) {
FPT e = std::numeric_limits<FPT>::epsilon();
sqrt
FPT d1 = static_cast<FPT>( 1.1 ); // 1 rounding errors

FPT d2 = (d1*10 - 11)/d1;

printf( "%20.20e\n", d2/e ); // expected value < 0.5
}

template<typename FPT>
void
foo1( FPT dummy ) {
FPT e = std::numeric_limits<FPT>::epsilon();

FPT d1 = static_cast<FPT>( 1.1 ); // 1 rounding errors
FPT d2 = d1 * d1; // 1 rounding errors
FPT d3 = static_cast<FPT>( 1.21 ); // 1 rounding errors

FPT d4 = d2 - d3;
FPT d5 = d4 / d2;

printf( "%20.20e\n", d5/e ); // expected value < 1.5
}

template<typename FPT>
void
foo2( FPT dummy ) {
FPT e = std::numeric_limits<FPT>::epsilon();

FPT d1 = static_cast<FPT>( 1.1 ); // 1 rounding errors
FPT d2 = d1 * d1; // 1 rounding errors
FPT d3 = d2 - d1; // 1 rounding errors
FPT d4 = static_cast<FPT>( 0.11 ); // 1 rounding errors

FPT d5 = d3 - d4;
FPT d6 = d5 / d3;

printf( "%20.20e\n", d6/e ); // expected value < 2
}

int main()
{
float dummy1 = 0;
double dummy2 = 0;

foo0( dummy1 );
foo0( dummy2 );

foo1( dummy1 );
foo1( dummy2 );

foo2( dummy1 );
foo2( dummy2 );

return 0;
}

// ************************************************************* //

Output when compiled by MSVC:

1.81818177877378950000e+000
0.00000000000000000000e+000
0.00000000000000000000e+000
8.26446280991735450000e-001
2.27272666929197960000e+000
3.97727272727272400000e+000

Thank you,