
From: Fernando Cacciola (fcacciola_at_[hidden])
Date: 2001-12-10 14:27:35


----- Original Message -----
From: rogeeff <rogeeff_at_[hidden]>
To: <boost_at_[hidden]>
Sent: Monday, December 10, 2001 12:36 AM
Subject: [boost] Re: Floating Point comparisons

> Hi, all
>
> I have a couple of questions to ask and a couple of statements to confirm:
>
First, we need to set up a context:

The behavior we are going to discuss is specified by the standards known as
IEEE 754 and IEC 559 (they are the same standard under two names).
Unfortunately, AFAICT, neither the C nor the C++ standard *requires* conformance
to those standards, so the statements below are subject to the implementation.
The following discussion therefore applies to those platforms which implement
IEEE 754 floating-point formats and arithmetic.
This *should* be testable by inspecting 'numeric_limits<>::is_iec559'.
Unfortunately, Borland C++ 5.5.1 sets this to 'false', which isn't actually
true, since x87 hardware is IEEE 754 (thus IEC 559) conformant.
Probably other compilers 'lie' about this too.
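
For instance, a quick check like the following (just a sketch) prints what a
given compiler claims about conformance:

  #include <iostream>
  #include <limits>

  int main()
  {
    // What the implementation claims about IEC 559 / IEEE 754 conformance.
    std::cout << "float  is_iec559: "
              << std::numeric_limits<float>::is_iec559 << '\n';
    std::cout << "double is_iec559: "
              << std::numeric_limits<double>::is_iec559 << '\n';
  }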

> 1. There are only three sources of measurable floating point rounding
> errors:
> * Conversion from decimal presentation to binary
> * Type promotion
> * Arithmetic operations
>
This is correct as long as we properly define the meaning of some words in the
statement:

"rounding" is the operation by which a real number 'x' is represented in a
floating-point format with 'p' binary digits (bits) as the fp (floating-point)
number 'X'.

The rounding "error" is the difference between the real and the fp values:
|x-X| as an absolute error, and |x-X|/|x| as a relative error.

A "measurable" rounding error means the error is bounded by a predictable
upper limit.
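
As a small illustration (a sketch only; the exact digits depend on the
platform), the decimal constant 0.1 cannot be represented exactly in binary,
and the conversion error can be observed by comparing against a wider type
(here 'double' stands in for the "real" value; it is itself only an
approximation, but a much closer one):

  #include <iostream>
  #include <iomanip>
  #include <cmath>

  int main()
  {
    float  X = 0.1f;  // 0.1 rounded to float precision
    double x = 0.1;   // 0.1 rounded to double precision (much closer)

    double abs_err = std::fabs(x - X);       // |x - X|
    double rel_err = abs_err / std::fabs(x); // |x - X| / |x|

    std::cout << std::setprecision(17)
              << "X       = " << X       << '\n'
              << "abs err = " << abs_err << '\n'
              << "rel err = " << rel_err << '\n';
  }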

It is important to notice that although the above operations might produce
rounding errors, they might also produce other errors, such as
underflow/overflow and division-by-zero (even with valid operands).

There is also an important issue regarding "rounding" and "arithmetic
operations":

All theorems about the upper limit of a rounding error, including that of
(1/2)ulp discussed below, refer *only* to the 'rounding' operation, nothing
more. This means that the 'operation error', that is, the error incurred by
the operation itself, besides rounding, isn't considered.
Now, in order for numerical software to be able to actually predict error
bounds, IEEE 754 requires algebraic operations to be 'correctly or exactly
rounded'. That is, it requires that the internal computation of a given
operation be such that the fp result is the *exact* result rounded to the
number of working bits. In other words, the computation performed by the
operation itself must not introduce any additional errors.
It might be trivial to imagine this being the case for the arithmetic
operations (+ - * /). However, IEEE 754 also requires SQRT to be exactly
rounded (which involves a careful implementation) and explicitly does not
require the same behavior for transcendental functions.

Unfortunately, AFAICT, most C/C++ compilers don't state explicitly whether
their sqrt() implementation is exactly rounded as specified by IEEE 754.
This requires us to limit our statements to arithmetic operations only.
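
Exact rounding of '+' can actually be observed. The classic TwoSum trick
(due to Knuth) recovers the rounding error of a single addition using only
further additions and subtractions; it works precisely *because* each
operation is correctly rounded (the sketch below assumes the compiler really
evaluates the double operations in double precision, which on x87 hardware
may require the appropriate precision-control settings):

  #include <cstdio>

  int main()
  {
    double a = 1.0e16;
    double b = 3.14159;

    double s   = a + b;                  // correctly rounded sum
    double bv  = s - a;                  // the part of b that made it into s
    double av  = s - bv;                 // the part of a that made it into s
    double err = (a - av) + (b - bv);    // rounding error of the addition

    // s + err is exactly the infinitely precise sum of the stored values.
    std::printf("s = %.17g   rounding error = %.17g\n", s, err);
  }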

> There also non-measurable errors while using non-arithmetic
> operations.
>
Correct.
There are also non-measurable errors due to underflow/overflow and
division-by-zero.
For example, converting (double)(1e+308) to float yields an error that isn't
considered a "rounding error" but an "out-of-range error". The same happens
with (1e+38f * 1e+38f).
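
On an IEEE 754 platform with the default rounding mode, both cases produce
+infinity rather than some rounded value, which a small test (sketch) makes
visible:

  #include <iostream>

  int main()
  {
    double big = 1e+308;
    float  f = static_cast<float>(big);  // exceeds the float range -> +infinity

    float  a = 1e+38f, b = 1e+38f;
    float  g = a * b;                    // the multiplication itself overflows

    std::cout << "(float)1e+308   = " << f << '\n'   // typically prints "inf"
              << "1e+38f * 1e+38f = " << g << '\n';
  }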

> 2. The value of rounding error does not exceed:
>
> 1/2 * ULP(Unit in Last Point)
>
Correct.

This error bound is given by the fact that when a real number 'x' is
represented as a fp number 'X', X is chosen to be the nearest fp number that
approximates 'x'. Assuming no underflow/overflow, there are always 2 fp
numbers enclosing 'x' (to the left and to the right respectively): 'LX' and
'RX'; they differ *exactly* in the last binary digit, that is, by "one unit
in the last place".
Thus (RX-LX) = 1 ulp.
Assuming round to nearest, since both x and X are inside [LX,RX], the error
|x-X| <= (1/2)ulp.
In other words, the fp representation of a real number cannot differ from it
by more than half an ulp (assuming default rounding and no
underflow/overflow).
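
This can be checked numerically (a sketch, using nextafter() from <cmath> to
find the neighbouring float and thus the ulp spacing at X):

  #include <cstdio>
  #include <cmath>
  #include <limits>

  int main()
  {
    double x = 0.1;                      // stands in for the real number 'x'
    float  X = static_cast<float>(x);    // 'x' rounded to the nearest float

    // Distance to the next float above X: the size of 1 ulp at X.
    float ulp = std::nextafter(X, std::numeric_limits<float>::infinity()) - X;

    double err = std::fabs(x - static_cast<double>(X));

    std::printf("1 ulp at X = %.10g\n", ulp);
    std::printf("|x - X|    = %.10g\n", err);
    std::printf("|x - X| <= ulp/2 : %s\n", err <= ulp / 2.0 ? "yes" : "no");
  }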

> ULP value is defined by binary format of floating point type.

Yes: it is defined as a '1' in the last binary digit of the significand.

> The
> float, for example, has ULP = 2^-23, for precision of float normal
> numbers consist of 24 bits (23-bit fraction combined with the
> implicit leading significant bit).

Not exactly.
As I mentioned, ULP stands for 'unit in the last place', which is the last
binary digit of the significand.

But the value of a floating-point number is given by n*2^exp, so the
*absolute* value of that last bit of 'n' (1 ulp) depends on the exponent.

Now, for the value 1.0 (binary exponent 0), this last binary digit (1 ulp)
has the absolute value:

2^(-(p-1))   { p = number of significand bits }

which is the number you showed; but for a binary exponent of, say, +38, this
*same last bit* (still 1 ulp) has the absolute value:

2^[38-(p-1)] == 2^(39-p)
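
The following sketch shows this scaling directly, again using nextafter() to
measure the spacing (the exact figures assume IEEE 754 single precision):

  #include <cstdio>
  #include <cmath>
  #include <limits>

  int main()
  {
    const float inf = std::numeric_limits<float>::infinity();

    float small = 1.0f;
    float big   = 1e+38f;

    // Size of 1 ulp at each magnitude.
    float ulp_small = std::nextafter(small, inf) - small;  // 2^-23, about 1.19e-7
    float ulp_big   = std::nextafter(big,   inf) - big;    // about 1.01e+31 (!)

    std::printf("1 ulp at 1.0    = %g\n", ulp_small);
    std::printf("1 ulp at 1e+38f = %g\n", ulp_big);
  }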

> From definition of ULP follows
> that ULP is a minimum value such that 1 + ULP != 1
>
As I explained above, 1 ulp is always 1 ulp regardless of the exponent.
In the particular case of exponent 0 (values with magnitude in [1,2)), the
absolute value of 1 ulp is called the "machine epsilon", and is equal to
numeric_limits<>::epsilon() on conformant platforms.

It follows from the ulp definition that ( 1 + eps > 1 ).

This is quite similar to what you wrote, except that the inequality contains
"eps" instead of "ULP".

> The source of 1/2 is obscure to me. Could it be 1?

This comes from the fact that the 'real' number is approximated by the
closest fp value, either to the left or to the right of it.
If round to nearest is used (the default), you get half an ulp; but it could
be up to 1 ulp if the rounding mode is changed.

Now, there are important issues to understand:

a) The rounding error is always bounded by 1/2 ulp (assuming default
   rounding).
b) The absolute value of 1 ulp depends on the exponent.
c) 'epsilon' is the absolute value which corresponds to 1 ulp of the value
   1.0.
d) The *absolute* rounding error of an arbitrary number isn't bounded by
   epsilon directly, but by 'epsilon' scaled according to the number's
   exponent.
e) The relative error, that is |x-X|/|x|, remains proportional to 'eps',
   so it can be compared against it.

The 'floating point comparison' algorithms all test the predicted relative
rounding error against eps, because of (e).
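
A minimal sketch of such a comparison (the name 'close_enough' and the
tolerance handling are just illustrative here, not the actual library
interface):

  #include <cmath>

  // True if v1 and v2 differ by no more than 'tolerance' in *relative*
  // terms. 'tolerance' would typically be a small multiple of
  // numeric_limits<double>::epsilon(), scaled by the number of operations
  // that produced v1 and v2.
  bool close_enough(double v1, double v2, double tolerance)
  {
    double diff = std::fabs(v1 - v2);

    // Check the relative error against both operands (multiplying instead
    // of dividing avoids a possible division by zero) so the predicate is
    // symmetric in v1 and v2.
    return diff <= tolerance * std::fabs(v1)
        && diff <= tolerance * std::fabs(v2);
  }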

>
> 3. It seems that calculation using float numbers sometimes have
> better precision (less relative error) than double. How could this be?

AFAIK, this isn't true. How did you infer it?

>
> 4. To calculate a relative error I am using following formula:
>
> |v1 - v2| / |v1|. As you can see it involves 2 additional arithmetic
> operations. So, should I add 2 to number of arithmetic operation
> involved when calculating tolerance value?
>
Good question.
Unfortunately, I think it is even more complicated.
For example, |v1-v2| can suffer catastrophic cancellation (all the
significant bits cancel out), yielding a computed relative error of 0.
IMO, the error-bound formulas are given so that we can understand the
behavior of a computation, not so that we can actually 'compute' the error.
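
A small illustration of the cancellation problem (a sketch; the figures
assume IEEE 754 single precision): subtracting two nearly equal values
cancels the leading bits and exposes the rounding errors already present in
the operands, so the computed difference can be far off, or even exactly
zero when both operands round to the same fp value.

  #include <cstdio>

  int main()
  {
    float v1 = 1.0000001f;  // rounds to 1 + 2^-23 = 1.00000011920928955...
    float v2 = 1.0f;

    float diff = v1 - v2;   // the subtraction is exact here, but...

    // The decimal inputs differ by exactly 1e-7, yet the computed
    // difference is ~1.192e-7: roughly 19% off, because the cancellation
    // exposes the rounding error made when storing v1.
    std::printf("computed diff = %.9g (the decimal inputs differ by 1e-7)\n",
                diff);
  }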

> 5. Finally, lab work. Could you check the output from the following
> program and explain why the relative error exceeds the expected values:
>
Unfortunately, I couldn't understand the design of these tests.
Could you explain it?

Fernando Cacciola
Sierra s.r.l.
fcacciola_at_[hidden]
www.gosierra.com

