 # Boost :

From: Guillaume Melquiond (gmelquio_at_[hidden])
Date: 2003-06-21 17:50:32

On Sat, 21 Jun 2003, Gennaro Prota wrote:

> On Fri, 20 Jun 2003 22:04:53 +0200 (CEST), Guillaume Melquiond
> wrote:
>
> [...]
> >I know this part of the standard. But it doesn't apply in the situation I
> >was describing. I was describing the case of a constant whose decimal (and
> >consequently binary) representation is not finite. It can be an irrational
> >number like pi; but it can also simply be a rational like 1/3.
>
> I was just trying to start from something sure, such as the standard.
>
> The way I read the paragraph quoted above, "the scaled value" is the
> value intended in the mathematical sense, not its truncated internal
> representation. So the compiler must behave *as if* it considered all
> the digits you provide and chose the nearest element (smaller or
> larger - BTW, C99 is different in this regard). In your example:
>
> the constant is 1.00050001236454786005785305678....
> you write it with seven digits: 1.000500
> the floating-point format only uses 4 digits
>
> you wouldn't write 1.000500 but, for instance, 1.00050001 and the
> hypothetical base-10 implementation should then consider all the
> digits even if it can store only 4 of them. Thus if it chooses the
> *larger* value nearest to that it chooses 1.001. Of course if it
> chooses the smaller..., but that's another story.

Yes, you're right, but you have lost track of why I was giving this
example. I was speaking about the conversion of a constant from 'long
double' to 'double'. In your case, you can always add digits to correct
the problem. But in the situation I was describing, you can't: the 'long
double' format is fixed once and for all; you can't suddenly add digits
when you need them.

So, even if you have the 'long double' representation of a constant, you
may still need the 'double' representation of this constant if you want
the 'double' version to be the nearest number.

> Just to understand each other: suppose I write
>
> double d =
> 2348542582773833227889480596789337027375682548908319870707290971532209025114608443463698998384768703031934976.0;
> // 2**360
>
> The value is 2**360. On my implementation, where DBL_MAX_EXP is 1024
> and FLT_RADIX is 2, isn't the compiler required to accept and
> represent it exactly?

Yes it is.

> >When manipulating such a number, you can only give a finite number of
> >decimal digits. And so the phenomenon of double rounding I was describing
> >will occur: you first do a rounding to get a finite number of
> >digits, and then the compiler does another rounding (the one described by
> >the part of the standard you are quoting) to fit the constant in
> >the floating-point format.
>
> I'm not very expert in this area. Can you give a real example
> (constant given in decimal and internal representation non-decimal)?

Okay, here is a "real" example. Let's suppose you have an x86 processor:
the 'long double' mantissa is 64 bits wide, the 'double' mantissa is 53
(52 + one implicit). The constant is the fractional number:
c = 73786976294838196225 / 55340232221128654848 (it's 4/3 - 3413/2^64)

Since you don't want the compiler to do the rounding (because of the
standard, we can't be sure of the rounding direction), you give the exact
decimal representation of the nearest 'long double':

long double c_l() {
  return 1.33333333333333314829616256247390992939472198486328125;
}

We are sure c_l is the nearest 'long double'. Now we want the nearest
'double'. Can we simply do:

double c_d() { return c_l(); }

No, we can't, because the constant was specially crafted to make this
conversion fail (when rounding to nearest). And it wasn't that difficult:
it only took a four-decimal-digit number (3413) to make it fail.

In fact, it is enough that 13 bits have a particular value.
Consequently, more than one constant out of 10000 may suffer from this
problem. So it is rare, but it is not impossible. That's why I was
suggesting that a library should provide a means to know whether a number
representing a constant is the nearest one or not. Is it a bit clearer?

And for people still wondering why I didn't use this kind of code (please
note that the value of the constant has changed: this time it is the
constant rounded to 53 decimal digits):

#define c 1.3333333333333331483142325986820016699615128648777803
long double c_l() { return c; }
double c_d() { return c; }

It's because the standard doesn't guarantee the direction of rounding. So
not only could the 'double' value be incorrect, but the 'long double'
value could be too, contrary to the previous example, where at least the
'long double' value was correct.

So, in the end, it's only a matter of what we expect from a constant
library. Do we expect it to give the nearest number for a given precision?
As for me, I do.

Regards,

Guillaume

PS: To summarize, for a given constant, if we want the best values without
relying on the rounding done by the compiler, we need to have an exact
representation for each available format. The most common floating-point
formats are 32, 64, 80, 64+64 and 128 bits wide. So that is 5 values per
constant (15 values if we also count lower and upper bounds). Because of the huge
number of required values, some of them may be missing. If that is the
case, the library needs a way to warn the program that it will not get the
nearest value but only an approximation.