Boost logo

Boost :

From: Charles Karney (ckarney_at_[hidden])
Date: 2006-08-07 11:37:20


>>I haven't delved into the code to see where all these problems arise.
>
> interestingly, just two days ago I received similar report regarding
> "bad" distribution of uniform_int . Here is test program (comparing
> number of random results in top and bottom 10% of range):

This bug has a different cause from the ones I reported. Here I
identify the underlying causes for the bugs:

> (1) uniform_int<char>(low, high) produces values outside [low, high].
> E.g.,
>
> variate_generator<random_generator&, uniform_int<char> >
> unichar(r, uniform_int<char>(-100, 100));
>
> (2) uniform_int<short>(low, high) produces values outside [low, high].
> E.g.,
>
> variate_generator<random_generator&, uniform_int<short int> >
> unishort(r, uniform_int<short int>(-20000, 20000));

These are both caused by overflow in the statement

    _range = _max - _min;

in init() in random/uniform_int.hpp. For signed types this can
overflow, so that _range is negative even if _max >= _min. The wreaks
havoc with the rest of the code.

> (3) uniform_int<long long>(low, high) produces a distribution with the
> wrong mean. E.g.,
>
> variate_generator<random_generator&, uniform_int<long long> >
> unill(r, uniform_int<long long>(0LL, 0xffffffffffLL));
>
> has a mean of about 2.0e9 instead of 5.5e11.
>
> (4) uniform_int<long long>(low, high) sometimes takes a LONG time.
> E.g.,
>
> variate_generator<random_generator&, uniform_int<long long> >
> unillint(r, uniform_int<long long>(0LL, 0x10000000000LL));
> unillint();
>
> takes about 40 seconds on a 800 MHz Intel PC.

These are both caused by problems in do_compare<false, true> in
random/detail/signed_unsigned_compare.hpp. In particular the statement

  static bool lessthan(T1 x, T2 y)
    { return y >= 0 && x < static_cast<T1>(y); }

returns nonsensical results with T1 = unsigned and T2 = long long. A
possible band-aid would be

  static bool lessthan(T1 x, T2 y) {
    return y >= 0 &&
      (std::numeric_limits<T1>::max() >= std::numeric_limits<T2>::max() ?
       x < static_cast<T1>(y) :
       static_cast<T2>(x) < y);
  }

However I fear that there are other places where overflow can occur in
uniform_int.

The following wasn't reported in my earlier E-mail, but illustrates the
bug forwarded by Bronek Kozicki:

(3a) uniform_int<int>(low, high) produces a distribution with the
wrong mean. E.g.,

    variate_generator<random_generator&, uniform_int<int> >
      uniint(r, uniform_int<int>(0, 800000000-1));

has a mean of about 3.7e8 instead of 4.0e8. This is caused by
uniform_int invoking uniform_smallint which typically yields approximate
results. The threshold for this

      if(brange / _range > 4) {
        return boost::uniform_smallint<result_type>(_min, _max)(eng);
      }

is absurdly generous. In fact, I would argue that uniform_int should
NEVER call uniform_smallint but should ALWAYS return accurate results
(assuming that the underlying generator is perfect).

Unfortunately if the test (brange / _range > 4) is changed to (false), a
very inefficient rejection method takes over (see the 40 sec computation
of a uniform_int<long long>(low, high) in (4) above). For an efficient
rejection method that returns uniform integers in an arbitrary range,
see the definition of IntegerC<T>(T m, T n) in

    http://charles.karney.info/random/

> (5) uniform_01<random_generator, float> uni01(r) sometimes returns a
> result outside [0,1). Specifically it can return 1.0f because
> (2^32-1)/2^32 rounds to 1.

uniform_01 returns

    result_type(_rng() - _rng.min()) /
        (result_type(_rng.max() - _rng.min()) + 1)

This has two problems. If result_type is float, the result can overflow
to 1 (for mt19937). If result_type is double or long double, the result
only contains 32 bits of randomness (instead of 53 bits or 64 bits).

Fixing these problems appears to be a very messy undertaking given the
present structure of the random number library. Thus dealing with a
random generator returning long ints in [-2e9, 2e9] is going to be a
pain since 4e9 isn't representable as a long int.

One possible solution is to require more of the generators. Thus
integer generators might be required to return unsigned results in
[0,max]. Similarly real generators might be required to return results
representable as eps*[0,max] for some (queriable) eps.

I might even suggest going further and requiring that max = 2^b - 1, so
that a whole number of bits of randomness are returned. This greatly
simplifies the computation of the basic integer and real distributions.

In particular, my random number library (based on mt129937, see web site
given above) provides implementations of uniform real distributions
which as equivalent to sampling uniformly in [0,1] and EXACTLY rounding
the results down either to a multiple of the machine precision, e.g.,
Fixed<T>(), or to the nearest representable floating point number, e.g.,
Float<T>().

Perhaps this might slow some generators down (but notably not mt19937).
But that is far preferable to the current situation where incorrect
results are being obtained.

-- 
Charles Karney <ckarney_at_[hidden]>
Sarnoff Corporation, Princeton, NJ 08543-5300
URL: http://charles.karney.info
Tel: +1 609 734 2312
Fax: +1 609 734 2662

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk