Boost-Commit :

Subject: [Boost-commit] svn:boost r50855 - sandbox/math_toolkit/libs/math/doc/sf_and_dist
From: pbristow_at_[hidden]
Date: 2009-01-28 12:01:53

Next message: pbristow_at_[hidden]: "[Boost-commit] svn:boost r50856 - sandbox/SOC/2007/visualization/boost/svg_plot/detail"
Previous message: pbristow_at_[hidden]: "[Boost-commit] svn:boost r50854 - sandbox/math_toolkit/libs/math/doc/sf_and_dist"

Author: pbristow
Date: 2009-01-28 12:01:52 EST (Wed, 28 Jan 2009)
New Revision: 50855
URL: http://svn.boost.org/trac/boost/changeset/50855

Log:
Made overview heading into sub-sections and added more to Why complements tip.
Text files modified:
sandbox/math_toolkit/libs/math/doc/sf_and_dist/dist_tutorial.qbk | 152 +++++++++++++++++++++++----------------
1 files changed, 91 insertions(+), 61 deletions(-)

Modified: sandbox/math_toolkit/libs/math/doc/sf_and_dist/dist_tutorial.qbk
==============================================================================
--- sandbox/math_toolkit/libs/math/doc/sf_and_dist/dist_tutorial.qbk (original)
+++ sandbox/math_toolkit/libs/math/doc/sf_and_dist/dist_tutorial.qbk 2009-01-28 12:01:52 EST (Wed, 28 Jan 2009)
@@ -5,17 +5,17 @@
[def __F_distrib [link math_toolkit.dist.dist_ref.dists.f_dist Fisher F Distribution]]
[def __students_t_distrib [link math_toolkit.dist.dist_ref.dists.students_t_dist Students t Distribution]]

-[def __handbook [@http://www.itl.nist.gov/div898/handbook/
+[def __handbook [@http://www.itl.nist.gov/div898/handbook/
NIST/SEMATECH e-Handbook of Statistical Methods.]]

[section:stat_tut Statistical Distributions Tutorial]
This library is centred around statistical distributions; this tutorial
-will give you an overview of what they are, how they can be used, and
+will give you an overview of what they are, how they can be used, and
provides a few worked examples of applying the library to statistical tests.

-[section:overview Overview]
+[section:overview Overview of Distributions]

-[h4 Headers and Namespaces]
+[section:headers Headers and Namespaces]

All the code in this library is inside namespace boost::math.

@@ -27,27 +27,29 @@
<boost/math/distributions/students_t.hpp> or
<boost/math/distributions.hpp>

-[h4 Distributions are Objects]
+[endsect] [/ section:headers Headers and Namespaces]

-Each kind of distribution in this library is a class type.
+[section:objects Distributions are Objects]

-[link math_toolkit.policy Policies] provide fine-grained control
+Each kind of distribution in this library is a class type - an object.
+
+[link math_toolkit.policy Policies] provide fine-grained control
of the behaviour of these classes, allowing the user to customise
behaviour such as how errors are handled, or how the quantiles
of discrete distributions behave.

[tip If you are familiar with statistics libraries using functions,
and 'Distributions as Objects' seem alien, see
-[link math_toolkit.dist.stat_tut.weg.nag_library the comparison to
-other statistics libraries.]
+[link math_toolkit.dist.stat_tut.weg.nag_library the comparison to
+other statistics libraries.]
] [/tip]

Making distributions class types does two things:

-* It encapsulates the kind of distribution in the C++ type system;
-so, for example, Students-t distributions are always a different C++ type from
+* It encapsulates the kind of distribution in the C++ type system;
+so, for example, Students-t distributions are always a different C++ type from
Chi-Squared distributions.
-* The distribution objects store any parameters associated with the
+* The distribution objects store any parameters associated with the
distribution: for example, the Students-t distribution has a
['degrees of freedom] parameter that controls the shape of the distribution.
This ['degrees of freedom] parameter has to be provided
@@ -57,60 +59,63 @@
are typedefs on type /double/ that mostly take the usual name of the
distribution
(except where there is a clash with a function of the same name: beta and gamma,
-in which case using the default template arguments - `RealType = double` -
+in which case using the default template arguments - `RealType = double` -
is nearly as convenient).
Probably 95% of uses are covered by these typedefs:

    using namespace boost::math;
-
+
    // Construct a students_t distribution with 4 degrees of freedom:
    students_t d1(4);
-
- // Construct a double-precision beta distribution
+
+ // Construct a double-precision beta distribution
    // with parameters a = 10, b = 20
    beta_distribution<> d2(10, 20); // Note: _distribution<> suffix !
-
+
If you need to use the distributions with a type other than `double`,
-then you can instantiate the template directly: the names of the
+then you can instantiate the template directly: the names of the
templates are the same as the `double` typedef but with `_distribution`
appended, for example: __students_t_distrib or __binomial_distrib:

    // Construct a students_t distribution, of float type,
    // with 4 degrees of freedom:
    students_t_distribution<float> d3(4);
-
+
    // Construct a binomial distribution, of long double type,
    // with probability of success 0.3
    // and 20 trials in total:
    binomial_distribution<long double> d4(20, 0.3);
-
+
The parameters passed to the distributions can be accessed via getter member
functions:

- d1.degrees_of_freedom(); // returns 4.0
-
+ d1.degrees_of_freedom(); // returns 4.0
+
This is all well and good, but not very useful so far. What we often want
is to be able to calculate the /cumulative distribution functions/ and
/quantiles/ etc for these distributions.

-[h4 Generic operations common to all distributions are non-member functions]
+[endsect] [/section:objects Distributions are Objects]
+
+
+[section:generic Generic operations common to all distributions are non-member functions]

Want to calculate the PDF (Probability Density Function) of a distribution?
No problem, just use:

    pdf(my_dist, x); // Returns PDF (density) at point x of distribution my_dist.
-
+
Or how about the CDF (Cumulative Distribution Function):

    cdf(my_dist, x); // Returns CDF (integral from -infinity to point x)
                      // of distribution my_dist.
-
+
And quantiles are just the same:

    quantile(my_dist, p); // Returns the value of the random variable x
                           // such that cdf(my_dist, x) == p.
-
-If you're wondering why these aren't member functions, it's to
+
+If you're wondering why these aren't member functions, it's to
make the library more easily extensible: if you want to add additional
generic operations - let's say the /n'th moment/ - then all you have to
do is add the appropriate non-member functions, overloaded for each
@@ -124,9 +129,9 @@
for example in a uniform, normal or triangular,
see [@http://www.boost.org/libs/random/ Boost.Random].

-Whilst in principle there's nothing to prevent you from using the
+Whilst in principle there's nothing to prevent you from using the
quantile function to convert a uniformly distributed random
-number to another distribution, in practice there are much more
+number to another distribution, in practice there are much more
efficient algorithms available that are specific to random number generation.
] [/tip Random numbers that approximate Quantiles of Distributions]

@@ -136,7 +141,7 @@
The `binomial_distribution` constructor therefore has two parameters:

`binomial_distribution(RealType n, RealType p);`
-
+
For this distribution the random variate is k: the number of successes observed.
The probability density\/mass function (pdf) is therefore written as ['f(k; n, p)].

@@ -153,15 +158,15 @@
] [/tip Random Variates and Distribution Parameters]

As noted above the non-member function `pdf` has one parameter for the distribution object,
-and a second for the random variate. So taking our binomial distribution
+and a second for the random variate. So taking our binomial distribution
example, we would write:

`pdf(binomial_distribution<RealType>(n, p), k);`

-The ranges of random variate values that are permitted and are supported can be
+The ranges of random variate values that are permitted and are supported can be
tested by using two functions `range` and `support`.

-The distribution (effectively the random variate) is said to be 'supported'
+The distribution (effectively the random variate) is said to be 'supported'
over a range that is
[@http://en.wikipedia.org/wiki/Probability_distribution
  "the smallest closed set whose complement has probability zero"].
@@ -180,15 +185,15 @@
zero is not the most useful value for the lower limit of supported, as we discovered.
So for this, and similar distributions,
we have decided it is most numerically useful to use
-the closest value to zero, min_value, for the limit of the supported range.
+the closest value to zero, min_value, for the limit of the supported range.
(The `range` remains from zero, so you will still get `pdf(weibull, 0) == 0`).
(Exponential and gamma distributions have similarly discontinuous functions).

Mathematically, the functions may make sense with an (+ or -) infinite value,
but except for a few special cases (in the Normal and Cauchy distributions)
-this implementation limits random variates to finite values from the `max`
+this implementation limits random variates to finite values from the `max`
to `min` for the `RealType`.
-(See [link math_toolkit.backgrounders.implementation.handling_of_floating_point_infinity
+(See [link math_toolkit.backgrounders.implementation.handling_of_floating_point_infinity
Handling of Floating-Point Infinity] for rationale).

@@ -196,23 +201,23 @@

[*Discrete Probability Distributions]

-Note that the [@http://en.wikipedia.org/wiki/Discrete_probability_distribution
+Note that the [@http://en.wikipedia.org/wiki/Discrete_probability_distribution
discrete distributions], including the binomial, negative binomial, Poisson & Bernoulli,
are all mathematically defined as discrete functions:
-that is to say the functions `cdf` and `pdf` are only defined for integral values
+that is to say the functions `cdf` and `pdf` are only defined for integral values
of the random variate.

However, because the method of calculation often uses continuous functions
it is convenient to treat them as if they were continuous functions,
and permit non-integral values of their parameters.

-Users wanting to enforce a strict mathematical model may use `floor`
-or `ceil` functions on the random variate prior to calling the distribution
+Users wanting to enforce a strict mathematical model may use `floor`
+or `ceil` functions on the random variate prior to calling the distribution
function.

The quantile functions for these distributions are hard to specify
in a manner that will satisfy everyone all of the time. The default
-behaviour is to return an integer result, that has been rounded
+behaviour is to return an integer result, that has been rounded
/outwards/: that is to say, lower quantiles - where the probability
is less than 0.5 - are rounded down, while upper quantiles - where
the probability is greater than 0.5 - are rounded up. This behaviour
@@ -221,17 +226,17 @@
the requested coverage will be present in the tails.

This behaviour can be changed so that the quantile functions are rounded
-differently, or return a real-valued result using
+differently, or return a real-valued result using
[link math_toolkit.policy.pol_overview Policies]. It is strongly
-recommended that you read the tutorial
+recommended that you read the tutorial
[link math_toolkit.policy.pol_tutorial.understand_dis_quant
Understanding Quantiles of Discrete Distributions] before
using the quantile function on a discrete distribution. The
-[link math_toolkit.policy.pol_ref.discrete_quant_ref reference docs]
+[link math_toolkit.policy.pol_ref.discrete_quant_ref reference docs]
describe how to change the rounding policy
for these distributions.

-For similar reasons continuous distributions with parameters like
+For similar reasons continuous distributions with parameters like
"degrees of freedom"
that might appear to be integral, are treated as real values
(and are promoted from integer to floating-point if necessary).
@@ -239,16 +244,17 @@
degrees of freedom do have a genuine meaning.
]

+[endsect] [/ section:generic Generic operations common to all distributions are non-member functions]
+
[#complements]
-[h4 Complements are supported too]
+[section:complements Complements are supported too - and when to use them]

Often you don't want the value of the CDF, but its complement, which is
-to say `1-p` rather than `p`. You could calculate the CDF and subtract
+to say `1-p` rather than `p`. It is tempting to calculate the CDF and subtract
it from `1`, but if `p` is very close to `1` then cancellation error
-will cause you to lose significant digits. In extreme cases, `p` may
-actually be equal to `1`, even though the true value of the complement is non-zero.
+will cause you to lose accuracy, perhaps totally.

-[link why_complements See also ['"Why complements?"]]
+[link why_complements See below ['"Why and when to use complements?"]]

In this library, whenever you want to receive a complement, just wrap
all the function arguments in a call to `complement(...)`, for example:
@@ -262,7 +268,7 @@
by wrapping all of its arguments in a call to `complement(...)`, for example:

    students_t dist(5);
-
+
    for(double i = 10; i < 1e10; i *= 10)
    {
       // Calculate the quantile for a 1 in i chance:
@@ -271,12 +277,12 @@
       cout << "Quantile of students-t with 5 degrees of freedom\n"
               "for a 1 in " << i << " chance is " << t << endl;
    }
-
+
[tip

[*Critical values are just quantiles]

-Some texts talk about quantiles, others about critical values, the basic rule is:
+Some texts talk about quantiles or percentiles, others about critical values; the basic rule is:

['Lower critical values] are the same as the quantile.

@@ -284,7 +290,7 @@
of the probability.

For example, suppose we have a Bernoulli process, giving rise to a binomial
-distribution with success ratio 0.1 and 100 trials in total. The
+distribution with success ratio 0.1 and 100 trials in total. The
['lower critical value] for a probability of 0.05 is given by:

`quantile(binomial(100, 0.1), 0.05)`
@@ -311,7 +317,7 @@
Or to look at this another way: consider that we want the risk of falsely
rejecting the null-hypothesis in the Student's t test to be 1 in 1 billion,
for a sample size of 10,000.
-This gives a probability of 1 - 10[super -9], which is exactly 1 when
+This gives a probability of 1 - 10[super -9], which is exactly 1 when
calculated at float precision. In this case calculating the quantile from
the complement neatly solves the problem, so for example:

@@ -326,11 +332,32 @@
`quantile(students_t(10000), 1)`

Which has no finite result.
-]

-[h4 Parameters can be calculated]
+With all distributions, even for more reasonable probabilities
+(unless the value of p can be represented exactly in the floating-point type),
+the loss of accuracy quickly becomes significant if you simply calculate the probability from 1 - p
+(because the result will be mostly garbage digits for p ~ 1).
+
+So always avoid, for example, using a probability near to unity like 0.99999
+
+`quantile(my_distribution, 0.99999)`
+
+and instead use
+
+`quantile(complement(my_distribution, 0.00001))`
+
+since 1 - 0.99999 is not exactly equal to 0.00001 when using floating-point arithmetic.

-Sometimes it's the parameters that define the distribution that you
+This assumes that the 0.00001 value is either a constant,
+or can be computed by some manner other than subtracting 0.99999 from 1.
+
+] [/ tip *Why bother with complements anyway?]
+
+[endsect] [/ section:complements Complements are supported too - and when to use them]
+
+[section:parameters Parameters can be calculated]
+
+Sometimes it's the parameters that define the distribution that you
need to find. Suppose, for example, you have conducted a Students-t test
for equal means and the result is borderline. Maybe your two samples
differ from each other, or maybe they don't; based on the result
@@ -346,7 +373,7 @@
       0.05, // maximum risk of falsely rejecting the null-hypothesis.
       0.1, // maximum risk of falsely failing to reject the null-hypothesis.
       0.13); // sample standard deviation
-
+
Returns the number of degrees of freedom required to obtain a 95%
probability that the observed difference in means is not down to
chance alone. In the case that a borderline Students-t test result
@@ -354,8 +381,10 @@
would have to become before the observed difference was considered
significant. It assumes, of course, that the sample mean and standard
deviation are invariant with sample size.
-
-[h4 Summary]
+
+[endsect] [/ section:parameters Parameters can be calculated]
+
+[section:summary Summary]

* Distributions are objects, which are constructed from whatever
parameters the distribution may have.
@@ -372,6 +401,7 @@

Now that you have the basics, the next section looks at some worked examples.

+[endsect] [/section:summary Summary]
[endsect] [/section:overview Overview]

[section:weg Worked Examples]
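
Illustration (not part of the commit): the two floating-point claims in the expanded "Why complements" tip can be checked with the standard library alone. The sketch below is a minimal check under stated assumptions; it does not use Boost.Math, and both function names are hypothetical helpers invented for this note. The first function checks that 1 - 10^-9 rounds to exactly 1 at float precision; the second measures how far 1 - 0.99999 lands from the literal 0.00001 at double precision.

```cpp
#include <cmath>

// Hypothetical helper (not a Boost.Math function): returns true when
// the probability 1 - 1e-9, stored in a float, is exactly 1.
bool float_probability_collapses_to_one()
{
    // volatile forces the value to be stored at float precision.
    volatile float p = 1.0f - 1e-9f;
    return p == 1.0f;
}

// Hypothetical helper (not a Boost.Math function): relative difference
// between the complement obtained by subtraction (1 - 0.99999) and the
// complement written directly as the constant 0.00001.
double relative_error_of_subtracted_complement()
{
    const double p = 0.99999;            // not exactly representable in binary
    const double q_subtracted = 1.0 - p; // inherits the rounding error of p
    const double q_direct = 0.00001;     // rounded once, to the nearest double
    return std::fabs(q_subtracted - q_direct) / q_direct;
}
```

Calling float_probability_collapses_to_one() should return true, which is why quantile(students_t(10000), 1) is what a float-precision caller would actually be asking for. The relative error returned by the second function is small (below about 1e-11 for this value) but nonzero, which is the reason the tip recommends passing the small probability directly, as in quantile(complement(my_distribution, 0.00001)).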

