# Boost :

From: Topher Cooper (topher_at_[hidden])
Date: 2006-07-11 12:32:04

At 11:02 AM 7/11/2006, Paul A Bristow wrote:

>| So let's use the Students T distribution as an example. The
>| Students T
>| distribution is a *family* of 1-dimensional distributions
>| that depend on a single parameter, called "degrees of freedom".
>
>Does the word *family* imply integral degrees of freedom?
>Numerically, and perhaps conceptually, it isn't - it's a continuous real.
>So could one also regard it as a two parameter function f(t, v) ?
>However I don't think this matters here.

No, a "family of distributions" does not imply that the parameters
are integral. What is frequently referred to as *the* normal
distribution is also a family parameterized by the mean and standard
deviation. Transformation between members of the family is so easy
that we generally transform everything into and from one member of
the family, the "standard normal" distribution.
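
As a rough illustration of that transformation, here is a small C++ sketch. The names `standardize`, `standard_normal_cdf` and `normal_cdf` are made up for this example and are not part of any proposed interface; it also assumes a compiler that provides `std::erfc` (C++11 or later).

```cpp
#include <cmath>

// Map x from a normal distribution with the given mean and standard
// deviation onto the standard normal (mean 0, standard deviation 1).
// Hypothetical helper, for illustration only.
inline double standardize(double x, double mean, double sd) {
    return (x - mean) / sd;
}

// CDF of the standard normal, written in terms of std::erfc.
inline double standard_normal_cdf(double z) {
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// CDF of any member of the family, computed by transforming into the
// standard member first.
inline double normal_cdf(double x, double mean, double sd) {
    return standard_normal_cdf(standardize(x, mean, sd));
}
```

With these definitions, `normal_cdf(12.0, 10.0, 2.0)` and `standard_normal_cdf(1.0)` are the same computation, which is the sense in which one member can stand in for the whole family.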

Keep in mind that a distribution is not a function, although it is
associated with several functions or function-like entities.

Standard usage is to consider the distributions in the family to be
indexed by parameters and therefore the associated functions to be
indexed, single-parameter functions. There isn't much difference
mathematically, though, between p[mu, sigma](x) and p(mu, sigma, x)
(even when the indexes *are* integral), and sometimes it is useful to
reframe them in that way. The point is, that is a reframing, and the
standard (no, I am not imagining that it is standard) usage is to
treat single-dimensional distributions as being single-dimensional.
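
To make the two framings concrete, here is a sketch in C++ using the normal density. The names `normal_density` and `normal_member` are invented for this illustration only.

```cpp
#include <cmath>

// Reframed view: one function of three arguments, p(mu, sigma, x).
inline double normal_density(double mu, double sigma, double x) {
    const double pi = 3.14159265358979323846;
    const double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
}

// Standard view: fixing the parameters picks one member of the family,
// and that member's density is a function of x alone, p[mu, sigma](x).
class normal_member {
public:
    normal_member(double mu, double sigma) : mu_(mu), sigma_(sigma) {}

    double density(double x) const {
        return normal_density(mu_, sigma_, x);
    }

private:
    double mu_;
    double sigma_;
};
```

Both compute the same numbers; the class simply matches the usual way of talking about "a distribution" as one member of the family with its parameters already fixed.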

>| Given a value, say, D,
>| for the degrees of freedom, you get a density function p_D and
>| integrating it gives you the cumulative density function P_D.
>
>
>| As I mentioned before, these should be member functions,
>| which could be called "density" (also called 'mass')
>
>| and "cumulative".
>
>OHOH many books don't mention either of these words!

But I would be very, very surprised to find many serious statistics
books written in English that don't.

>The whole nomenclature seems a massive muddle,
>with mathematicians, statisticians, and users of all sorts using different
>terms
>and everyone thinks they are the 'Standard' :-(

Some variation exists due to the interdisciplinary origin and
continued nature of the field, but most of the terminology is pretty
standard with some enclaves of specialized usage.

>And the highest priority in my book is the END USERS,
>not the professionals.

Exactly -- the professionals are aware of the non-standard
usage. Let's give the end users a chance of being able to use what
they learned in their high school stat class.

>
>| The cumulative density function is a strictly increasing
>| function and
>| therefore can be inverted. The inverse function could be called
>| "inverse_cumulative", which is a completely unambiguous name.
>
>But excessively long :-(
>
>| I would say that these three member functions should be
>| common to all
>| implemented distributions. Other common member functions
>| might include
>| "mean", "variance", and possibly others.
>
>Median, mode, variance, skewness, kurtosis are commonly given, for example:
>

Skewness and kurtosis are generally defined but rarely used for
distributions. Their computation on small or even moderate samples
tends to be rather unstable, so comparison to the ideal distributions
isn't terribly useful. I wouldn't bother with them. Mode is not
uniquely defined for many distributions, nor is it that commonly used
(even if the references give a formula) in practice for unimodal
distributions. Except for some specialized uses, these are more
useful for theory than for computation -- more algebraic than numerical.

There are a lot of other possible associated functions, such as
general quantiles or various confidence intervals, but I don't think
many of them have general enough use to bother with for all
distributions. People who need them could use the distribution as a
template parameter. The only exception I would suggest would be to
include the convenience of the standard deviation as well as the
variance. One might stick in RNG here but that is redundant at this point.
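
A sketch of what "use the distribution as a template parameter" might look like in C++ follows. Everything here is hypothetical: `uniform_dist` is a toy stand-in for a real distribution class, and the member names `variance()` and `icdf()` are assumptions, not an agreed interface.

```cpp
#include <cmath>
#include <utility>

// Toy distribution for the example: uniform on [a, b].
struct uniform_dist {
    double a;
    double b;

    double variance() const { return (b - a) * (b - a) / 12.0; }

    // Inverse cumulative distribution function, for 0 <= p <= 1.
    double icdf(double p) const { return a + p * (b - a); }
};

// Convenience function: works for any type with a variance() member.
template <class Dist>
double standard_deviation(const Dist& d) {
    return std::sqrt(d.variance());
}

// Less common need, written once outside the distribution classes:
// the central interval holding probability `level`, for any type
// with an icdf(p) member.
template <class Dist>
std::pair<double, double> central_interval(const Dist& d, double level) {
    const double tail = (1.0 - level) / 2.0;
    return std::make_pair(d.icdf(tail), d.icdf(1.0 - tail));
}
```

For instance, with `uniform_dist u = {0.0, 12.0};`, the call `central_interval(u, 0.5)` yields the interval from 3 to 9, without `uniform_dist` itself needing to know anything about intervals.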

As to naming of the probability functions:

My personal preference would be to use what are probably the most
common abbreviations for the basic functions. They are simple,
compact and standard. Maybe a little obscure for those who only took
statistics in high school or some who only know cookbook statistics
-- but that is what documentation is for. The ignorant are after all
ignorant whatever choice is made, but you can do something about it
by using the standard terms:

dist.pdf(x)   -- Probability Density Function; this is what looks like
                 a "bell shaped curve" for a normal distribution, for
                 example. A.k.a. "p"
dist.cdf(x)   -- Cumulative Distribution Function. A.k.a. "P"
dist.ccdf(x)  -- Complementary Cumulative Distribution Function;
                 ccdf(x) = 1 - cdf(x)
dist.icdf(p)  -- Inverse Cumulative Distribution Function. A.k.a. "P'";
                 icdf(cdf(x)) = x and vice versa
dist.iccdf(p) -- Inverse Complementary Cumulative Distribution
                 Function; iccdf(p) = icdf(1-p); iccdf(ccdf(x)) = x
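
For concreteness, here is a minimal sketch of those five members on one distribution. I picked the exponential distribution only because all five have closed forms; the class name `exponential_dist` and its constructor are assumptions made for the example.

```cpp
#include <cmath>

// Hypothetical example type: exponential distribution with rate lambda.
class exponential_dist {
public:
    explicit exponential_dist(double lambda) : lambda_(lambda) {}

    // Probability density function.
    double pdf(double x) const {
        return x < 0.0 ? 0.0 : lambda_ * std::exp(-lambda_ * x);
    }

    // Cumulative distribution function.
    double cdf(double x) const {
        return x < 0.0 ? 0.0 : 1.0 - std::exp(-lambda_ * x);
    }

    // Complementary CDF, mathematically 1 - cdf(x); computed directly
    // here so the upper tail does not lose precision to the subtraction.
    double ccdf(double x) const {
        return x < 0.0 ? 1.0 : std::exp(-lambda_ * x);
    }

    // Inverse CDF, for 0 <= p < 1.
    double icdf(double p) const {
        return -std::log(1.0 - p) / lambda_;
    }

    // Inverse complementary CDF, for 0 < p <= 1; equals icdf(1 - p).
    double iccdf(double p) const {
        return -std::log(p) / lambda_;
    }

private:
    double lambda_;
};
```

With `exponential_dist d(2.0);`, the identities above can be checked numerically: `d.icdf(d.cdf(0.7))` and `d.iccdf(d.ccdf(0.7))` both come back to 0.7 up to rounding, and `d.iccdf(0.25)` agrees with `d.icdf(0.75)`.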

Topher