From: Paul A Bristow (pbristow_at_[hidden])
Date: 2006-07-12 05:11:24
| -----Original Message-----
| From: boost-bounces_at_[hidden]
| [mailto:boost-bounces_at_[hidden]] On Behalf Of Topher Cooper
| Sent: 11 July 2006 17:32
| To: boost_at_[hidden]
| Subject: Re: [boost] [math/staticstics/design] How best to
| name statistical functions?
|
| At 11:02 AM 7/11/2006, Paul A Bristow wrote:
|
|
| >| So let's use the Students T distribution as an example. The
| >| Students T
| >| distribution is a *family* of 1-dimensional distributions
| >| that depend on a single parameter, called "degrees of freedom".
| >
| >Does the word *family* imply integral degrees of freedom?
|
| No, a "family of distributions" does not imply that the parameters
| are integral. What is frequently referred to as *the* normal
| distribution is also a family parameterized by the mean and standard
| deviation. Transformation between members of the family is so easy
| that we generally transform everything into and from one member of
| the family, the "standard normal" distribution.
|
| Keep in mind that a distribution is not a function, although it is
| associated with several functions or function-like entities.
|
| Standard usage is to consider the distributions in the family to be
| indexed by parameters and therefore the associated functions to be
| indexed, single parameter functions. There isn't much difference
| mathematically, though, between p[mu, sigma](x) and p(mu, sigma, x)
| (even when the indexes *are* integral), and sometimes it is
| useful to reframe them in that way. The point is, that is a
| reframing, and the standard (no, I am not imagining that it is
| standard) usage is to treat single-dimensional distributions as
| being single-dimensional.
Thanks, I think I understand better now.
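To make the two framings concrete, here is a minimal C++ sketch using the normal family. The names normal_pdf3 and normal_member are invented for this illustration only and are not proposed library names; the sketch also assumes double rather than a template parameter.

```cpp
#include <cassert>
#include <cmath>

// Three-argument framing: p(mu, sigma, x).
// normal_pdf3 is a hypothetical name used only for this sketch.
inline double normal_pdf3(double mu, double sigma, double x)
{
    assert(sigma > 0.0);
    const double pi = 3.14159265358979323846;
    const double z = (x - mu) / sigma; // transform to the standard normal
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
}

// Indexed framing: p[mu, sigma](x). The parameters pick one member of the
// family; that member's density is then a function of x alone.
struct normal_member // hypothetical name
{
    double mu;
    double sigma;

    double pdf(double x) const { return normal_pdf3(mu, sigma, x); }
};
```

With this, normal_member{10.0, 2.0}.pdf(11.0) and normal_pdf3(10.0, 2.0, 11.0) compute the same value; the only difference is where the parameters live.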
| >And the highest priority in my book is the END USERS,
| >not the professionals.
|
| Exactly -- the professionals are aware of the non-standard
| usage. Let's give the end users a chance of being able to use what
| they learned in their high school stat class.
My main objective :-))
| . Other common member functions might include
| >| "mean", "variance", and possibly others.
| >
| >Median, mode, variance, skewness, kurtosis are commonly given,
| >for example:
| >
| >http://en.wikipedia.org/wiki/Student%27s_t
|
| Skewness and kurtosis are generally defined but rarely used for
| distributions. Their computation on small or even moderate samples
| tends to be rather unstable, so comparison to the ideal distributions
| isn't terribly useful. I wouldn't bother with them. Mode is not
| uniquely defined for many distributions, nor is it that commonly used
| (even if the references give a formula) in practice for unimodal
| distributions. Except for some specialized uses, these are more
| useful for theory than for computation -- more algebraic than
| numerical.
|
| There are a lot of other possible associated functions, such as
| general quantiles or various confidence intervals, but I don't think
| many of them have general enough use to bother with for all
| distributions. People who need it could use the distribution as a
| template parameter. The only exception I would suggest would be to
| include the convenience of the standard deviation as well as the
| variance. One might stick in RNG here but that is redundant
| at this point.
| As to naming of the probability functions:
|
| My personal preference would be to use what are probably the most
| common abbreviations for the basic functions. They are simple,
| compact and standard. Maybe a little obscure for those who only took
| statistics in high school or some who only know cookbook statistics
| -- but that is what documentation is for. The ignorant are after all
| ignorant whatever choice is made, but you can do something about it
| by using the standard terms:
|
| dist.pdf(x) -- Probability Density Function, this is what looks like
| a "bell shaped curve" for a normal distribution, for
| example. A.k.a. "p"
| dist.cdf(x) -- Cumulative Distribution Function. P
| dist.ccdf(x) -- Complementary Cumulative Distribution Function;
| ccdf(x) = 1 - cdf(x)
| dist.icdf(p) -- Inverse Cumulative Distribution Function: P';
| icdf(cdf(x)) = x and vice versa
| dist.iccdf(p) -- Inverse Complementary Cumulative Distribution
| Function; iccdf(p) = icdf(1-p); iccdf(ccdf(x)) = x
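For reference, the five abbreviated names listed above could look like this on a normal distribution. This is a sketch under stated assumptions, not a proposed interface: the class name normal_dist is hypothetical, the type is fixed to double, and the inverse uses plain bisection as a stand-in for a proper inversion algorithm. std::erfc is the standard complementary error function from <cmath>.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical illustration only: a normal distribution exposing the five
// abbreviated names. Not an existing or proposed Boost interface.
class normal_dist
{
public:
    normal_dist(double mean, double sd) : mean_(mean), sd_(sd) { assert(sd > 0.0); }

    // Probability Density Function ("p"): the bell-shaped curve.
    double pdf(double x) const
    {
        const double pi = 3.14159265358979323846;
        const double z = (x - mean_) / sd_;
        return std::exp(-0.5 * z * z) / (sd_ * std::sqrt(2.0 * pi));
    }

    // Cumulative Distribution Function ("P").
    double cdf(double x) const { return standard_cdf((x - mean_) / sd_); }

    // Complementary CDF; mathematically ccdf(x) = 1 - cdf(x).
    double ccdf(double x) const { return standard_cdf(-(x - mean_) / sd_); }

    // Inverse CDF, found by bisection on the standardized variable z.
    // Assumes 0 < p < 1. Bisection is a simple stand-in, not a fast method.
    double icdf(double p) const
    {
        assert(p > 0.0 && p < 1.0);
        double lo = -40.0, hi = 40.0;
        for (int i = 0; i < 200; ++i) {
            const double mid = 0.5 * (lo + hi);
            if (standard_cdf(mid) < p) lo = mid; else hi = mid;
        }
        return mean_ + sd_ * 0.5 * (lo + hi);
    }

    // Inverse complementary CDF: iccdf(p) = icdf(1 - p).
    double iccdf(double p) const { return icdf(1.0 - p); }

private:
    // CDF of the standard normal, via the complementary error function.
    static double standard_cdf(double z) { return 0.5 * std::erfc(-z / std::sqrt(2.0)); }

    double mean_;
    double sd_;
};
```

At the call site this reads as, for example, normal_dist d(100.0, 15.0); d.icdf(d.cdf(115.0)) which returns approximately 115.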
My instinct is that these are too abbreviated, despite their logicalness.
But this is the key problem - being clear, not curt, and yet concise.
students_t.inverse_complement_cumulative_probability certainly fails! ;-))
so we are getting to:
template <class T> // T an integral or real or floating-point type.
T distribution(T x) const; // Probability Density Function or pdf or p
T cumulative_probability(T x) const; // Cumulative Distribution Function. P
cumulative_probability is too long :-(
Do we REALLY need the cumulative here?
T probability(T x) const; // Cumulative Distribution Function or cdf or P
T quantile(T probability) const; // Also known as Inverse Cumulative Distribution Function
what do we call
T complementary_cumulative_probability(T x) const; // Complementary Cumulative Distribution Function. Q
??? :-((
and worse, what about the Inverse Complementary Cumulative Distribution Function:
complementary_quantile??? :-((
and the ad hoc 'extras':
static T degrees_of_freedom(T quantile, T probability); // static, so no trailing const
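To see how the longer candidate names read in use, here is a minimal sketch on an exponential distribution, picked only because every function has a closed form. The class name exponential_sketch is hypothetical, and the sketch assumes T is a floating-point type. It also shows one reason the two complement functions may deserve names of their own rather than being left to the user as 1 - probability(x) and quantile(1 - q): far out in the tail, the subtraction loses all the information.

```cpp
#include <cassert>
#include <cmath>

// Illustration only: an exponential distribution with rate lambda.
// The class name is hypothetical; the member names are the candidates
// discussed above. Assumes T is a floating-point type.
template <class T>
class exponential_sketch
{
public:
    explicit exponential_sketch(T lambda) : lambda_(lambda) { assert(lambda > T(0)); }

    // Cumulative Distribution Function, P: 1 - exp(-lambda * x).
    T probability(T x) const { return -std::expm1(-lambda_ * x); }

    // Complementary CDF, Q: exp(-lambda * x), computed directly.
    T complementary_cumulative_probability(T x) const { return std::exp(-lambda_ * x); }

    // Inverse of probability().
    T quantile(T p) const { return -std::log1p(-p) / lambda_; }

    // Inverse of complementary_cumulative_probability().
    T complementary_quantile(T q) const { return -std::log(q) / lambda_; }

private:
    T lambda_;
};
```

For exponential_sketch<double> d(1.0), d.complementary_quantile(1e-20) is about 46.05, while d.quantile(1.0 - 1e-20) is infinite, because 1.0 - 1e-20 rounds to exactly 1.0 in double precision. Likewise d.complementary_cumulative_probability(50.0) is a small positive number, but 1.0 - d.probability(50.0) is exactly zero.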
So I feel we haven't QUITE got there yet.
But many thanks for your help so far.
Paul
---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow_at_[hidden]

PS Since everybody obviously knows far more about stats than I do, can you also suggest fully worked examples that can be used to demonstrate usage in a tutorial? I'm especially keen to show how superior using this would be to the traditional tables and fixed 95% confidence limits.