From: Topher Cooper (topher_at_[hidden])
Date: 2006-07-12 17:28:13


At 05:11 AM 7/12/2006, you wrote:
> T distribution(T x) const; // Probability Density Function or pdf or p
> T cumulative_probability(T x) const; // Cumulative Distribution Function. P
>
>cumulative_probability is too long :-(
>
>Do we REALLY need the cumulative here?
>
> T probability(T x) const; // Cumulative Distribution Function or cdf or P

Sorry, as attractive as it seems at first blush, I think just
"probability" is a very poor choice. A very common confusion in
statistics is that people think of the value of the PDF as a
probability -- even though it is not (hence the "D" for
density). Even sophisticated people slip into thinking of it that
way (after all, it *does* represent the probability of an event for
discrete distributions). I think that people are much too likely to
get confused and think that probability means the PDF. Even without
that confusion, there is a legitimate ambiguity for the term: Which
probability? Note for example that in traditional statistical
hypothesis testing, the "p-value" (very roughly speaking, the
probability of falsely rejecting the null hypothesis given the
assumption that the null hypothesis is true) is the complementary CDF
for a 1-tailed test and twice the complementary CDF for most 2-tailed tests.
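
To make the p-value example concrete, here is a minimal sketch for the standard normal case only. The function names are hypothetical (they are not from any proposed Boost interface), and the two-tailed form assumes a symmetric distribution.

```cpp
#include <cassert>
#include <cmath>

// Complementary CDF (upper tail) of the standard normal: P(Z > z).
// Uses erfc directly instead of 1 - CDF(z), which loses precision in the tail.
double normal_ccdf(double z) {
    return 0.5 * std::erfc(z / std::sqrt(2.0));
}

// One-tailed (upper-tail) p-value for an observed statistic z.
double p_value_one_tailed(double z) {
    assert(!std::isnan(z));
    return normal_ccdf(z);
}

// Two-tailed p-value for a symmetric distribution:
// twice the complementary CDF evaluated at |z|.
double p_value_two_tailed(double z) {
    assert(!std::isnan(z));
    return 2.0 * normal_ccdf(std::fabs(z));
}
```

Neither of these is the plain CDF, which is part of why a member called just "probability" leaves open which probability is meant.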

I don't have as much objection to using "distribution" for the PDF,
but the nit-picker in me is a bit uncomfortable with it. A
distribution is not a function, but to the extent that it can be
identified with a particular function it's the CDF not the PDF (or
the MGF -- the Moment Generating Function -- but let's not even go
there). This is because the CDF is always defined for a distribution
and the PDF (technically defined as the derivative of the CDF) may
not be. Being slightly less pedantic, the *object* is the
distribution, not the value of the function. I realize these are all
pretty fine distinctions, but I would be much more comfortable if the
naming didn't actively mislead about the technical fine points.

>John Maddock has been muttering about using Boost.Interval with these
>functions.
>It's on his TODO list allegedly ;-)
>
>Would this help with the "CDF(x[ub]) - CDF(x[lb])"?

An interesting suggestion. Passing a single value to the function
would give the CDF from -Infinity. Passing an interval would
integrate over that interval. The problem is that, as I understand
it, Boost.Interval objects represent Interval Arithmetic intervals --
i.e., computational error bounds around an unknown correct
value. Using them to represent a more general range of reals
violates their semantics. I would expect the result of passing an
interval parameter to a CDF function to be an interval (easily
implemented for CDF since it's a non-decreasing function, but
potentially trickier for the PDF) not a single value. Using a pair
of T or something similar makes more sense, but it seems to me that
the constructor verbiage is a bit top heavy.
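
A rough sketch of what I mean by "interval in, interval out" for the CDF. The `bounds` struct is a hypothetical stand-in for an interval-arithmetic value, not the Boost.Interval API, and the sketch ignores the outward rounding that real interval arithmetic would apply.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for an interval-arithmetic value: the true argument
// is known only to lie somewhere in [lower, upper].
struct bounds {
    double lower;
    double upper;
};

// CDF of the standard normal at an exactly known point.
double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// CDF of an uncertain argument. Because the CDF is non-decreasing, the
// bounds on the result are simply the CDF evaluated at each input bound.
// (No outward rounding is done here, unlike true interval arithmetic.)
bounds normal_cdf(bounds x) {
    assert(x.lower <= x.upper);
    return bounds{normal_cdf(x.lower), normal_cdf(x.upper)};
}
```

The PDF is not monotone, so the same trick would need the location of the mode (and more for multimodal densities), which is the "potentially trickier" part.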

>And/or allow one to produce "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])"
>using the density/mass/distribution?

I would say using a range (but not an Interval) with the PDF does
feel a bit cleaner than with the CDF. Then a single value would
produce the PDF, a range from -Infinity would produce the same value
as the CDF, a range to Infinity would produce the same value as the
complementary CDF. Having to construct the range still would seem
unnecessary cruft. Just allow either one-argument or two-argument
forms (despite the "defaulted" parameter being the wrong one). I'd
almost give up my objections to calling that function "distribution".
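
A minimal sketch of the one-argument/two-argument idea, using the standard normal as the example. The class name and private helper are hypothetical, and the two-argument form is implemented as a plain difference of CDFs, which is only an assumption about a first implementation.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical distribution object for the standard normal.
class standard_normal {
public:
    // One-argument form: the density (PDF) at x.
    double distribution(double x) const {
        const double inv_sqrt_2pi = 0.3989422804014327;
        return inv_sqrt_2pi * std::exp(-0.5 * x * x);
    }

    // Two-argument form: the probability of falling between lb and ub.
    // With lb = -infinity this equals the CDF at ub; with ub = +infinity
    // it equals the complementary CDF at lb.
    double distribution(double lb, double ub) const {
        assert(lb <= ub);
        return cdf(ub) - cdf(lb);
    }

private:
    // erfc handles infinite arguments, giving exactly 0 and 1 at the ends.
    double cdf(double x) const {
        return 0.5 * std::erfc(-x / std::sqrt(2.0));
    }
};
```

So `n.distribution(0.0)` would be a density, while `n.distribution(-inf, 0.0)` and `n.distribution(0.0, inf)` would both be probabilities of 0.5, with no range object to construct.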

Of course I would not suggest blindly using that little approximation
I threw out. I just included it to make it clear that the value
could be distinctly different from 0 even when computing the
difference explicitly would lead to severe round-off problems.
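
To illustrate the round-off point, here is a small sketch for the standard normal far out in the upper tail. All function names are hypothetical. Around x = 10 the CDF is so close to 1 that in double precision it rounds to 1, so the explicit difference collapses, while the midpoint approximation still gives a small positive value.

```cpp
#include <cassert>
#include <cmath>

double normal_pdf(double x) {
    return 0.3989422804014327 * std::exp(-0.5 * x * x);
}

double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Upper tail, computed directly so it keeps precision where the CDF is near 1.
double normal_ccdf(double x) {
    return 0.5 * std::erfc(x / std::sqrt(2.0));
}

// Naive interval probability: CDF(ub) - CDF(lb).
double mass_by_cdf_difference(double lb, double ub) {
    assert(lb <= ub);
    return normal_cdf(ub) - normal_cdf(lb);
}

// The "little approximation": PDF at the midpoint times the width.
double mass_by_midpoint(double lb, double ub) {
    assert(lb <= ub);
    return normal_pdf(0.5 * (lb + ub)) * (ub - lb);
}

// Reference value for the upper tail: difference of complementary CDFs.
double mass_by_ccdf_difference(double lb, double ub) {
    assert(lb <= ub);
    return normal_ccdf(lb) - normal_ccdf(ub);
}
```

For the interval from 10 to 10.001, the CDF difference loses everything, whereas the midpoint value stays close to the reference computed from the complementary CDFs.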

That formula can be seen as either a zero-order numerical integration
or the first term of the difference of the Taylor series expanded
about the midpoint. Except for very small intervals you would
want to add more terms either way. The Taylor series improves
rapidly -- specifically quadratically (the next term is the second
derivative of the PDF at the midpoint times the cube of the interval
width divided by 24).
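
As a sketch of adding that next term, again assuming the standard normal (whose PDF has second derivative (x^2 - 1) times the PDF) and using hypothetical function names:

```cpp
#include <cassert>
#include <cmath>

double normal_pdf(double x) {
    return 0.3989422804014327 * std::exp(-0.5 * x * x);
}

// Second derivative of the standard normal PDF: f''(x) = (x^2 - 1) f(x).
double normal_pdf_second_derivative(double x) {
    return (x * x - 1.0) * normal_pdf(x);
}

// Zero-order approximation: PDF at the midpoint times the width.
double mass_zero_order(double lb, double ub) {
    assert(lb <= ub);
    const double h = ub - lb;
    return normal_pdf(0.5 * (lb + ub)) * h;
}

// Adds the next Taylor term: f''(midpoint) * h^3 / 24.
double mass_with_correction(double lb, double ub) {
    assert(lb <= ub);
    const double h = ub - lb;
    const double m = 0.5 * (lb + ub);
    return normal_pdf(m) * h + normal_pdf_second_derivative(m) * h * h * h / 24.0;
}

// Reference for upper-tail intervals: difference of complementary CDFs.
double mass_reference(double lb, double ub) {
    assert(lb <= ub);
    const double s = std::sqrt(2.0);
    return 0.5 * std::erfc(lb / s) - 0.5 * std::erfc(ub / s);
}
```

On a fairly wide interval such as 3 to 3.5, the zero-order value is off by several percent while the corrected value is within about a percent of the reference.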

You might run into some grey areas, though: regions where using the
difference would produce unacceptable roundoff loss but the width is
too large for effective use of small interval approximations.

As I said, for the first release, I'd just implement it using the
difference of the CDFs then worry about improving it later.

Topher

