
From: Topher Cooper (topher_at_[hidden])
Date: 2006-07-14 11:20:32


At 09:01 AM 7/14/2006, John Maddock wrote:

>So the quantile gives you the number of successes expected at a given
>probability, but for many scientists, they'll measure the number of successes
>and want to invert to get the probability of one success (parameter p).
>
>Hopefully, I've actually got this right this time, I'm sure someone will
>jump in if not.... ?

Jumping in.

That isn't a functional inversion at all. Given a particular set of
observations presumed to be a sample from an unknown member of a
family of distributions, one can define an estimator -- a computation
on the observed values -- for the distribution parameter. Generally
multiple estimators are available. We are interested in the
difference between the estimator and the unknown "true"
value. Through some indirect thinking we can treat the true value as
a random variable (sort of -- statisticians will cringe here), and
the difference becomes a random variable as well, with its own
distribution. Essentially, the point estimate is the mean or a
similar central value of that distribution. Current practice prefers
a confidence interval to a point estimate.
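
To make the distinction concrete, here is a minimal C++ sketch (the
function name and interface are my own invention, not a proposed Boost
API) of a point estimate and an approximate 95% confidence interval
for the binomial parameter p, given k measured successes in n trials.
It uses the normal (Wald) approximation, which is only reasonable for
large n with p away from 0 and 1:

     #include <cmath>
     #include <utility>

     // Returns (lower, upper) bounds of an approximate confidence
     // interval around the point estimate p_hat = k / n.
     std::pair<double, double> wald_interval(int k, int n, double z = 1.96)
     {
         double p_hat = double(k) / n;                      // point estimator for p
         double se = std::sqrt(p_hat * (1.0 - p_hat) / n);  // its standard error
         return std::make_pair(p_hat - z * se, p_hat + z * se);
     }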

Here is a common (and commonly confused) example of multiple
estimators. Suppose you have a sample of values and want an estimator
for the variance, with no theoretical knowledge of the mean. There
are two common choices:

     S1 = sum((x[i] - mean(x))^2) / N

and

     S2 = sum((x[i] - mean(x))^2) / (N-1)

Which should you use? The distribution of the error in the first has
a slightly smaller variance and so, in a sense, is the more accurate
estimator. The usual advice, though, is to go with the second. The
reason is that the first is biased, leading to the possibility of
accumulating systematic error, while the second is unbiased. The
difference is negligible for large samples; for small samples it is a
genuine trade-off, and either choice can be defended.
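
In code, the two estimators are just the formulas above. A sketch
(helper names are mine):

     #include <cstddef>
     #include <numeric>
     #include <vector>

     double sample_mean(const std::vector<double>& x)
     {
         return std::accumulate(x.begin(), x.end(), 0.0) / x.size();
     }

     double sum_sq_dev(const std::vector<double>& x)  // sum((x[i] - mean(x))^2)
     {
         double m = sample_mean(x);
         double ss = 0.0;
         for (std::size_t i = 0; i < x.size(); ++i)
             ss += (x[i] - m) * (x[i] - m);
         return ss;
     }

     double s1(const std::vector<double>& x)  // biased, slightly smaller MSE
     {
         return sum_sq_dev(x) / x.size();
     }

     double s2(const std::vector<double>& x)  // unbiased (Bessel's correction)
     {
         return sum_sq_dev(x) / (x.size() - 1);
     }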

Note:

      1) Estimators can be for any population statistic, not just
ones that happen to be used as parameters for the distribution family.

      2) As I said, there can be more than one estimator for a given
statistic. For example, the sample median may be used as an estimator
for the population mean when symmetry can be assumed, since it is
less sensitive to outliers than the sample mean.

      3) Estimators are arbitrary computations on a sample of values,
and those values may not relate as directly to a distribution
parameter as the "hit count" does in your example. They are not, in
general, a matter of plugging a simple set of known scalar values
into a formula.

      4) You are also interested in auxiliary information about an
estimator -- basically, information about the distribution of its
error around the true population statistic. For example, when you use
the sample mean to estimate the distribution parameter mu (or
equivalently, the population mean) of a presumed normal distribution,
you are interested in the "standard error": the estimated standard
deviation of the estimator around the true mean (a sketch follows
these notes).
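
As a sketch of note 4 (again not a proposed interface; it reuses the
s2() helper from the variance sketch above as the unbiased variance
estimate), the standard error of the sample mean is:

     #include <cmath>
     #include <vector>

     // Estimated standard deviation of the sample-mean estimator
     // itself around the true mean: s / sqrt(N).
     double standard_error(const std::vector<double>& x)
     {
         return std::sqrt(s2(x) / x.size());
     }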

I don't think that this is really the kettle of worms you want to open up.

>All of which means that in addition to a "generic" interface - however it
>turns out - we will still need distribution-specific ad-hoc functions to
>invert for the parameterisation values, as well as the random variable.

Now there, I agree with you. Putting some commonly used computations
in (e.g., standard error given sample size and sample standard
deviation) would be nice. But don't kid yourself that you are going
to build in all of, say, Regress into this library in any reasonable
amount of time. Hit the high points and don't even try for completeness.

Topher

