
From: Paul A Bristow (pbristow_at_[hidden])
Date: 2006-07-11 04:38:58
 -----Original Message-----
 From: boost-bounces_at_[hidden]
 [mailto:boost-bounces_at_[hidden]] On Behalf Of Deane Yang
 Sent: 10 July 2006 21:41
 To: boost_at_[hidden]
 Subject: Re: [boost] [math/staticstics/design] How best to
 name statistical functions?

 Paul A Bristow wrote:
 >  -----Original Message-----
 >  From: Kevin Lynch

 >  Why not hide the functions behind a class interface?
 After all, the
 >  various functions are "properties" of the distributions. Hence:
 > 
 >  class students_t {
 >  students_t(double mu);
 >  double P(double x);
 >  double Q(double x);
 >  double invP(double p); (or perhaps inverseP or Pinv or
 >  something)
 >  .....
 >  }
 > 
 >  class normal {
 >  normal(double mu, double sigma);
 >  double P(double x);
 >  double Q(double x);
 >  double invP(double x);
 >  ......
 >  }
 >
 > Rather interesting idea.

 I support Kevin's proposal rather strongly, for exactly the reasons he
 states. But I'm not sure what P, Q, invP mean. I would prefer:

 double density(double x);
 double cumulative(double x);
 double inverse_cumulative(double y);
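
For illustration, here is a minimal sketch of how a class with these three member names might look for the normal distribution. It is not code from this thread or from any Boost library: the constructor arguments follow Kevin's normal(double mu, double sigma), the member names follow the proposal above, and computing the inverse by bisection is simply an assumption chosen to keep the sketch short.

```cpp
#include <cmath>

// Hypothetical sketch: a normal distribution exposing the three proposed members.
class normal {
public:
    normal(double mu, double sigma) : mu_(mu), sigma_(sigma) {}

    // Probability density at x.
    double density(double x) const {
        const double pi = 3.14159265358979323846;
        const double z = (x - mu_) / sigma_;
        return std::exp(-0.5 * z * z) / (sigma_ * std::sqrt(2.0 * pi));
    }

    // Cumulative probability P(X <= x), via the complementary error function.
    double cumulative(double x) const {
        return 0.5 * std::erfc(-(x - mu_) / (sigma_ * std::sqrt(2.0)));
    }

    // The x for which cumulative(x) == y, for 0 < y < 1.
    // Found by bisection (an assumption of this sketch, not a recommended method).
    double inverse_cumulative(double y) const {
        double lo = mu_ - 40.0 * sigma_;
        double hi = mu_ + 40.0 * sigma_;
        for (int i = 0; i < 200; ++i) {
            const double mid = 0.5 * (lo + hi);
            if (cumulative(mid) < y) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

private:
    double mu_;
    double sigma_;
};
```

With this, normal(0.0, 1.0).inverse_cumulative(0.975) gives roughly 1.96, the familiar two-sided 95% point.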

 > How would you envisage this working with Fisher, for example, which has
 > degrees of freedom 1 and 2, and a variance ratio?
 >
 > Is this 1D, 2D or 3D?
 >
 > Its inversion will return df1 (given df2, F and probability),
 > or df2 (given df1, F and probability),
 > or F (given df1, df2 and probability).
 >
 > Would you like to flesh out how you suggest handling all these?
 >
 >

 Could you clarify your question? Isn't the F distribution still the
 probability distribution of a single real random variable? The
 cumulative and inverse cumulative distribution functions have a
 consistent mathematical meaning for any 1-dimensional probability
 distribution, do they not?
Well, if you regard the degrees of freedom as fixed, or the probability as
fixed (often at 95%), then yes. But I would say that they are 2D (and
others 3D) distributions.

To keep it simpler, let's go back to Student's t, which I have
implemented (actually as templates, but ignore that for now) as

double students_t(double degrees_of_freedom, double t)

t is roughly a measure of the difference between two things (means, for
example), and this returns the probability that the things are different.

If degrees_of_freedom is small (you only measured 3 times, say),
then t can be big, but it still doesn't mean much.
But if you made 100 measurements, it probably does.

When you do the inverse, you may want to say: I want to be 95% confident,
and I have already fixed the degrees_of_freedom, so what is the
corresponding value for t? This is what the ubiquitous Student's t tables do.

On the other hand, sometimes you may decide you want 95% confidence, and you
have already made some measurements of t, but you want to know how many
(more, probably) measurements (degrees_of_freedom) you would have to make to
get this 95%.

This is a common problem, and it often reveals, in drug trials for example,
that there are not enough potential patients available to carry out a trial
and achieve a 95% probability.

If you accept this, then the problem is how to name the two, or three,
'inverses' (and complements).

students_t_inv_t and students_t_inv_df ???
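
To make the two inverses concrete, the following is a rough sketch under stated assumptions, not the implementation discussed here. It assumes students_t(df, t) returns the one-sided cumulative probability P(T <= t), evaluates it by Simpson integration of the density, and finds each inverse by bisection. The names students_t_inv_t and students_t_inv_df are the ones floated above; students_t_density is a hypothetical helper.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: density of Student's t with df > 0 degrees of freedom.
double students_t_density(double df, double t) {
    const double pi = 3.14159265358979323846;
    const double log_norm = std::lgamma((df + 1.0) / 2.0) - std::lgamma(df / 2.0)
                            - 0.5 * std::log(df * pi);
    return std::exp(log_norm - 0.5 * (df + 1.0) * std::log1p(t * t / df));
}

// Assumed meaning: one-sided cumulative probability P(T <= t).
// Composite Simpson's rule on [0, |t|], then symmetry about 0.
double students_t(double df, double t) {
    const int n = 2000;                      // even number of intervals
    const double a = std::fabs(t);
    const double h = a / n;
    double sum = students_t_density(df, 0.0) + students_t_density(df, a);
    for (int i = 1; i < n; ++i)
        sum += (i % 2 ? 4.0 : 2.0) * students_t_density(df, i * h);
    const double half = sum * h / 3.0;       // integral over [0, |t|]
    return t >= 0.0 ? 0.5 + half : 0.5 - half;
}

// Inverse in t: df fixed, find t with students_t(df, t) == p, for 0.5 <= p < 1.
double students_t_inv_t(double df, double p) {
    double lo = 0.0, hi = 1.0;
    while (students_t(df, hi) < p && hi < 1.0e6) { lo = hi; hi *= 2.0; }
    for (int i = 0; i < 60; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (students_t(df, mid) < p) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// Inverse in df: t > 0 fixed, find the (real-valued) df with
// students_t(df, t) == p. Returns NaN if there is no solution in [1, 1e6].
double students_t_inv_df(double t, double p) {
    double lo = 1.0, hi = 1.0e6;
    if (students_t(lo, t) > p || students_t(hi, t) < p)
        return std::numeric_limits<double>::quiet_NaN();
    for (int i = 0; i < 60; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (students_t(mid, t) < p) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

For example, students_t_inv_t(10, 0.975) is about 2.228, the usual table entry, while students_t_inv_df(2.0, 0.975) comes out close to 60, i.e. roughly 60 degrees of freedom are needed before t = 2 reaches that probability. Both functions take two arguments and return one, which is why a single name "inverse" is not enough.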
Paul
PS I also worry about the risk of code bloat. At present, I think that you
don't pay for what you don't use. We certainly don't want all the possible
functions discussed above instantiated, even for one floating-point type, if
only one function is actually used.
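
On the code-bloat worry, one relevant C++ rule is that member functions of a class template are only instantiated when they are actually used. The sketch below illustrates this; the class name students_t_dist and its members are hypothetical, and the static_assert is there only to make the effect visible: density() could not compile for int, yet the int instantiation is fine as long as density() is never called.

```cpp
#include <cmath>
#include <type_traits>

// Hypothetical class-template version of a distribution interface.
template <class Real>
class students_t_dist {
public:
    explicit students_t_dist(Real df) : df_(df) {}

    Real degrees_of_freedom() const { return df_; }

    // Only instantiated if some code calls it for this Real.
    Real density(Real t) const {
        static_assert(std::is_floating_point<Real>::value,
                      "density() requires a floating-point type");
        const Real pi = static_cast<Real>(3.14159265358979323846L);
        const Real log_norm = std::lgamma((df_ + 1) / 2) - std::lgamma(df_ / 2)
                              - std::log(df_ * pi) / 2;
        return std::exp(log_norm - (df_ + 1) / 2 * std::log1p(t * t / df_));
    }

private:
    Real df_;
};

// Uses only degrees_of_freedom(), so density() is never instantiated for int
// and its static_assert never fires.
inline int df_of_int_dist() {
    return students_t_dist<int>(7).degrees_of_freedom();
}
```

This only covers implicit instantiation; an explicit "template class students_t_dist<int>;" would instantiate every member and fail on the static_assert.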
 Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB +44 1539561830 & SMS, Mobile +44 7714 330204 & SMS pbristow_at_[hidden]