
From: Paul A Bristow (pbristow_at_[hidden])
Date: 2006-07-13 12:39:41
 -----Original Message-----
 From: boost-bounces_at_[hidden]
 [mailto:boost-bounces_at_[hidden]] On Behalf Of Deane Yang
 Sent: 12 July 2006 23:14
 To: boost_at_[hidden]
 Subject: Re: [boost] [math/staticstics/design] How best to name statistical functions?

 Topher Cooper wrote:
 > At 05:11 AM 7/12/2006, you wrote:
 >> T distribution(T x) const; // Probability Density Function or pdf or p
 >> T cumulative_probability(T x) const; // Cumulative Distribution Function. P
 >>
 >> cumulative_probability is too long :(
 >>
 >> Do we REALLY need the cumulative here?
 >>
 >> T probability(T x) const; // Cumulative Distribution Function or cdf or P
 >
 > Sorry, as attractive as it seems at first blush, I think just
 > "probability" is a very poor choice. ...

 <explanation about why and discussion about using intervals snipped>

 I definitely do not want to use the same function name for both the
 density function and the cumulative probability. Your point about people
 confusing the meaning of the density function is on the mark, and I
 think using the same function name will only exacerbate the confusion.

 So I would still vote for:

 double density(double x) const;

 (Despite the origin of the word "density" from physics, it is definitely
 used by mathematicians, statisticians, and engineers to mean exactly
 this. And I agree that the word "distribution" is not a synonym for
 "density".)

 On the other hand, I like the idea of using an interval type for the
 "probability" function and requiring an explicit interval constructor
 when calling the function, like

 student_t dist(2.0);
 double p = dist.probability(interval(1.0, 2.0));
 double q = dist.probability(interval(-infinity, 1.0));

 To me, syntax like this just makes it easier for me to understand what's
 going on.

 And I agree that we shouldn't just use the Boost Interval library. I
 think we should define an interval class specific to the statistics
 library, where the left endpoint is allowed to be -infinity and the
 right endpoint +infinity.

 Then we get a syntax that is easy to read and understand, and we don't
 need to come up with a good name for the cumulative or complementary
 cumulative probability functions.
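
For illustration only (this is not the attached .cpp), here is a minimal sketch of the interval idea. The `interval` struct, the `infinity` constant and the `interval_sketch` namespace are hypothetical names, and a standard normal stands in for Student's t so that the code needs nothing beyond `std::erfc`.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

namespace interval_sketch {

// Either endpoint of an interval may be infinite.
const double infinity = std::numeric_limits<double>::infinity();

// Hypothetical statistics-specific interval (lower, upper].
struct interval {
    double lower;
    double upper;
    interval(double lo, double hi) : lower(lo), upper(hi) {
        assert(lo <= hi);  // endpoints must be ordered
    }
};

// Standard normal used as a stand-in distribution (an assumption of this sketch).
class standard_normal {
public:
    // Probability that X falls in the interval: F(upper) - F(lower).
    double probability(const interval& i) const {
        return cdf(i.upper) - cdf(i.lower);
    }

private:
    // Cumulative distribution function via the complementary error function.
    static double cdf(double x) {
        return 0.5 * std::erfc(-x / std::sqrt(2.0));
    }
};

}  // namespace interval_sketch
```

With this, `dist.probability(interval(-infinity, 1.0))` plays the role of the cumulative function and `dist.probability(interval(1.0, infinity))` the role of its complement, so neither needs a name of its own.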
I've quickly knocked up a very rough sketch of how it might look
(attached a zip of a .cpp run on MSVC 8.0).
I'm sure you can suggest improvements to this.
Seeing it used makes me still quite like a single function name
'probability' (with one parameter for pdf and two for cdf(s)), but I am willing
to be outvoted. Neat but riskier.
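
To make the trade-off concrete, here is a hedged sketch of the single-name variant (again not the attached code). The class and namespace names are hypothetical and a standard normal is used as a stand-in.

```cpp
#include <cassert>
#include <cmath>

namespace overload_sketch {

class standard_normal {
public:
    // One argument: probability density at x (pdf).
    double probability(double x) const {
        const double sqrt_two_pi = 2.5066282746310002;
        return std::exp(-0.5 * x * x) / sqrt_two_pi;
    }

    // Two arguments: P(lower < X <= upper), computed from the cdf.
    double probability(double lower, double upper) const {
        assert(lower <= upper);  // endpoints must be ordered
        return cdf(upper) - cdf(lower);
    }

private:
    static double cdf(double x) {
        return 0.5 * std::erfc(-x / std::sqrt(2.0));
    }
};

}  // namespace overload_sketch
```

The risk is visible in the signatures: the one-argument overload returns a density, which is not a probability at all, yet both calls read as `dist.probability(...)`.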
I also attached a response from Daniel Egloff making a similar, but more
advanced proposal.
(as John notes, the downside with a class is difficulty of extension).
However, I am just about to go on holiday for two weeks, so I will leave you
all to discuss further, and hope you've got everything sorted out and some
example code written by the time I get back ;))
Thanks
Paul
 Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB +44 1539561830 & SMS, Mobile +44 7714 330204 & SMS pbristow_at_[hidden]
attached mail follows:
Just got these comments in my mailbox, thought you should see them too, not
sure I understand it all yet, but the suggestion is to define a distribution
as a parameterized object with non-member access functions, and adapters to
convert the object to do other things. The main advantage I see is that
it's both extensible and generic (classes aren't extensible in the sense
that you have to hack the source to add additional member functions).
John.
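
A minimal sketch of what "a parameterized object with non-member access functions" could look like. All names here (`normal`, `density`, `cumulative`, `variance`) are hypothetical and chosen for illustration; they are not taken from the proposal.

```cpp
#include <cassert>
#include <cmath>

namespace nonmember_sketch {

// The distribution object only carries its parameterization.
struct normal {
    double mean;
    double sd;
    normal(double m, double s) : mean(m), sd(s) {
        assert(s > 0.0);  // standard deviation must be positive
    }
};

// Non-member access functions.
inline double density(const normal& d, double x) {
    const double sqrt_two_pi = 2.5066282746310002;
    const double z = (x - d.mean) / d.sd;
    return std::exp(-0.5 * z * z) / (d.sd * sqrt_two_pi);
}

inline double cumulative(const normal& d, double x) {
    return 0.5 * std::erfc(-(x - d.mean) / (d.sd * std::sqrt(2.0)));
}

// A feature added later, without editing the struct above.
inline double variance(const normal& d) {
    return d.sd * d.sd;
}

}  // namespace nonmember_sketch
```

The extensibility point is the last function: a user can add `variance` (or any other feature) in their own code, whereas a new member function would mean changing the class.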
daniel.egloff_at_[hidden] wrote:
> Dear all
>
> I find the underscore notation a bad approach. Also a loose approach
> like "wacky scheme 3" is bad.
> I prefer to construct an object with the right parameterization and to
> interrogate its functional aspects via feature "extractors" which return
> the resulting function as a functor, which then can be passed to
> an algorithm. This is the usual use case for applying such a library.
>
> It looks like we might use the accumulator set extractor mechanism,
> and the feature grouping, because some features depend on others!
> Same story as with our stats library on top of the accumulator set
> framework.
>
> I hate short forms like P and Q below. They are not universal and there
> is some dispute among mathematicians about how to name and label things.
> Clear self-explaining names are much better, because novice users and
> non-domain experts profit a lot from a clear and self-explaining naming
> scheme.
>
> Also Paul Bristow has forgotten some other aspects that are VERY
> important. For all the distributions the following quantities can be
> calculated as a function of the parametrization; they often require
> special functions or some numerical work, like inversion or
> zero search via Newton or other methods.
>
> quantile function, i.e. inverse cumulative distribution function
> q(\alpha) = \inf\{x : F(x) > \alpha\}
>
> density (can be discrete -> time_series library!)
>
> expectation E[X]
>
> variance E[X^2] - E[X]^2
>
> k-th moments (2 and higher) E[X^k]
>
> cumulant generating function log E[e^{tX}] (might be
> implemented differently than the moment generating function)
>
> Laplace transform or moment generating function E[e^{tX}]
>
> probability generating function for discrete probability
> distributions
>
> characteristic function E[e^{izX}]
>
> (log) likelihood function as a function of the
> parameters -> usable for maximum likelihood estimation
>
>
> Some features sometimes do not exist: for example, the Cauchy
> distribution does not have a mean.
> A Pareto distribution with 1 < \alpha <= 2 has a mean but not a finite
> second moment, and so on....
>
> With so many functional aspects of a random variable, they can be
> grouped as features, e.g.
>
> /* internal */
> typedef accumulator_set<
>     double /* gives the numerical resolution of the real arithmetic */
>   , feature<gauss_density, gauss_distribution_function,
>             gauss_cumulative_distribution, gauss_moments<infinity>......> >
>   gauss_distribution;
>
> // internally we have a lot of feature names, but you can't get around
> // that anyway.
>
> /* usage */
> gauss_distribution g(mean = 0.0, variance = 1.0);
> assert(expectation(g) == 0.0 && variance(g) == 1.0);
>
> UnaryFunction f = CumulativeDistribution(g);
> double p = f(0);
> assert(p == 0.5);
>
> // .... or use f in an algorithm.....
>
> /* likelihood function */
> std::vector<double> vec;
> vec = x1, x2, ..., xn;   // pseudocode: the observed samples
> BinaryFunction l = Likelihood(g, samples = vec);
>
> // vec = x_1, ..., x_n
> // l(mu, sigma) is proportional to
> //   (sigma^2)^{-n/2} exp(-sum_i (x_i - mu)^2 / (2 sigma^2))
>
> // .... use l in an algorithm: e.g. give it to an optimizer to find mu
> // and sigma, fitting a Gaussian to the sample...
>
> // much more to come....
>
>
> Internally you still need a lot of "features" but they are nicely
> grouped and exposed as a container to the user.
> The parametrization is transparent. And you can combine it with our
> iterative stats library.
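
A much-reduced sketch of the extractor idea in plain C++, without the accumulator framework. It assumes a hypothetical `gauss_distribution` struct and uses `std::function` where the pseudocode says UnaryFunction and BinaryFunction. Unlike the pseudocode, the likelihood extractor here takes only the samples and returns the log-likelihood.

```cpp
#include <cmath>
#include <functional>
#include <vector>

namespace extractor_sketch {

// Hypothetical parameterized distribution object.
struct gauss_distribution {
    double mean;
    double variance;
};

// Scalar features.
inline double expectation(const gauss_distribution& g) { return g.mean; }
inline double variance(const gauss_distribution& g) { return g.variance; }

// Extractor: returns the cdf of g as a unary function object.
inline std::function<double(double)>
cumulative_distribution(const gauss_distribution& g) {
    return [g](double x) {
        return 0.5 * std::erfc(-(x - g.mean) / std::sqrt(2.0 * g.variance));
    };
}

// Extractor: Gaussian log-likelihood of the samples as a function of (mu, sigma).
inline std::function<double(double, double)>
log_likelihood(const std::vector<double>& samples) {
    return [samples](double mu, double sigma) {
        const double two_pi = 6.283185307179586;
        double sum_sq = 0.0;
        for (double x : samples) sum_sq += (x - mu) * (x - mu);
        return -0.5 * samples.size() * std::log(two_pi * sigma * sigma)
               - sum_sq / (2.0 * sigma * sigma);
    };
}

}  // namespace extractor_sketch
```

The returned function objects can be handed to any algorithm that accepts a callable, for example an optimizer that maximizes the log-likelihood over mu and sigma.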
>
> For an idea what should/could be provided the Mathematica statistics
> package is a good example!
> It follows a similar design as indicated above.
>
> How does that sound?
>
> How do you think we should proceed? I might be interested, at an open
> source project level, not on a commercial level.
>
>
> Kind regards
> Daniel Egloff
> Zürcher Kantonalbank, ZEF
> Josefstrasse 222, 8005 Zürich
> Tel. +41 (0) 44 292 45 33, Fax +41 (0) 44 292 45 95
> Postal address: P.O. Box, 8010 Zürich, http://www.zkb.ch
>
>
> From: David Abrahams <dave_at_boost-consulting.com>
> Date: 08.07.2006 18:48
> To: Matthias Troyer <troyer_at_[hidden]>, Eric Niebler <eric_at_[hidden]>,
>     Daniel Egloff <daniel.egloff_at_[hidden]>, daniel.egloff_at_[hidden]
> Cc:
> Subject: Thought you might have a stake in this...
>
> Message from "John Maddock" <john_at_[hidden]> on Sat, 8 Jul 2006 17:37:28 +0100
>
> Subject: [math/staticstics/design] How best to name statistical functions?
>
>
> Paul Bristow has been toiling away producing some statistical functions on
> top of some of my Math special functions, and we've encountered a bit of a
> naming dilemma that I hope the ever resourceful Boosters can solve for us
> :)
>
> For a given cumulative distribution function (I'm going to use the
> students-t function as an example below) there are two (or maybe three)
> variations:
>
> P: this is the regular cumulative distribution function, and is a rising
> function in its argument (rises from 0 to 1).
>
> Q: this is 1-P and is also known as the complement of the cumulative
> distribution function. It falls from 1 to 0 over the range of its
> argument.
>
> A: this is less well used and is P-Q or 1-2Q depending upon your point of
> view.
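
The three variants can be sketched as follows. The code uses the standard normal as a stand-in for students-t (an assumption made so that `std::erfc` suffices); the names P, Q and A follow the text, while the namespace is hypothetical.

```cpp
#include <cmath>

namespace pq_sketch {

// P: cumulative distribution function, rises from 0 to 1.
inline double P(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Q: complement of P, computed directly rather than as 1 - P(x).
inline double Q(double x) {
    return 0.5 * std::erfc(x / std::sqrt(2.0));
}

// A: P - Q, written in the equivalent form 1 - 2Q.
inline double A(double x) {
    return 1.0 - 2.0 * Q(x);
}

}  // namespace pq_sketch
```

Having Q as a separate function is not just a convenience. Far in the upper tail P(x) rounds to 1 in double precision, so 1 - P(x) evaluates to 0, while a directly computed Q(x) keeps its significant digits. This is the same reason erfc exists alongside erf.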
>
> Naming scheme 1:
> ~~~~~~~~~~~~~~~~
>
> We have the reasonably obvious:
>
> students_t(df,x) : calculates P
> students_t_c(df,x) : calculates Q
>
> However that varies slightly from the existing practice of erf/erfc, which if
> followed here would lead to:
>
> students_t(df,x) : calculates P
> students_tc(df,x) : calculates Q
>
> but the lack of the underscore doesn't look right to me.
>
> Naming Scheme 2:
> ~~~~~~~~~~~~~~~~
>
> How about we call a spade a spade and use:
>
> students_t_P(df,x) : calculates P
> students_t_Q(df,x) : calculates Q
>
> Not pretty, but the P and Q notations are universally used in the
> literature, and of course we could handle the A case as well if that was
> felt to be needed.
>
> It doesn't follow normal Boost all_lower_case_names either, but since lower
> case "p" and "q" have slightly different meanings in the literature (they're
> for values of P and Q) I'm less keen on:
>
> students_t_p(df,x) : calculates P
> students_t_q(df,x) : calculates Q
>
> Wacky Scheme 3:
> ~~~~~~~~~~~~~~~
>
> Both of the above suffer from a rather spectacular explosion of function
> prototypes once you include every variant for each distribution. An
> alternative using named parameters might be:
>
> P(dist=students_t, df=4, x=5.2); // P for 4 degrees freedom and x=5.2
> Q(dist=students_t, df=5, x=20.0); // Q for 5 degrees freedom and x=20.0
>
> But of course internally this would have to forward to something like (1) or
> (2) so it doesn't actually save you any implementation effort, just reduces
> the number of names.
>
> Inverses:
> ~~~~~~~~~
>
> And if that's not enough, we also have inverses:
>
> * Calculate x given degrees of freedom and P.
> * Calculate x given degrees of freedom and Q.
> * Calculate degrees of freedom given x and P.
> * Calculate degrees of freedom given x and Q.
>
> At present we're looking at something like:
>
> students_t_inv(df,p); // Calculate x given degrees of freedom and P.
>
> But the other variants don't have obvious names under this scheme?
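
A sketch of two of these inverses, found by plain bisection on the cumulative function. A normal distribution stands in for students-t, so the "parameter" solved for is the standard deviation rather than the degrees of freedom. All function names are hypothetical, and a real implementation would likely use a faster root finder.

```cpp
#include <cassert>
#include <cmath>

namespace inverse_sketch {

// Cumulative distribution function of a normal with the given mean and sd.
inline double normal_cdf(double mean, double sd, double x) {
    return 0.5 * std::erfc(-(x - mean) / (sd * std::sqrt(2.0)));
}

// Bisection for a root of f on [lo, hi]; f(lo) and f(hi) must differ in sign.
template <class F>
double bisect(F f, double lo, double hi) {
    double f_lo = f(lo);
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        const double f_mid = f(mid);
        if ((f_mid < 0.0) == (f_lo < 0.0)) {
            lo = mid;      // root lies in the upper half
            f_lo = f_mid;
        } else {
            hi = mid;      // root lies in the lower half
        }
    }
    return 0.5 * (lo + hi);
}

// Calculate x given the parameters and P.
inline double inv_x(double mean, double sd, double p) {
    assert(p > 0.0 && p < 1.0);
    return bisect([=](double x) { return normal_cdf(mean, sd, x) - p; },
                  mean - 40.0 * sd, mean + 40.0 * sd);
}

// Calculate the parameter sd given x and P (this sketch needs x > mean, 0.5 < p < 1).
inline double inv_sd(double mean, double x, double p) {
    assert(x > mean && p > 0.5 && p < 1.0);
    return bisect([=](double sd) { return normal_cdf(mean, sd, x) - p; },
                  1e-6, 1e6);
}

}  // namespace inverse_sketch
```

Both inverses reuse the same forward function and differ only in which argument is treated as the unknown, which is why a single `_inv` suffix does not distinguish them.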
>
> So I'm hoping some Boosters can work their usual naming magic :)
>
> Many thanks,
>
> John.
>
>