
From: Paul A Bristow (pbristow_at_[hidden])
Date: 2006-07-13 12:39:41

| -----Original Message-----
| From: boost-bounces_at_[hidden]
| [mailto:boost-bounces_at_[hidden]] On Behalf Of Deane Yang
| Sent: 12 July 2006 23:14
| To: boost_at_[hidden]
| Subject: Re: [boost] [math/staticstics/design] How best
| to name statistical functions?
|
| Topher Cooper wrote:
| > At 05:11 AM 7/12/2006, you wrote:
| >> T distribution(T x) const; // Probability Density Function or pdf or p
| >> T cumulative_probability(T x) const; // Cumulative Distribution Function. P
| >>
| >> cumulative_probability is too long :-(
| >>
| >> Do we REALLY need the cumulative here?
| >>
| >> T probability(T x) const; // Cumulative Distribution Function or cdf or P
| >
| > Sorry, as attractive as it seems at first blush, I think just
| > "probability" is a very poor choice. ...
|
| <explanation about why and discussion about using intervals snipped>
|
| I definitely do not want to use the same function name for both the
| density function and the cumulative probability. Your point about
| confusing the meaning of the density function is on the mark, and I
| think using the same function name will only exacerbate the
| confusion.
|
| So I would still vote for:
|
| double density(double x) const;
|
| (Despite the origin of the word "density" in physics, it is definitely
| used by mathematicians, statisticians, and engineers to mean exactly
| this. And I agree that the word "distribution" is not a synonym for
| "density".)
|
| On the other hand, I like the idea of using an interval type for the
| "probability" function and requiring an explicit interval constructor
| when calling the function, like
|
| student_t dist(2.0);
| double p = dist.probability(interval(-1.0, 2.0));
| double q = dist.probability(interval(-infinity, -1.0));
|
| To me, syntax like this just makes it easier to understand what's
| going on.
|
| And I agree that we shouldn't just use the Boost Interval library. I
| think we should define an interval class specific to the statistics
| library, where the left endpoint is allowed to be -infinity and the
| right endpoint +infinity.
|
| Then we get a syntax that is easy to read and understand, and we don't
| need to come up with a good name for the cumulative or complementary
| cumulative probability functions.
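
A minimal C++ sketch of the interval-based interface proposed above. The names `interval` and `standard_normal` are hypothetical, and the standard normal is only a stand-in for Student's t, chosen because its CDF can be written with `std::erfc` from the standard library.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical interval type for the statistics library: either endpoint
// may be infinite, and the caller must construct it explicitly.
struct interval {
    double lower;
    double upper;
    interval(double lo, double hi) : lower(lo), upper(hi) {
        assert(lo <= hi);  // endpoints must be ordered
    }
};

const double infinity = std::numeric_limits<double>::infinity();

// Stand-in distribution (assumption: standard normal, not Student's t).
class standard_normal {
public:
    // Probability density function at x.
    double density(double x) const {
        const double inv_sqrt_2pi = 0.3989422804014327;
        return inv_sqrt_2pi * std::exp(-0.5 * x * x);
    }

    // P(lower < X <= upper). Passing an infinite endpoint gives the
    // cumulative probability or its complement.
    double probability(const interval& i) const {
        return cdf(i.upper) - cdf(i.lower);
    }

private:
    static double cdf(double x) {
        // erfc copes with infinities: erfc(+inf) = 0, erfc(-inf) = 2.
        return 0.5 * std::erfc(-x / std::sqrt(2.0));
    }
};
```

With this shape, `dist.probability(interval(-infinity, x))` plays the role of the cumulative distribution function and `dist.probability(interval(x, infinity))` that of its complement. Taking the difference of two CDF values loses precision far out in the upper tail, so a real implementation would probably special-case infinite endpoints.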

I've quickly knocked up a very rough sketch of how it might look
(attached is a zip of a .cpp, run on MSVC 8.0).

I'm sure you can suggest improvements to this.

Seeing it used makes me still quite like a single function name
'probability' (with one parameter for the pdf and two for the cdf(s)), but I am
willing to be out-voted. Neat but riskier.

I also attached a response from Daniel Egloff making a similar, but more

(as John notes, the downside with a class is difficulty of extension).

However, I am just about to go on holiday for two weeks, so I will leave you
all to discuss further, and hope you've got everything sorted out and some
example code written by the time I get back ;-))

Thanks

Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow_at_[hidden]

attached mail follows:

Just got these comments in my mailbox and thought you should see them too. I'm
not sure I understand it all yet, but the suggestion is to define a distribution
as a parameterized object with non-member access functions, and adapters to
convert the object to do other things. The main advantage I see is that
it's both extensible and generic (classes aren't extensible in the sense
that you have to hack the source to add additional member functions).
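
A rough C++ sketch of the "parameterized object plus non-member access functions" idea. All names here (`exponential_dist`, `uniform_dist`, `pdf`, `cdf`, `mean`, `complement`) are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <cmath>

// Distributions are plain parameterized objects with no member functions
// beyond construction.
struct exponential_dist {
    double rate;
    explicit exponential_dist(double r) : rate(r) { assert(r > 0.0); }
};

struct uniform_dist {
    double a, b;
    uniform_dist(double lo, double hi) : a(lo), b(hi) { assert(lo < hi); }
};

// Non-member accessors, one overload per distribution.
inline double pdf(const exponential_dist& d, double x) {
    return x < 0.0 ? 0.0 : d.rate * std::exp(-d.rate * x);
}
inline double cdf(const exponential_dist& d, double x) {
    return x < 0.0 ? 0.0 : -std::expm1(-d.rate * x);  // 1 - exp(-rate * x)
}
inline double pdf(const uniform_dist& d, double x) {
    return (x < d.a || x > d.b) ? 0.0 : 1.0 / (d.b - d.a);
}
inline double cdf(const uniform_dist& d, double x) {
    if (x <= d.a) return 0.0;
    if (x >= d.b) return 1.0;
    return (x - d.a) / (d.b - d.a);
}

// A new accessor can be added later without touching the structs above.
inline double mean(const exponential_dist& d) { return 1.0 / d.rate; }
inline double mean(const uniform_dist& d) { return 0.5 * (d.a + d.b); }

// Generic code works with any type for which cdf(d, x) is defined.
template <class Dist>
double complement(const Dist& d, double x) {
    return 1.0 - cdf(d, x);
}
```

The point of the sketch is the extension story: `mean` was added as free functions, which a user could do in their own code, whereas adding a member function would mean editing the class.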

John.

daniel.egloff_at_[hidden] wrote:
> Dear all
>
> I find the underscore notation a bad approach. A loose approach
> like "wacky scheme 3" is also bad.
> I prefer to construct an object with the right parameterization and to
> interrogate its functional aspects
> via feature "extractors" which return the resulting function as a
> functor, which can then be passed to
> an algorithm. This is the usual use case when applying such a library.
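
A small C++ sketch of the extractor idea described in the paragraph above: a parameterized object, and free functions that return a feature of it as a functor. The names `gauss_distribution`, `cumulative_distribution`, `density` and `tabulate` are hypothetical, and `std::function` is just one possible functor type.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Parameterized object (hypothetical): a Gaussian with mean and sigma.
struct gauss_distribution {
    double mean;
    double sigma;
    gauss_distribution(double m, double s) : mean(m), sigma(s) {
        assert(s > 0.0);
    }
};

// Feature extractor: returns the CDF as a unary functor that captures
// the parameters by value.
inline std::function<double(double)>
cumulative_distribution(const gauss_distribution& g) {
    return [g](double x) {
        return 0.5 * std::erfc(-(x - g.mean) / (g.sigma * std::sqrt(2.0)));
    };
}

// Feature extractor: returns the density as a unary functor.
inline std::function<double(double)> density(const gauss_distribution& g) {
    return [g](double x) {
        const double inv_sqrt_2pi = 0.3989422804014327;
        const double z = (x - g.mean) / g.sigma;
        return inv_sqrt_2pi / g.sigma * std::exp(-0.5 * z * z);
    };
}

// A generic algorithm that only sees a unary function.
template <class UnaryFunction>
std::vector<double> tabulate(UnaryFunction f, const std::vector<double>& xs) {
    std::vector<double> ys;
    ys.reserve(xs.size());
    for (double x : xs) ys.push_back(f(x));
    return ys;
}
```

Usage would look like `auto f = cumulative_distribution(g);` followed by `tabulate(f, xs)`; the algorithm never needs to know which distribution produced the functor.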
>
> It looks like we might use the accumulator set extractor
> mechanism, and the feature grouping,
> because some features depend on others! Same story as with our stats
> library on top of the accumulator set
> framework.
>
> I hate short forms like P and Q below. They are not universal, and there
> is some dispute among mathematicians about how to name and label things.
> Clear, self-explaining names are much better, because novice users and
> non-domain experts profit a lot from a clear and self-explaining naming
> scheme.
>
> Also, Paul Bristow has forgotten some other aspects that are VERY
> important. For all the distributions the following quantities can be
> calculated as a function of the parametrization, often requiring special
> functions or some numerical work, like inversion or zero search via
> Newton or other methods.
>
> quantile function, i.e. inverse cumulative distribution function
> q(\alpha) = \inf{x : F(x) > \alpha}
>
> density (can be discrete -> time_series library!)
>
>
> expectation E[X]
>
> variance E[X^2] - E[X]^2
>
> k-th moments (2 and higher) E[X^k]
>
> cumulant generating function log E[e^{tX}] (might be
> implemented differently than moment gen. function)
>
> Laplace transform or moment generating function E[e^{tX}]
>
> Probability generating function for discrete probability
> distributions
>
> characteristic function E[e^{izX}]
>
> (log) likelihood function as a function of the
> parameters -> usable for maximum likelihood estimation
>
>
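
As an illustration of the "inversion, zero search" remark above, here is a minimal C++ sketch that obtains a quantile by bisection on a CDF. It assumes a continuous, strictly increasing CDF and a bracket `[lo, hi]` supplied by the caller; `normal_cdf` and `quantile_by_bisection` are hypothetical names, and a production version would more likely use Newton steps or a closed-form inverse where one exists.

```cpp
#include <cassert>
#include <cmath>

// CDF of the standard normal, used here only as a test subject.
inline double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Finds x in [lo, hi] with cdf(x) == alpha by bisection.
// Assumes cdf is continuous and strictly increasing on the bracket.
template <class Cdf>
double quantile_by_bisection(Cdf cdf, double alpha, double lo, double hi,
                             double tol = 1e-12) {
    assert(0.0 < alpha && alpha < 1.0);
    assert(cdf(lo) <= alpha && alpha <= cdf(hi));  // bracket contains the root
    // The iteration cap guards against a tolerance finer than the
    // floating-point spacing at the root.
    for (int i = 0; i < 200 && hi - lo > tol; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (cdf(mid) < alpha) {
            lo = mid;  // root lies to the right of mid
        } else {
            hi = mid;  // root lies at or to the left of mid
        }
    }
    return 0.5 * (lo + hi);
}
```

For example, `quantile_by_bisection(normal_cdf, 0.975, -10.0, 10.0)` approximates the familiar two-sided 95% critical value of the standard normal.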
> Some features do not always exist: for example, the Cauchy distribution
> does not have a mean.
> A Pareto distribution with 1 < \alpha <= 2 has a mean but not a finite
> second moment, and so on....
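
One way an interface could express "this feature may not exist" is to return an optional value. The sketch below assumes C++17 (`std::optional`); `pareto_dist`, `cauchy_dist`, `mean` and `variance` are hypothetical names, and returning NaN or throwing would be equally plausible policies.

```cpp
#include <cassert>
#include <optional>

// Pareto distribution with scale x_m > 0 and shape alpha > 0 (hypothetical).
struct pareto_dist {
    double scale;
    double shape;
    pareto_dist(double x_m, double alpha) : scale(x_m), shape(alpha) {
        assert(x_m > 0.0 && alpha > 0.0);
    }
};

// Cauchy distribution with location and scale (hypothetical).
struct cauchy_dist {
    double location;
    double scale;
};

// The Pareto mean is finite only for shape > 1.
inline std::optional<double> mean(const pareto_dist& d) {
    if (d.shape <= 1.0) return std::nullopt;
    return d.shape * d.scale / (d.shape - 1.0);
}

// The Pareto variance is finite only for shape > 2.
inline std::optional<double> variance(const pareto_dist& d) {
    if (d.shape <= 2.0) return std::nullopt;
    const double a = d.shape;
    return d.scale * d.scale * a / ((a - 1.0) * (a - 1.0) * (a - 2.0));
}

// The Cauchy distribution has no mean for any parameter values.
inline std::optional<double> mean(const cauchy_dist&) {
    return std::nullopt;
}
```

The caller is then forced to check, e.g. `if (auto m = mean(d)) use(*m);`, instead of silently receiving a meaningless number.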
>
> With so many functional aspects of a random variable, they could be
> grouped as features, e.g.:
>
> /* internal */
> typedef accumulator_set<
>     double /* gives the numerical resolution of the real arithmetic */
>   , feature<gauss_density, gauss_distribution_function,
>             gauss_cumulative_distribution, gauss_moments<infinity>......> >
>   gauss_distribution;
>
> // internally we have a lot of feature names, but you can't get around
> // that anyway.
>
> /* usage */
> gauss_distribution g(mean = 0.0, variance = 1.0);
> assert(expectation(g) == 0.0 && variance(g) == 1.0);
>
> UnaryFunction f = CumulativeDistribution(g);
> double p = f(0);
> assert(p == 0.5);
>
> // .... or use f in an algorithm.....
>
> /* likelihood function */
> std::vector<double> vec;
> vec = x1, x2, ..., xn;
> BinaryFunction l = Likelihood(g, samples = vec);
>
> // vec = x_1, ..., x_n
> // l(mu, sigma) = (2 pi sigma^2)^{-n/2} exp( -sum_i (x_i - mu)^2 / (2 sigma^2) )
>
> // .... use l in an algorithm: e.g. give it to an optimizer to find mu
> // and sigma fitting a Gaussian to the sample...
>
> // much more to come....
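
A hedged C++ sketch of the likelihood extractor from the pseudocode above. It returns the Gaussian log-likelihood (rather than the likelihood itself, which underflows quickly for large samples) as a binary function of `(mu, sigma)`. The name `gaussian_log_likelihood` is hypothetical and does not correspond to any existing library function.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Returns the Gaussian log-likelihood of a fixed sample as a binary
// function of (mu, sigma):
//   log L = -(n/2) log(2 pi sigma^2) - sum_i (x_i - mu)^2 / (2 sigma^2)
inline std::function<double(double, double)>
gaussian_log_likelihood(std::vector<double> samples) {
    return [samples](double mu, double sigma) {
        assert(sigma > 0.0);
        const double pi = 3.141592653589793;
        double sum_sq = 0.0;
        for (double x : samples) sum_sq += (x - mu) * (x - mu);
        const double n = static_cast<double>(samples.size());
        return -0.5 * n * std::log(2.0 * pi * sigma * sigma)
               - sum_sq / (2.0 * sigma * sigma);
    };
}
```

The returned functor can be handed to any optimizer that maximizes a function of two variables; for a Gaussian the maximum is at the sample mean and the (biased) sample standard deviation, which gives a simple sanity check.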
>
>
> Internally you still need a lot of "features" but they are nicely
> grouped and exposed as a container to the user.
> The parametrization is transparent. And you can combine it with our
> iterative stats library.
>
> For an idea what should/could be provided the Mathematica statistics
> package is a good example!
> It follows a similar design as indicated above.
>
> How does that sound?
>
> How do you plan to proceed? I might be interested, at an open source
> project level, not on a commercial level.
>
>
> Kind regards
> Daniel Egloff
> Zürcher Kantonalbank, ZEF
> Josefstrasse 222, 8005 Zürich
> Tel. +41 (0) 44 292 45 33, Fax +41 (0) 44 292 45 95
> Postal address: Postfach, 8010 Zürich, http://www.zkb.ch
>
>
> From: David Abrahams <dave_at_boost-consulting.com>
> Date: 08.07.2006 18:48
> To: Matthias Troyer <troyer_at_[hidden]>, Eric Niebler <eric_at_[hidden]>,
>     Daniel Egloff <daniel.egloff_at_[hidden]>, daniel.egloff_at_[hidden]
> Cc:
> Subject: Thought you might have a stake in this...
>
> ----- Message from "John Maddock" <john_at_[hidden]> on Sat, 8
> Jul 2006 17:37:28 +0100 -----
>
> Subject: [math/staticstics/design] How best to name
> statistical functions?
>
>
> Paul Bristow has been toiling away producing some statistical functions on
> top of some of my Math special functions, and we've encountered a bit of a
> naming dilemma that I hope the ever resourceful Boosters can solve for us
> :-)
>
> For a given cumulative distribution function (I'm going to use the
> students-t function as an example below) there are two (or maybe three)
> variations:
>
> P: this is the regular cumulative distribution function, and is a rising
> function in its argument (rises from 0 to 1).
>
> Q: this is 1-P and is also known as the complement of the cumulative
> distribution function. It falls from 1 to 0 over the range of its
> argument.
>
> A: this is less well used and is P-Q or 1-2Q depending upon your point of
> view.
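
To make the three variants concrete, here is a small C++ sketch using the standard normal as a stand-in (an assumption made only because its P and Q can be written with `std::erfc` and `std::erf`; Student's t needs the incomplete beta function). The names `normal_P`, `normal_Q` and `normal_A` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// P: cumulative distribution function, rises from 0 to 1.
inline double normal_P(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Q: complement of P, falls from 1 to 0. Computed directly instead of
// as 1 - P so that very small upper-tail values keep their precision.
inline double normal_Q(double x) {
    return 0.5 * std::erfc(x / std::sqrt(2.0));
}

// A = P - Q = 1 - 2Q. For the symmetric standard normal this is
// erf(x / sqrt(2)).
inline double normal_A(double x) {
    return std::erf(x / std::sqrt(2.0));
}
```

The direct computation of Q is the same reason the standard library offers `erfc` alongside `erf`: far out in the tail P rounds to 1, so `1 - P` would return 0 while Q itself is still representable.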
>
> Naming scheme 1:
> ~~~~~~~~~~~~~~~~
>
> We have the reasonably obvious:
>
> students_t(df,x) : calculates P
> students_t_c(df,x) : calculates Q
>
> However that varies slightly from the existing practice of erf/erfc, which if
> followed here would lead to:
>
> students_t(df,x) : calculates P
> students_tc(df,x) : calculates Q
>
> but the lack of the underscore doesn't look right to me.
>
> Naming Scheme 2:
> ~~~~~~~~~~~~~~~~
>
> How about we call a spade a spade and use:
>
> students_t_P(df,x) : calculates P
> students_t_Q(df,x) : calculates Q
>
> Not pretty, but the P and Q notations are universally used in the
> literature, and of course we could handle the A case as well if that was
> felt to be needed.
>
> It doesn't follow normal Boost all_lower_case_names either, but since lower
> case "p" and "q" have slightly different meanings in the literature (they're
> for values of P and Q) I'm less keen on:
>
> students_t_p(df,x) : calculates P
> students_t_q(df,x) : calculates Q
>
> Wacky Scheme 3:
> ~~~~~~~~~~~~~~~
>
> Both of the above suffer from a rather spectacular explosion of function
> prototypes once you include every variant for each distribution. An
> alternative using named parameters might be:
>
> P(dist=students_t, df=4, x=5.2); // P for 4 degrees freedom and x=5.2
> Q(dist=students_t, df=5, x=20.0); // Q for 5 degrees freedom and x=20.0
>
> But of course internally this would have to forward to something like (1) or
> (2) so it doesn't actually save you any implementation effort, just reduces
> the number of names.
>
> Inverses:
> ~~~~~~~~~
>
> And if that's not enough, we also have inverses:
>
> * Calculate x given degrees of freedom and P.
> * Calculate x given degrees of freedom and Q.
> * Calculate degrees of freedom given x and P.
> * Calculate degrees of freedom given x and Q.
>
> At present we're looking at something like:
>
> students_t_inv(df,p); // Calculate x given degrees of freedom and P.
>
> But the other variants don't have obvious names under this scheme?
>
> So I'm hoping some Boosters can work their usual naming magic :-)
>
> Many thanks,
>
> John.
>
>