Boost logo

Boost :

From: Eric Niebler (eric_at_[hidden])
Date: 2007-01-30 13:35:03


I'm including Matthias Troyer on this, in the hopes he might have some
feedback about your kurtosis suggestion and about your torture test.

John Maddock wrote:
> This is the first part, of what will probably end up a multi-part review.
>
> First the headline - I do think that the library should be included in
> Boost - although I do have some reservations as noted below.

Thanks, John.

>
> This part-review covers the documentation and a few simple tests I've
> conducted. Later I'll try writing some accumulators so that I can comment
> better on the design.
>
> Documentation review
> ~~~~~~~~~~~~~~~~~~~
>
> Tutorial:
> ~~~~~~~~~
>
> I'm fairly sure that the whole accumulator_set_wrapper business could be
> completely avoided with a TR1 conforming reference_wrapper - which
> unfortunately we don't have in Boost just yet - none the less this may be
> worthwhile mentioning.

Really? I guess I'm not familiar enough with TR1 reference_wrapper. It
forwards operator(), up to N args?

> The "push all the data in this sequence" idiom seems like a very common use
> case to me, is it not worthwhile supporting this directly via a couple of
> member functions:
>
> template <class It>
> accumulator_set& accumulator_set::push_range(It begin, It end);
>
> template <class Seq>
> accumulator_set& accumulator_set::push_sequence(Seq const& s);

Good idea.

> Typo in "This will calcualte the left"

Yep, thanks.

>
> Statistic accumulators:
> ~~~~~~~~~~~~~~~~~~~~~~
>
> Kurtosis
> ~~~~~~~~
>
> This is going to confuse folks - I know it's got Paul and I into trouble
> already! You're returning the "Kurtosis Excess" rather than the "Kurtosis
> Proper". In the Math Toolkit stats functions we called these functions
> "kurtosis" and "kurtosis_excess", and I'd favour similar naming here, or at
> least a strong warning to the effect that there are multiple definitions in
> the literature - otherwise unwary users will get into all kinds of trouble.

Matthias?

> Differing implementations?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> feature_of and as_feature get mentioned in the tutorial, along with a
> sentence to say that there can be different implementations of the same
> feature, but there is no explanation of how this works, and the links to the
> reference pages for as_feature and feature_of are distinctly detail free :-(

Yes, I need to document these.

>
> This all leads on to the algorithms used for the mean and variance etc: I
> assume that these are of the "naive" variety? Perfectly satisfactory for
> simple cases, but there are well known pathological situations where these
> fail.

Very naive, yes.

>
> I'm also missing accessors for: the standard deviation, "unbiased" variance,
> and the unbiased (N-1) standard deviation. I realise I could write these
> myself, but they would seem to be common enough cases to warrant inclusion
> in the library. As it stands, I had to jump through hoops to get data I
> could compare with "known good values".

These would be nice to have. I'm gladly accepting patches. :-)

> BTW, I assume that there are no accumulators for Autocorrelation Coefficient
> calculation?

(This is the point at which I reveal my statistical naiveté.) Huh?

>
> Which leads on to a quick comparison I ran against the "known good" data
> here: http://www.itl.nist.gov/div898/strd/univ/homepage.html The test
> program is attached, and outputs the relative error in the statistics
> calculated. Each test is progressively harder than the previous one, the
> output is:
>
> PI data:
> Error in mean is: 0
> Error in SD is: 3.09757e-016
>
> Lottery data:
> Error in mean is: 6.57202e-016
> Error in SD is: 0
>
> Accumulator 2 data:
> Error in mean is: 9.25186e-015
> Error in SD is: 2.71685e-012
>
> Accumulator 3 data:
> Error in mean is: 5.82076e-016
> Error in SD is: 0.0717315
>
> Accumulator 4 data:
> Error in mean is: 9.87202e-015
> Error in SD is: -1.#IND
>
> As you can see the calculated standard deviation gets progressively worse,
> in the final case, the computed variance is actually -2, which is clearly
> non-sensical (and hence the NaN when using it to compute the standard
> deviation) :-(
>
> I haven't tried to debug these cases: stepping through the accumulator code
> is not something I would recommend to anyone frankly (having tried it
> previously).
>
> No doubt other torture tests could be constructed for the mean -
> exponentially distributed data where you see one or two large values,
> followed by a large number of very tiny values springs to mind - this is the
> sort of situation that the Kahan summation algorithm gets employed for.
>
> I'm not sure what one should do about this. These certainly illustrate the
> kinds of trap that the unwary can fall into. It's particularly a problem
> for the "scientist in a rush" who may not be either a C++ or statistical
> expert (whether (s)he thinks they are or not!) but needs a library to
> quickly and reliably calculate some stats. Better documentation would
> certainly help, but guidance on (or provision of - perhaps by default)
> better algorithms might be welcome as well.

I'm hoping that Matthias can comment on your test and results.
(Matthias, John posted his test in later msgs to the boost-devel list.)
I implemented the accumulators very naively according to equations
provided by Matthias. It's possible I needed to be more clever with the
implementations, or that Matthias was unconcerned (or unaware) of the
way the statistical accumulators would perform on torture tests such as
these.

The framework accommodates many different implementations of each
feature. The idea was to allow quick-and-dirty approximations as well
has slower, more accurate implementations. I would LOVE for a
statistical expert to contribute some highly accurate statistical
accumulators. Perhaps they can be based on the good work you've done in
this area. (Nudge, nudge. :-)

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk