From: Matthias Troyer (troyer_at_[hidden])
Date: 2007-01-30 15:58:53


On Jan 30, 2007, at 7:35 PM, Eric Niebler wrote:

>
>
>> Statistic accumulators:
>> ~~~~~~~~~~~~~~~~~~~~~~
>> Kurtosis
>> ~~~~~~~~
>> This is going to confuse folks - I know it's got Paul and I into
>> trouble already! You're returning the "Kurtosis Excess" rather
>> than the "Kurtosis Proper". In the Math Toolkit stats functions
>> we called these functions "kurtosis" and "kurtosis_excess", and
>> I'd favour similar naming here, or at least a strong warning to
>> the effect that there are multiple definitions in the literature -
>> otherwise unwary users will get into all kinds of trouble.
>
>
> Matthias?

OK with me.
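
To make the two definitions concrete, here is a minimal standalone sketch. It does not use the accumulators library or the Math Toolkit; the helper names (central_moment, kurtosis, kurtosis_excess) are hypothetical and only echo the naming proposed above. It assumes the population (divide-by-n) moments and a sample that is not constant.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: k-th central moment, (1/n) * sum((x - mean)^k).
double central_moment(const std::vector<double>& xs, int k) {
    assert(!xs.empty());
    double mean = 0.0;
    for (double x : xs) mean += x;
    mean /= static_cast<double>(xs.size());

    double acc = 0.0;
    for (double x : xs) {
        double term = 1.0;
        for (int i = 0; i < k; ++i) term *= (x - mean);
        acc += term;
    }
    return acc / static_cast<double>(xs.size());
}

// "Kurtosis proper": m4 / m2^2, which is 3 for a normal distribution.
double kurtosis(const std::vector<double>& xs) {
    const double m2 = central_moment(xs, 2);
    return central_moment(xs, 4) / (m2 * m2);
}

// "Kurtosis excess": kurtosis proper minus 3, which is 0 for a normal
// distribution.
double kurtosis_excess(const std::vector<double>& xs) {
    return kurtosis(xs) - 3.0;
}
```

For the sample 1, 2, 3, 4, 5 the second and fourth central moments are 2 and 6.8, so the kurtosis proper is 1.7 and the kurtosis excess is -1.3. The two always differ by exactly 3, which is why mixing them up is so easy to miss.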

>> This all leads on to the algorithms used for the mean and variance
>> etc: I assume that these are of the "naive" variety? Perfectly
>> satisfactory for simple cases, but there are well known
>> pathological situations where these fail.
>
>
> Very naive, yes.

Actually, tag::variance is the naive implementation, while
tag::variance(immediate) is a more accurate implementation.
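
The difference between the two kinds of estimator can be sketched outside the library. The code below is only an illustration under stated assumptions: naive_variance is the textbook sum and sum-of-squares formula, and welford_variance is the well-known Welford running update, which I am using as a stand-in for "a more accurate implementation". It is not necessarily the equation that variance(immediate) uses, and both function names are hypothetical. Both return the population (divide-by-n) variance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive estimator: mean of the squares minus the square of the mean.
// The two terms can be huge and nearly equal, so the subtraction can
// cancel away most or all of the significant digits.
double naive_variance(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0.0, sum_sq = 0.0;
    for (double x : xs) {
        sum += x;
        sum_sq += x * x;
    }
    const double n = static_cast<double>(xs.size());
    const double mean = sum / n;
    return sum_sq / n - mean * mean;
}

// Welford-style running update: keeps the running mean and the running
// sum of squared deviations (m2), so only small differences are squared.
double welford_variance(const std::vector<double>& xs) {
    assert(!xs.empty());
    double mean = 0.0, m2 = 0.0;
    std::size_t n = 0;
    for (double x : xs) {
        ++n;
        const double delta = x - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (x - mean);
    }
    return m2 / static_cast<double>(n);
}
```

A small made-up stress case is the four values 1e9 + 4, 1e9 + 7, 1e9 + 13 and 1e9 + 16, whose exact population variance is 22.5. Their squares are of order 1e18, beyond what a 53-bit double can hold exactly, so the naive formula usually loses the answer entirely (it can even come out negative), while the running update only ever squares differences of order 10.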

>> BTW, I assume that there are no accumulators for Autocorrelation
>> Coefficient calculation?
>
>
> (This is the point at which I reveal my statistical naiveté.) Huh?

In the design phase we had discussed support for correlated samples,
and this will be a major extension project once the basic library for
uncorrelated samples is accepted. In addition to autocorrelation
coefficients with arbitrary lags, the accumulators for correlated
data will need to include autocorrelation time estimators and error
estimators for correlated samples. I plan to work on this over the
next year.
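
For readers who have not met the term: the autocorrelation coefficient at lag k measures how strongly a sample correlates with the sample k steps later. A minimal batch sketch follows, assuming the common estimator that normalizes by the total sum of squared deviations. The function name is hypothetical and this is not the planned accumulator design, which would have to work incrementally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: lag-k autocorrelation coefficient
//   r_k = sum_{i=0}^{n-k-1} (x_i - m)(x_{i+k} - m) / sum_{i=0}^{n-1} (x_i - m)^2
// where m is the sample mean. Assumes the series is not constant,
// otherwise the denominator is zero.
double autocorrelation(const std::vector<double>& xs, std::size_t lag) {
    assert(lag < xs.size());
    const std::size_t n = xs.size();

    double mean = 0.0;
    for (double x : xs) mean += x;
    mean /= static_cast<double>(n);

    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double d = xs[i] - mean;
        den += d * d;
        if (i + lag < n) num += d * (xs[i + lag] - mean);
    }
    return num / den;
}
```

With this definition r_0 is 1 by construction, and for the series 1, 2, 3, 4, 5 the lag-1 coefficient is 4/10 = 0.4.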
>
>
>> Which leads on to a quick comparison I ran against the "known
>> good" data here:
>> http://www.itl.nist.gov/div898/strd/univ/homepage.html
>> The test program is attached, and outputs the relative error in
>> the statistics calculated. Each test is progressively harder than
>> the previous one, the output is:
>> PI data:
>> Error in mean is: 0
>> Error in SD is: 3.09757e-016
>> Lottery data:
>> Error in mean is: 6.57202e-016
>> Error in SD is: 0
>> Accumulator 2 data:
>> Error in mean is: 9.25186e-015
>> Error in SD is: 2.71685e-012
>> Accumulator 3 data:
>> Error in mean is: 5.82076e-016
>> Error in SD is: 0.0717315
>> Accumulator 4 data:
>> Error in mean is: 9.87202e-015
>> Error in SD is: -1.#IND
>> As you can see the calculated standard deviation gets
>> progressively worse, in the final case, the computed variance is
>> actually -2, which is clearly non-sensical (and hence the NaN when
>> using it to compute the standard deviation) :-(
>> I haven't tried to debug these cases: stepping through the
>> accumulator code is not something I would recommend to anyone
>> frankly (having tried it previously).
>> No doubt other torture tests could be constructed for the mean -
>> exponentially distributed data where you see one or two large
>> values, followed by a large number of very tiny values springs to
>> mind - this is the sort of situation that the Kahan summation
>> algorithm gets employed for.
>> I'm not sure what one should do about this. These certainly
>> illustrate the kinds of trap that the unwary can fall into. It's
>> particularly a problem for the "scientist in a rush" who may not
>> be either a C++ or statistical expert (whether (s)he thinks they
>> are or not!) but needs a library to quickly and reliably calculate
>> some stats. Better documentation would certainly help, but
>> guidance on (or provision of - perhaps by default) better
>> algorithms might be welcome as well.
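
As a rough illustration of the Kahan summation mentioned above, here is a standalone sketch. It is not code from the accumulators library or from the attached test program; the function names are hypothetical, and relative_error is just one plausible way to compute the kind of figure reported in the output above. It also assumes the compiler is not run with value-changing options such as -ffast-math, which can optimize the compensation away.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain left-to-right summation: small terms can be lost entirely once
// the running sum is much larger than they are.
double naive_sum(const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum;
}

// Kahan (compensated) summation: c carries the low-order bits that were
// rounded away in the previous addition and feeds them back in.
double kahan_sum(const std::vector<double>& xs) {
    double sum = 0.0, c = 0.0;
    for (double x : xs) {
        const double y = x - c;    // apply the stored correction
        const double t = sum + y;  // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover what was lost
        sum = t;
    }
    return sum;
}

// Relative error of a computed value against a known-good reference.
double relative_error(double computed, double expected) {
    assert(expected != 0.0);
    return std::fabs(computed - expected) / std::fabs(expected);
}
```

The case described above, one large value followed by a large number of tiny ones, can be reproduced with 1.0 followed by a million copies of 1e-16. Each tiny term is below half a unit in the last place of 1.0, so in plain double arithmetic the naive sum tends to stay at 1.0, while the compensated sum stays within rounding error of 1.0 + 1e-10.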
>
>
> I'm hoping that Matthias can comment on your test and results.
> (Matthias, John posted his test in later msgs to the boost-devel
> list.) I implemented the accumulators very naively according to
> equations provided by Matthias. It's possible I needed to be more
> clever with the implementations, or that Matthias was unconcerned
> (or unaware) of the way the statistical accumulators would perform
> on torture tests such as these.
>
> The framework accommodates many different implementations of each
> feature. The idea was to allow quick-and-dirty approximations as
> well as slower, more accurate implementations. I would LOVE for a
> statistical expert to contribute some highly accurate statistical
> accumulators. Perhaps they can be based on the good work you've
> done in this area. (Nudge, nudge. :-)

I will take a look, but am a bit busy the next couple of days since I
have to pack everything for the whole family including our baby for a
one-month trip to Santa Barbara. I'll comment in more detail later.
For now, it seems that you used the default variance implementation,
which should be the naive estimator computed from the sum and the sum
of squares. The variance(immediate) feature should be more accurate;
see
libs/accumulators/doc/html/boost/accumulators/impl/immediate_variance_impl.html
for the equation used.

Matthias

