From: Jason D Schmidt (jd.schmidt_at_[hidden])
Date: 2003-02-27 00:01:42


Date: Tue, 25 Feb 2003 11:17:28 +0100
From: Hubert Holin <Hubert.Holin_at_[hidden]>
To: boost_at_[hidden]
Subject: [boost] Re: Any interest in a stats class
Message-ID: <Hubert.Holin-674145.11172725022003_at_[hidden]>
References: <20030224.222936.2848.1.jd.schmidt_at_[hidden]>
Precedence: list
Message: 3
 
Somewhere in the E.U., on 25/02/2003
 
    Hello
 
 
In article <20030224.222936.2848.1.jd.schmidt_at_[hidden]>,
Jason D Schmidt <jd.schmidt_at_[hidden]> wrote:
 
> I know this is well after the discussion on the stats class has ended,
> but I think I have a good idea here.
>
> Scott Kirkwood proposed a class that behaves something like this:
>
> stats myStats;
> for (int i = 0; i < 100; ++i) {
> myStats.add(i);
> }
> cout << "Average: " << myStats.getAverage() << "\n";
> cout << "Max: " << myStats.getMax() << "\n";
> cout << "Standard deviation: " << myStats.getStd() << "\n";
>
> In one of my classes in grad school, I found it quite useful and
> efficient to do statistics on the fly like this, so this stats class
> interests me. Anyway, Scott has already alluded to the point I'm about
> to make. I think it's important and useful for this stats class to
> integrate with the STL well. This example code was inspired by the
> PointAverage example from "Effective STL" p. 161:
>
> #include <algorithm>  // std::for_each
> #include <cmath>      // sqrt
> #include <cstddef>    // size_t
> #include <functional> // std::unary_function
>
> // this class reports statistics
> template <typename value_type>
> class stats
> {
> public:
>     stats(const size_t n, const value_type sum, const value_type sum_sqr):
>         m_n(n), m_sum(sum), m_sum_sqr(sum_sqr)
>     {}
>     value_type sum() const
>     { return m_sum; }
>     value_type mean() const
>     { return m_sum/m_n; }
>     value_type var() const // sample variance (divides by n-1)
>     { return (m_sum_sqr - m_sum*m_sum/m_n) / (m_n-1); }
>     value_type delta() const // aka, standard dev
>     { return sqrt(var()); }
> private:
>     value_type m_n, m_sum, m_sum_sqr;
> };
>
> // this class accumulates results that can be used to
> // compute meaningful statistics
> template <typename value_type>
> class stats_accum: public std::unary_function<const value_type, void>
> {
> public:
>     stats_accum(): n(0), sum(0), sum_sqr(0)
>     {}
>     // use this to operate on each value in a range
>     // (the parameter type is spelled out because argument_type, inherited
>     // from a dependent base class, is not found by unqualified lookup)
>     void operator()(const value_type& x)
>     {
>         ++n;
>         sum += x;
>         sum_sqr += x*x;
>     }
>     stats<value_type> result() const
>     { return stats<value_type>(n, sum, sum_sqr); }
> private:
>     size_t n;
>     value_type sum, sum_sqr;
> };
>
> int main(int argc, char *argv[])
> {
>     typedef float value_type;
>     const size_t n(10);
>
>     value_type f[n] = {0, 2, 3, 4, 5, 6, 7, 8, 9, 8};
>
>     // accumulate stats over a range of iterators
>     stats<value_type> my_stats = std::for_each(f, f+n,
>         stats_accum<value_type>()).result();
>
>     value_type m = my_stats.mean();
>     value_type d = my_stats.delta(); // aka, standard deviation
>
>     return 0;
> }
 
        In this example, what is the advantage over filling a valarray
and using a stat class which uses that as a constructor argument? You
would get sum for free, and hopefully (yeah, right...) operations on
valarrays could be hardware accelerated, whereas direct coding might not
be. That is, at least, one of the ideas I encoded in the file I just
uploaded on Yahoo (statistical_descriptor.h.gz).
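
For comparison, the valarray route described above might look roughly like the sketch below. It is only an illustration: the names valarray_stats and describe are hypothetical and are not taken from statistical_descriptor.h, and dividing by n - 1 for the standard deviation is an assumption carried over from the quoted code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Hypothetical result type: summary statistics of an in-memory sample.
struct valarray_stats
{
    double mean;
    double std_dev; // sample standard deviation (divides by n - 1)
};

// Hypothetical helper: compute the statistics from an already filled valarray.
inline valarray_stats describe(const std::valarray<double>& data)
{
    const std::size_t n = data.size();
    assert(n > 1); // the sample standard deviation needs at least two values

    const double mean = data.sum() / n;            // sum() comes with valarray
    const std::valarray<double> dev = data - mean; // element-wise subtraction
    const double sum_sq_dev = (dev * dev).sum();   // sum of squared deviations

    return { mean, std::sqrt(sum_sq_dev / (n - 1)) };
}
```

This makes two passes over data that must already be stored, which is the trade-off against the single-pass accumulator in the quoted message.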
 
> This seems to be pretty similar to what Scott has proposed, and it turns
> out that this method is very fast. In my tests it has been nearly as
> fast as if we got rid of the classes and used a hand-written loop. It's
> certainly much faster than storing the data in a std::valarray object,
> and using functions that calculate the mean & standard deviation
> separately. This is just a neat application of Scott's idea.
>
> I think this stats class could be pretty useful for scientific computing, and
> in this example it works very well with the STL and has great
> performance. I'd like to see more code like this in Boost, but most of
> my work is numerical. Take my opinion or leave it.
>
> Jason Schmidt
 
        I agree with you that if the size of the population is not
known in advance, then your approach is still usable whereas mine is not
realistic. But in that case you might have to reset the class periodically
(if you are doing statistics on the fly and want to just test a sample).
Your method might also be useful when the amount of data is too big to
fit in memory all at once.
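
A minimal sketch of an accumulator that supports this kind of periodic reset is shown below. It is not the class from the quoted message: the name running_stats is hypothetical, and it swaps the sum and sum-of-squares bookkeeping for Welford's running update, which is assumed here only because it tends to lose less precision on long streams.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical running accumulator; values are never stored.
class running_stats
{
public:
    running_stats() : n_(0), mean_(0.0), m2_(0.0) {}

    // Fold one value into the running totals (Welford's update).
    void operator()(double x)
    {
        ++n_;
        const double delta = x - mean_;
        mean_ += delta / n_;
        m2_ += delta * (x - mean_); // uses the already updated mean
    }

    // Discard everything seen so far, e.g. to start a new sample window.
    void reset() { n_ = 0; mean_ = 0.0; m2_ = 0.0; }

    std::size_t count() const { return n_; }

    double mean() const
    {
        assert(n_ > 0);
        return mean_;
    }

    // Sample standard deviation (divides by n - 1).
    double std_dev() const
    {
        assert(n_ > 1);
        return std::sqrt(m2_ / (n_ - 1));
    }

private:
    std::size_t n_;
    double mean_, m2_;
};
```

Because it is a plain function object, it can be passed to std::for_each in the same way as the quoted stats_accum, or called directly inside a loop that reads data in chunks.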
 
        So, we need classes for sequences, either in memory or via some
iterator, one dimensional or multi dimensional, and we also need classes
for (experimental) densities.
 
        We also need generators for the usual densities. Since we already
have implementations of random, we should hitch our code to it. This
also ties in with the request for special functions such as erf.
 
        Since we now have uBLAS, we can also try to aim for more complex
statistical constructs such as Gaussian Mixture Models, though to train
the Neural Networks which produce them, we also need good optimisation
code, which we lack completely at present (and which in turn usually
needs some linear algebra code).
 
        Anybody want to try to get the COOOL (http://coool.mines.edu/)
people aboard Boost?
 
    See you soon
 
            Hubert Holin
 
------------------------------

I think we're mostly in agreement about the approach to accumulating
statistics. In my past applications (usually Monte Carlo
integration/sims), I generated random numbers on the fly and accumulated
statistics in one loop, rather than keeping them all in memory at once.
Thus, I find this approach very useful.
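
As a rough illustration of that pattern, the sketch below draws samples and accumulates the sum and sum of squares in the same loop, without storing any of them. The function name integrate_unit_interval and the struct mc_estimate are hypothetical, and the use of the C++11 <random> facilities (std::mt19937, std::uniform_real_distribution) is an assumption made for the example rather than what was used originally.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

// Hypothetical result type: the estimate and its standard error.
struct mc_estimate
{
    double value;
    double std_error;
};

// Estimate the integral of f over [0, 1) by plain Monte Carlo sampling.
// Samples are generated and folded into the running sums one at a time.
template <typename F>
mc_estimate integrate_unit_interval(F f, std::size_t samples, unsigned seed)
{
    assert(samples > 1);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    double sum = 0.0, sum_sqr = 0.0;
    for (std::size_t i = 0; i < samples; ++i) {
        const double y = f(uniform(gen));
        sum += y;
        sum_sqr += y * y;
    }

    const double n = static_cast<double>(samples);
    const double mean = sum / n;
    // Sample variance; clamp at zero in case rounding makes it slightly negative.
    const double var = std::max(0.0, (sum_sqr - sum * sum / n) / (n - 1.0));
    return { mean, std::sqrt(var / n) };
}
```

Called with f(x) = x*x, the estimate should land close to 1/3, with the standard error shrinking roughly as one over the square root of the sample count.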

I also agree that function objects that represent common probability
densities would be quite useful. I have recently coded up some for my
own use: gaussian, chi square, poisson, binomial, etc. However, some of
these required the use of special functions like gamma, beta (my special
functions are all very similar to the Numerical Recipes in C code). My
function objects include methods to return the mean, standard dev., and
cumulative distribution function values. This helps a lot for writing
statistical tests (t test, chi square test, etc.). I'll see if I can get
a look at the code you uploaded one of these days. Also, if you're
interested in any statistical stuff I might have, just holler.
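
To give an idea of the shape of such a function object, here is a minimal sketch for the Gaussian case only. It is not the code mentioned above: the class name gaussian_density and its interface are hypothetical, and it leans on std::erfc from C++11 <cmath> instead of a hand-written special function.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical function object for a normal (Gaussian) density.
class gaussian_density
{
public:
    gaussian_density(double mu, double sigma) : mean_(mu), std_dev_(sigma)
    {
        assert(sigma > 0.0);
    }

    // Probability density at x.
    double operator()(double x) const
    {
        const double inv_sqrt_2pi = 0.3989422804014327; // 1 / sqrt(2 * pi)
        const double z = (x - mean_) / std_dev_;
        return inv_sqrt_2pi / std_dev_ * std::exp(-0.5 * z * z);
    }

    double mean() const { return mean_; }
    double std_dev() const { return std_dev_; }

    // Cumulative distribution function: P(X <= x).
    double cdf(double x) const
    {
        return 0.5 * std::erfc(-(x - mean_) / (std_dev_ * std::sqrt(2.0)));
    }

private:
    double mean_, std_dev_;
};
```

A statistical test can then be written against cdf() directly, for example by comparing 1 - cdf(z) with a chosen significance level.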

Jason Schmidt


