Boost :

From: Hubert Holin (Hubert.Holin_at_[hidden])
Date: 2003-02-25 05:17:28


Somewhere in the E.U., 25/02/2003

    Hello

In article <20030224.222936.2848.1.jd.schmidt_at_[hidden]>,
 Jason D Schmidt <jd.schmidt_at_[hidden]> wrote:

> I know this is well after the discussion on the stats class has ended,
> but I think I have a good idea here.
>
> Scott Kirkwood proposed a class that behaves something like this:
>
> stats myStats;
> for (int i = 0; i < 100; ++i) {
> myStats.add(i);
> }
> cout << "Average: " << myStats.getAverage() << "\n";
> cout << "Max: " << myStats.getMax() << "\n";
> cout << "Standard deviation: " << myStats.getStd() << "\n";
>
> In one of my classes in grad school, I found it quite useful and
> efficient to do statistics on the fly like this, so this stats class
> interests me. Anyway, Scott has already alluded to the point I'm about
> to make. I think it's important and useful for this stats class to
> integrate with the STL well. This example code was inspired by the
> PointAverage example from "Effective STL" p. 161:
>
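> #include <algorithm>  // std::for_each
> #include <cmath>      // sqrt
> #include <cstddef>    // size_t
> #include <functional> // std::unary_function
>
> // this class reports statistics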
> template <typename value_type>
> class stats
> {
> public:
> stats(const size_t n, const value_type sum, const value_type
> sum_sqr):
> m_n(n), m_sum(sum), m_sum_sqr(sum_sqr)
> {}
> value_type sum() const
> { return m_sum; }
> value_type mean() const
> { return m_sum/m_n; }
> value_type var() const // sample variance
> { return (m_sum_sqr - m_sum*m_sum/m_n) / (m_n-1); }
> value_type delta() const // aka, standard dev
> { return sqrt(var()); }
> private:
> value_type m_n, m_sum, m_sum_sqr;
> };
>
> // this class accumulates results that can be used to
> // compute meaningful statistics
> template <typename value_type>
> class stats_accum: public std::unary_function<const value_type, void>
> {
> public:
> stats_accum(): n(0), sum(0), sum_sqr(0)
> {}
> // use this to operate on each value in a range
> void operator()(const value_type& x)
> {
> ++n;
> sum += x;
> sum_sqr += x*x;
> }
> stats<value_type> result() const
> { return stats<value_type>(n, sum, sum_sqr); }
> private:
> size_t n;
> value_type sum, sum_sqr;
> };
>
> int main(int argc, char *argv[])
> {
> typedef float value_type;
> const size_t n(10);
>
> float f[n] = {0, 2, 3, 4, 5, 6, 7, 8, 9, 8};
>
> // accumulate stats over a range of iterators
> stats<value_type> my_stats = std::for_each(f, f+n,
> stats_accum<value_type>()).result();
>
> value_type m = my_stats.mean();
> value_type d = my_stats.delta(); // aka, standard deviation
>
> return 0;
> }

        In this example, what is the advantage over filling a valarray
and passing it to the constructor of a stats class? You
would get sum for free, and hopefully (yeah, right...) operations on
valarrays could be hardware accelerated, whereas direct coding might not
be. That is, at least, one of the ideas I encoded in the file I just
uploaded to Yahoo (statistical_descriptor.h.gz).
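
To make the valarray alternative concrete, here is a minimal sketch of a
stats class that takes a filled std::valarray as its constructor argument.
The class name valarray_stats and its member names are hypothetical and are
not the interface of statistical_descriptor.h; the sketch only assumes the
standard std::valarray operations (sum() and element-wise arithmetic).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Hypothetical illustration: statistics computed once from a filled valarray.
template <typename T>
class valarray_stats
{
public:
    explicit valarray_stats(const std::valarray<T>& data)
        : m_n(data.size()), m_sum(0), m_mean(0), m_sum_sq_dev(0)
    {
        assert(m_n > 0);                        // sum() needs a non-empty array
        m_sum  = data.sum();                    // the "free" sum
        m_mean = m_sum / static_cast<T>(m_n);
        const std::valarray<T> dev(data - m_mean);  // element-wise deviations
        const std::valarray<T> sq(dev * dev);       // element-wise squares
        m_sum_sq_dev = sq.sum();
    }

    std::size_t count() const { return m_n; }
    T sum() const             { return m_sum; }
    T mean() const            { return m_mean; }

    // Sample variance (divides by n - 1), so at least two values are needed.
    T variance() const
    {
        assert(m_n > 1);
        return m_sum_sq_dev / static_cast<T>(m_n - 1);
    }

    T std_dev() const { return std::sqrt(variance()); }

private:
    std::size_t m_n;
    T m_sum, m_mean, m_sum_sq_dev;
};
```

Because the deviations are taken from the mean before squaring, this version
makes two passes over the data, which is only possible when the whole
population is already in memory.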

> This seems to be pretty similar to what Scott has proposed, and it turns
> out that this method is very fast. In my tests it has been nearly as
> fast as if we got rid of the classes and used a hand-written loop. It's
> certainly much faster than storing the data in a std::valarray object,
> and using functions that calculate the mean & standard deviation
> separately. This is just a neat application of Scott's idea.
>
> I think this stats class could be pretty useful for scientific computing, and
> in this example it works very well with the STL and has great
> performance. I'd like to see more code like this in Boost, but most of
> my work is numerical. Take my opinion or leave it.
>
> Jason Schmidt

        I agree with you that if the size of the population is not
known in advance, your approach is still usable whereas mine is not realistic.
But in that case you might have to reset the accumulator periodically (if you
are doing statistics on the fly and only want to test a sample). Your
method might also be useful when the data set is too big to fit in
memory all at once.
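
A minimal sketch of such a resettable on-the-fly accumulator follows. The
name running_stats and its members are hypothetical. Note that it does not
use the sum and sum-of-squares bookkeeping from the quoted code: it swaps in
Welford's single-pass update, which tends to be better behaved numerically
on long streams.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical illustration: single-pass statistics with a periodic reset.
template <typename T>
class running_stats
{
public:
    running_stats() : m_n(0), m_mean(0), m_m2(0) {}

    // Feed one value; usable as the functor of std::for_each.
    void operator()(const T& x)
    {
        ++m_n;
        const T delta = x - m_mean;
        m_mean += delta / static_cast<T>(m_n);
        m_m2   += delta * (x - m_mean);   // uses the updated mean (Welford)
    }

    // Forget everything seen so far, e.g. before testing a new sample.
    void reset()
    {
        m_n = 0;
        m_mean = T(0);
        m_m2 = T(0);
    }

    std::size_t count() const { return m_n; }

    T mean() const
    {
        assert(m_n > 0);
        return m_mean;
    }

    // Sample variance (divides by n - 1).
    T variance() const
    {
        assert(m_n > 1);
        return m_m2 / static_cast<T>(m_n - 1);
    }

    T std_dev() const { return std::sqrt(variance()); }

private:
    std::size_t m_n;
    T m_mean, m_m2;   // running mean and sum of squared deviations
};
```

Only three numbers are stored however many values are fed in, so the data
never needs to fit in memory at once. Since std::for_each returns a copy of
its functor, the object can also be used in the same way as stats_accum in
the quoted example.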

        So, we need classes for sequences, either in memory or via some
iterator, one-dimensional or multidimensional, and we also need classes
for (experimental) densities.

        We also need generators for the usual densities. Since we already
have the Boost random library, we should hitch our code to it. This
also ties in with the request for special functions such as erf.
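
As a rough sketch of how a generator, an experimental density and erf fit
together, the example below draws samples from a normal density and compares
the experimental cumulative distribution with the exact one written in terms
of erf. It is an assumption-laden illustration: to stay free of external
dependencies it uses the C++11 standard <random> header and std::erf rather
than the Boost library itself, and the function names normal_cdf,
draw_normal and empirical_cdf are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Exact CDF of a normal density, expressed through erf.
inline double normal_cdf(double x, double mean = 0.0, double sigma = 1.0)
{
    return 0.5 * (1.0 + std::erf((x - mean) / (sigma * std::sqrt(2.0))));
}

// Draw `count` samples from a normal density using a seeded engine.
inline std::vector<double> draw_normal(std::size_t count, double mean,
                                       double sigma, unsigned seed)
{
    std::mt19937 engine(seed);
    std::normal_distribution<double> density(mean, sigma);
    std::vector<double> samples;
    samples.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        samples.push_back(density(engine));
    return samples;
}

// Experimental CDF: fraction of the samples that are <= x.
inline double empirical_cdf(const std::vector<double>& samples, double x)
{
    assert(!samples.empty());
    std::size_t below = 0;
    for (std::size_t i = 0; i < samples.size(); ++i)
        if (samples[i] <= x)
            ++below;
    return static_cast<double>(below) / static_cast<double>(samples.size());
}
```

With a large enough sample, empirical_cdf(samples, x) should land close to
normal_cdf(x) for the same mean and sigma, which is one simple way to check
a generator against the density it claims to produce.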

        Since we now have uBLAS, we can also try to aim for more complex
statistical constructs such as Gaussian Mixture Models, though to train
the Neural Networks which produce them, we also need good optimisation
code, which we lack completely at present (and which in turn usually
needs some linear algebra code).

        Anybody want to try to get the COOOL (http://coool.mines.edu/)
people aboard Boost?

    See you soon

            Hubert Holin


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk