From: Scott Kirkwood (scott_at_[hidden])
Date: 2003-02-11 11:23:38


Hi all,
I have a small family of statistics classes which I have used from time
to time. The one I use most often is simply called stats.
Here's an example of its use:
...
    stats myStats;
    for (int i = 0; i < 100; ++i) {
        myStats.add(i);
    }
    cout << "Average: " << myStats.getAverage() << "\n";
    cout << "Max: " << myStats.getMax() << "\n";
    cout << "Standard deviation: " << myStats.getStd() << "\n";

I often put it in an existing loop to monitor some variables. For
example, I might use it with a timer class to get the average and
standard deviation of bytes per second.

The actual add() function is simple and fast. Here's the complete
code:

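    void stats::add(const put_t& x)
    {
        // Running sums from which the other statistics are derived.
        m_Sum += x;
        m_Sum2 += x * x;

        // Remember the smallest value and the position where it was seen.
        if (x < m_Min)
        {
            m_Min = x;
            m_MinIndex = m_nCount;
        }

        // Remember the largest value and the position where it was seen.
        if (x > m_Max)
        {
            m_Max = x;
            m_MaxIndex = m_nCount;
        }

        ++m_nCount;
    }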

By keeping track of just these variables I can (at any point) calculate
the following statistics:
    return_t getAverage()
    return_t getStd()
    return_t getVariance()
    return_t getStdErrorOfMean()
    count_t getCount()
    return_t getSum()
    put_t getRange()
    put_t getMax()
    count_t getMaxIndex()
    put_t getMin()
    count_t getMinIndex()
    return_t getSumOfSquares()
    return_t getCoefficientOfVariation()
    return_t getRootMeanSquare()
Because the class stores only these running totals, I can ask for the
statistics at any time. So, for example, I can print out the average
number of bytes per second every few seconds as the program executes.
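
To illustrate how the getters listed above can be derived from just the sum, the sum of squares and the count, here is a minimal sketch. It is not the original class: the struct running_stats and its member names are hypothetical, the formulas are the usual textbook ones, and it assumes the population variance (dividing by n), whereas the real getVariance() may well divide by n - 1. Empty input returns 0 purely by assumption.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the stats class; only the running totals are kept.
struct running_stats {
    double sum = 0.0;       // plays the role of m_Sum
    double sum2 = 0.0;      // plays the role of m_Sum2
    std::size_t count = 0;  // plays the role of m_nCount

    void add(double x) {
        sum += x;
        sum2 += x * x;
        ++count;
    }

    // Mean = sum / n (0 for an empty sample, by assumption).
    double average() const { return count ? sum / count : 0.0; }

    // Population variance = E[x^2] - (E[x])^2, clamped at 0 to absorb rounding.
    double variance() const {
        if (count == 0) return 0.0;
        const double mean = average();
        const double v = sum2 / count - mean * mean;
        return v > 0.0 ? v : 0.0;
    }

    double std_dev() const { return std::sqrt(variance()); }

    // Standard error of the mean = std_dev / sqrt(n).
    double std_error_of_mean() const {
        return count ? std_dev() / std::sqrt(static_cast<double>(count)) : 0.0;
    }

    // Coefficient of variation = std_dev / mean (0 when the mean is 0, by assumption).
    double coefficient_of_variation() const {
        const double mean = average();
        return mean != 0.0 ? std_dev() / mean : 0.0;
    }

    // Root mean square = sqrt(sum of squares / n).
    double root_mean_square() const {
        return count ? std::sqrt(sum2 / count) : 0.0;
    }
};
```

One caveat with this approach: subtracting the squared mean from the mean of squares can lose precision when the mean is large compared with the spread of the data.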

If you already have an array of values, you can use the standard
for_each() algorithm, since the class overloads operator():
    myStats = std::for_each(values, values + nCount, myStats);
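
The assignment in that line matters because std::for_each takes the function object by value and returns the copy it worked on. A minimal sketch of the pattern, using a hypothetical sum_counter in place of the real stats class (the helper feed() is also just for illustration):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical accumulator: operator() lets std::for_each call it per element.
struct sum_counter {
    double sum = 0.0;
    std::size_t count = 0;

    void operator()(double x) {
        sum += x;
        ++count;
    }
};

// Runs every element of [values, values + n) through a copy of acc and
// returns that updated copy.
inline sum_counter feed(const double* values, std::size_t n, sum_counter acc) {
    // for_each returns the function object it applied, so the result must be
    // captured; the object passed in is not modified.
    return std::for_each(values, values + n, acc);
}
```

Calling std::for_each(values, values + nCount, myStats) without using the return value would leave myStats untouched, since only the internal copy sees the data.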

This post is really just to gauge interest, but I also have a few other
similar classes:
    Linear Regression - The "least-squares line" or "estimated
        regression line"; uses two (x, y) stats.
    Frequency Distribution - Gathers frequency statistics from a stream
        of values; you indicate the range and the number of buckets.
    Is Sorted - Having seen the data so far, does it appear to be
        sorted (ascending or descending)?
    HarmonicStats, CGeometricStats - Similar to the stats class above.
    RollingAverage - Keeps track of the last N values.
And a stats class I would be willing to start on is:
    TopN - Keeps track of the top or bottom N values.
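
To give a feel for what the Frequency Distribution idea could look like, here is a rough sketch. None of it is taken from the actual class: the name frequency_distribution, its interface, the equal-width buckets over a half-open range [lo, hi), and the separate out-of-range counter are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: counts values from a stream into equal-width buckets.
class frequency_distribution {
public:
    // Assumes lo < hi and buckets > 0.
    frequency_distribution(double lo, double hi, std::size_t buckets)
        : lo_(lo), hi_(hi), counts_(buckets, 0), out_of_range_(0) {}

    void add(double x) {
        // Written so that NaN also falls into the out-of-range counter.
        if (!(x >= lo_ && x < hi_)) {
            ++out_of_range_;
            return;
        }
        std::size_t i = static_cast<std::size_t>(
            (x - lo_) / (hi_ - lo_) * counts_.size());
        // Guard against rounding pushing a value just below hi past the end.
        if (i >= counts_.size()) i = counts_.size() - 1;
        ++counts_[i];
    }

    // Number of values seen in the given bucket (throws if bucket is invalid).
    std::size_t count(std::size_t bucket) const { return counts_.at(bucket); }

    // Number of values that fell outside [lo, hi).
    std::size_t out_of_range() const { return out_of_range_; }

private:
    double lo_;
    double hi_;
    std::vector<std::size_t> counts_;
    std::size_t out_of_range_;
};
```

Like add() in the stats class, each update here is constant time, so it could sit in an existing loop in the same way.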

-Scott Kirkwood

