
From: Brook Milligan (brook_at_[hidden])
Date: 2007-08-15 17:47:58


In order to better motivate the need for the Boost Probability
library, I have updated the documentation, which is accessible at

         http://biology.nmsu.edu/software/probability/

Although this constitutes a new release, the only difference is in the
documentation. As a result, the contents of v0.2.2 in the Boost Vault
still match the most recent release exactly, and I have not uploaded a
new copy.

The new motivational example is taken from the problem of ascertaining
the long-term trend of global climate. One database used to assess
this is available from the NOAA National Climate Data Center
(http://www.ncdc.noaa.gov/oa/climate/ghcn-monthly/index.php). It
contains monthly data for thousands of stations worldwide, in many
cases for decades. Today's version, for example, contains 590,543
records of mean temperature. A typical likelihood calculation
evaluating a model of climate would involve a product of likelihoods
across all of these records, almost certainly yielding a result on the
order of 10^{-600,000} or less. Such numbers cannot be handled using
typical floating point representations, so specialized solutions of
some form are required. The natural method is to accumulate the sum
of the logarithms of the likelihoods, rather than their product,
across the dataset. This keeps the values within representable
bounds, but it means a typical program manipulates several distinct
kinds of values (probabilities, likelihoods, and log likelihoods) at
once. If these are all represented by native types, such as double,
it is easy to lose track of the fact that they carry different
semantics.
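
To make the numerical point concrete, here is a small standalone
illustration in plain C++ (my own sketch, independent of the library;
the per-record likelihood of 0.1 is just a stand-in value):

    // Illustrates the underflow problem and the log-space alternative.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const int n = 600000;           // roughly the number of records
        const double likelihood = 0.1;  // stand-in per-record likelihood

        double product = 1.0;           // naive product of likelihoods
        double log_sum = 0.0;           // sum of log likelihoods

        for (int i = 0; i < n; ++i) {
            product *= likelihood;            // underflows to 0 long before the end
            log_sum += std::log(likelihood);  // stays comfortably representable
        }

        std::printf("product of likelihoods: %g\n", product);  // prints 0
        std::printf("sum of log likelihoods: %g\n", log_sum);  // about -1.38e+06
        return 0;
    }

The product underflows to zero after only a few hundred records, while
the log-space sum remains an ordinary double throughout.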

A real solution to this problem would include modules that calculate
the probability of each individual data record and modules that
accumulate that information across the records. The problem is
complex enough that each of these responsibilities would realistically
be divided across many units, and it would not be unreasonable for
development to be divided among many programmers. In such situations
it is all too easy to lose track of which semantics apply to a
specific value when the only information available in the code is the
data type (e.g., double), which provides little help, together with
some (perhaps untrustworthy) comments that may or may not be read and
in any case cannot affect the compiler.

Using the Probability library, one can encode the exact semantics in
the type system in a way that lends itself to generic programming.
The resulting clarity, safety, and maintainability are retained
regardless of how large the code base becomes and how the operations
are distributed across modules and programmers.
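
As a rough sketch of the idea only (the names below are illustrative
and are not the library's actual interface), distinct types can make
the semantics explicit, so that mixing them up is a compile-time error
rather than a silent bug:

    #include <cmath>

    struct probability {
        double value;                       // invariant: 0 <= value <= 1
        explicit probability(double v) : value(v) {}
    };

    struct log_likelihood {
        double value;                       // natural log of a likelihood
        explicit log_likelihood(double v) : value(v) {}
    };

    // The only conversion between the two is explicit and self-documenting.
    log_likelihood log_of(probability p)
    {
        return log_likelihood(std::log(p.value));
    }

    // Accumulation across records is addition of log likelihoods...
    log_likelihood operator+(log_likelihood a, log_likelihood b)
    {
        return log_likelihood(a.value + b.value);
    }

    int main()
    {
        log_likelihood total(0.0);
        total = total + log_of(probability(0.1));  // fine
        // total = total + probability(0.1);       // caught at compile time
        return 0;
    }

With plain doubles, the commented-out line would compile and silently
corrupt the accumulation; with distinct types, the compiler enforces
the semantics.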

As a result of these features, I feel that this library makes a
significant contribution to solving a well-defined set of problems
that occur in certain types of scientific programming and modeling. I
hope you will take a serious look at its capabilities and provide me
with further feedback. I am especially interested in improving the
portability of the code and need testers with access to compilers
other than g++.

I look forward to your comments, suggestions, and general discussion.
Thank you.

Cheers,
Brook

