 # Boost :

From: Martin Knoblauch Revuelta (mkrevuelta_at_[hidden])
Date: 2007-02-01 13:29:29

> > From: Matthias Troyer <troyer_at_[hidden]>
> >> [..] even trickier will be median estimators where I currently
> >> even do not see how this should be done without storing most of the
> >> time series.

> james.jones_at_[hidden] wrote:
> > Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.

On 1/31/07, Eric Niebler <eric_at_[hidden]> wrote:
> To calculate an exact median, you're correct. Of course, nothing stops
> you from writing an accumulator that calculates the exact median by
> storing all the samples seen so far. [..]

As it has been commented around here, you just need to store unique
values, and it can be done with O(log N) time per sample, and O(N)
extra storage (N is not the number of samples, but the number of
unique values).

IMHO, the best way to do it is using a tree that is:
- balanced
- sorted (for binary search)
- augmented with counters that are added bottom-up, allowing random
access (rank tree or statistic order tree)

I send attached a sample program that solves this problem using a Rank
List (proposed for inclusion in Boost some months ago):
http://boost.cvs.sourceforge.net/*checkout*/boost-sandbox/boost-sandbox/libs/rank_list/doc/index.html

The most interesting part is:
-----------------------------------------------------------------------
samples_container A; // Unique values only
samples_container::iterator p;

// [...]

sample = rand (); // Get next sample

found = A.binary_search (sample, p); // Search for it in the
// container
A.npsv_set_width (p, A.npsv_width(p)+1); // count ++
else // Not there?
A.insert (p, sample); // count = 1

total = A.npsv_width ();
// Go to the middle (50%) of the
p = A.npsv_at_pos (total/2); // Non Proportional Sequence
median = *p; // View (the median is there)
-----------------------------------------------------------------------

An accumulator storing a tree of this kind would be able to calculate
not only the median, but any desired percentile in O(log N) time. The
drawback is that:
- accumulating a sample would take O(log N) too
- O(N) extra storage

Combining two accumulators of this kind (containing all unique values
in a tree) would require O(N+M) time, where N and M are the numbers of
unique values stored in the respective accumulators. Though, I guess
that the median (or any percentile) might be calculated in
O(square(log N)+square(log M)) without actually merging... I'm not
sure...

The extra storage is a very important drawback. If an algorithm like
this one was ever included in the accumulators, it should be optional
and disabled by default. Though, storing the values in a Rank List
would allow very precise calculations. Note that a sequential sum is
prone to cumulative errors, since the first samples pass through O(N)
operations. In a Rank List every value would pass through about O(log
N) operations.

Regards,
Martin