
Boost : 
From: Hervé Brönnimann (hervebronnimann_at_[hidden])
Date: 20070201 20:34:10
Martin:
On Feb 1, 2007, at 1:29 PM, Martin Knoblauch Revuelta wrote:
> IMHO, the best way to do it is using a tree that is:
>  balanced
>  sorted (for binary search)
>  augmented with counters that are added bottomup, allowing random
> access (rank tree or statistic order tree)
>
> I send attached a sample program that solves this problem using a Rank
> List (proposed for inclusion in Boost some months ago):
> http://boost.cvs.sourceforge.net/*checkout*/boostsandbox/boost
> sandbox/libs/rank_list/doc/index.html
If the purpose is calculating the median, forgive me for being the
academic professor. But that is an overkill. You should maintain
two binary heaps of equal size (plus or minus one) with the lower
half being a max heap, and the upper half being a minheap. Upon
adding an element, you insert it to the appropriate heap, and if one
heap grows larger than the other by more than one, you can rebalance
the two heaps by transferring the min of the second heap into the
first heap, or the max of the first heap into the second. The
runtime is still O(log N) per element and the storage O(N) but all
storage is now sequential, using two std::vector, all the algorithms
are provided by std:: (you can even use an std::priority_queue out of
the box), and being much more compact, the constants will be much lower.
Note that there are also minmax heaps that work in O(log N) time per
insertion or extractmin or extractmax, and I should really provide
one in my boost.Minmax library. They also work by using an
std::vector storage and using an implicit binary tree. That enables
to have exact quantiles in O(t+log N) time per insertion.
Maintaining t quantiles means t separate vectors, so the overhead
would be up to double the storage (assuming doubling rule for
expanding vectors) and you'd have to compare that to a r/b tree
node. For elements such as integers, double still compares favorably
to binary tree nodes which has an overhead of three pointers and a
rank per integer.
> An accumulator storing a tree of this kind would be able to calculate
> not only the median, but any desired percentile in O(log N) time. The
> drawback is that:
>  accumulating a sample would take O(log N) too
>  O(N) extra storage
Agreed. If you don't want just the median, but have the rank
selection operation for any rank, a tree would be also my way to go.
However, in that case, I wouldn't use the accumulator library to
start with. Even for applying this to a tail of a distribution, I
humbly submit a sorted vector would be better (assuming tail is
small, i.e. constant size) due to compactness in the code, locality
of reference, etc. Then there is the issue of distinct numbers,
you'd have to decide for your application if N is much smaller than
the total number of multiplicities...
> Combining two accumulators of this kind (containing all unique values
> in a tree) would require O(N+M) time, where N and M are the numbers of
> unique values stored in the respective accumulators. Though, I guess
> that the median (or any percentile) might be calculated in
> O(square(log N)+square(log M)) without actually merging... I'm not
> sure...
You're correct. Given two sorted ranges (which your trees simulate
with an overhead for random access to log(N) or log(M)), by applying
binary search you can find the median of the union in an extra
multiplicative logarithmic factor.
> The extra storage is a very important drawback. If an algorithm like
> this one was ever included in the accumulators, it should be optional
> and disabled by default.
I agree with this. Once you provide it, people will use it. Better
make sure they darn know what they're getting into...
> Though, storing the values in a Rank List
> would allow very precise calculations. Note that a sequential sum is
> prone to cumulative errors, since the first samples pass through O(N)
> operations. In a Rank List every value would pass through about O(log
> N) operations. Regards, Martin
On a side note, maintaining the elements in order allows to recompute
the sum (from smaller to larger in absolute values) with the same
precision as the underlying floating point, i.e. 2^52 for IEEE
doubles. If precision is a real issue, that can be an incentive. On
the other hand, if N is small, I'd much rather maintain the sum in
exact representation (using for instance the fact that IEEE doubles
can be added exactly using a trick due to Decker, using seven IEEE
additions and maintaining the result as two nonoverlapping doubles).
Cheers,
 Hervé Brönnimann hervebronnimann_at_[hidden]
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk