# Boost

Subject: Re: [boost] Countertree
Date: 2012-11-20 14:24:17

> > On Mon, Nov 19, 2012 at 9:02 PM, Leo Goodstadt <leo.goodstadt_at_[hidden]>wrote:
> > 1) For counter trees, the median is just the N/2-th value in the
> > counter tree, where K is the size of the tree.
> > The Nth-percentile can be calculated similarly.
> > A rolling median with a fixed size window requires there to be an
> > insertion and deletion for each (most!) new value, so I am guessing
> > it should run in O(N log K) time, where N = size of data, K = size of
> > window
> >
> > IIUC, it is better to avoid modifying original data (size N), the rolling
> window of size K can be represented by a multiset based on an augmented
> tree. This method guarantees complexity of obtaining 1st median value for
> O(K log K) operations, which is the cost of construction of the rolling
> window, every subsequent median value can be obtained for O(log K)
> operations, when the window rolls. This approach should be more efficient.
>

Oops. Careless editing made me mix up my Ks and Ns, hopelessly confusing everyone.

So given a large data set of size N, if we want to calculate medians
across a rolling window of size K, we only need to build a counter
tree of size K (the window size). The median is then the K/2-th value
in the tree.

As you point out, the entire operation runs in O(N log K) time.
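To make the shape of the computation concrete, here is a minimal sketch of the rolling-median loop. It stands in for the counter tree with a plain sorted list kept via `bisect` (so insert/delete are O(K) here rather than the O(log K) an augmented tree would give), but the structure of the algorithm, and the "K/2-th value" median lookup, is the same:

```python
import bisect

def rolling_medians(data, k):
    """Yield the median of each length-k window over data.

    The window is kept as a sorted list; remove/insort here cost O(k)
    per step, whereas a counter (order-statistic) tree would make them
    O(log k). Selecting the median is a direct index into the sorted
    window, mirroring the k/2-th rank query in a counter tree.
    """
    window = sorted(data[:k])          # O(k log k) construction of the window
    yield window[k // 2]
    for old, new in zip(data, data[k:]):
        window.remove(old)             # value leaving the window
        bisect.insort(window, new)     # value entering the window
        yield window[k // 2]           # median = k/2-th value
```

With a real balanced tree carrying subtree counts, each iteration of the loop becomes O(log K), giving the O(N log K) total discussed above.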

All sensible approaches to calculating medians I can think of take O(N
log K) time. The question comes down to what the constants are.

The great advantage of a counter tree is that we are not limited to
medians (the 50th-percentile value). We can collect, for example, the
values at 25%, 50%, and 75%, and so on, at only marginal extra cost. I
presume that most of the cost lies in rebalancing the tree after
insertions/deletions and updating the counters.
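As a sketch of that point: once the window structure supports rank queries, reading off several percentiles is just several index lookups against the same window. The helper below (a hypothetical name, not from the original post) does this against a sorted list; in a counter tree each lookup would be an independent O(log K) descent:

```python
def window_percentiles(window, fractions=(0.25, 0.50, 0.75)):
    """Read several percentiles from one already-sorted window.

    Each percentile is a single rank lookup, so asking for 25%, 50%
    and 75% costs only marginally more than the median alone -- the
    expensive part (maintaining the window) is shared.
    """
    k = len(window)
    return [window[min(int(f * k), k - 1)] for f in fractions]
```

For example, on a sorted window of the values 1..8 this returns the 3rd, 5th and 7th elements.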

The only shortcut is the trivial optimisation that if the value
entering the window is the same as the one being removed at the other
end, then the median obviously stays the same.
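That shortcut drops straight into the rolling loop as a one-line guard, sketched here on the same sorted-list stand-in for the counter tree:

```python
import bisect

def rolling_medians_skip(data, k):
    """Rolling medians with the trivial shortcut: when the value
    entering the window equals the one leaving it, the window's
    contents -- and hence the median -- are unchanged, so the
    insert/delete/rebalance work can be skipped entirely."""
    window = sorted(data[:k])
    med = window[k // 2]
    yield med
    for old, new in zip(data, data[k:]):
        if new != old:                 # identical in/out value: skip the update
            window.remove(old)
            bisect.insort(window, new)
            med = window[k // 2]
        yield med
```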

Leo