Boost logo

Boost :

From: Martin Knoblauch Revuelta (mkrevuelta_at_[hidden])
Date: 2007-02-02 09:37:21


> On Feb 1, 2007, at 1:29 PM, Martin Knoblauch Revuelta wrote:
> > IMHO, the best way to do it is using a tree [..]

On 2/2/07, Hervé Brönnimann <hervebronnimann_at_[hidden]> wrote:
> If the purpose is calculating the median, forgive me for being the
> academic professor.

Oh, please, be.

> But that is an overkill. You should maintain
> two binary heaps of equal size (plus or minus one) with the lower
> half being a max heap, and the upper half being a min-heap.
> [..] being much more compact, the constants will be much lower.

You are right.

> Note that there are also min-max heaps that work in O(log N) time per
> insertion or extract-min or extract-max, [..] That enables
> to have exact quantiles in O(t+log N) time per insertion.
> Maintaining t quantiles [..]

Sounds really good.

> [..] the overhead
> would be up to double the storage (assuming doubling rule for
> expanding vectors)

Does it depend on t (the number of quantiles)?

> and you'd have to compare that to a r/b tree
> node. For elements such as integers, double still compares favorably
> to binary tree nodes which has an overhead of three pointers and a
> rank per integer.

Oh yes, and in my solution you'd have to add five fields more (two
pointers and three integers).

> > [..] not only the median, but any desired percentile in O(log N) time[..]

> Agreed. If you don't want just the median, but have the rank
> selection operation for any rank, a tree would be also my way to go.

That's why I posted my message. It can be slightly off topic, since it
requires O(N) storage, but it was mentioned several times and I
couldn't resist the temptation... :)

> However, in that case, I wouldn't use the accumulator library to
> start with.

Yes, storing samples (or just unique values) sounds to me contrary to
what accumulators are intended to be.

> [..]
> Then there is the issue of distinct numbers,
> you'd have to decide for your application if N is much smaller than
> the total number of multiplicities...

Of course, this depends on the number of possible values, the nature
of the distribution...

By the way, you could do the same thing with the heaps (or with the
sorted vector). I mean, storing just unique values, together with
counters.

> > [..] precise calculations. [..]

> On a side note, maintaining the elements in order allows to recompute
> the sum (from smaller to larger in absolute values) with the same
> precision as the underlying floating point, i.e. 2^-52 for IEEE
> doubles.

Good point. Though, "recompute the sum" takes O(N) time. If the sum is
required after every insertion...

> If precision is a real issue, that can be an incentive. On
> the other hand, if N is small, I'd much rather maintain the sum in
> exact representation (using for instance the fact that IEEE doubles
> can be added exactly using a trick due to Decker, using seven IEEE
> additions and maintaining the result as two non-overlapping doubles).

Sounds interesting. I'll try to read something about it.

Thanks for your detailed and illustrating response.

Regards,

Martin


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk