Boost logo

Boost :

From: Hervé Brönnimann (hervebronnimann_at_[hidden])
Date: 2007-01-31 20:23:45


On Jan 31, 2007, at 10:38 AM, Hans Meine wrote:
> On Wednesday, 31. January 2007 16:19, Ben FrantzDale wrote:
>>> Isn't this already a problem with sequences? Suppose you're
>>> storing the
>>> current median, and a new value comes along. What's the new median -
>>> without checking all the previous values? There are statistical
>>> estimates
>>> for this, but I don't know any exact way other than essentially
>>> resorting
>>> the data and checking the new median.
>>
>> Good point. Combining partial results may not be possible for all
>> accumulated statistics or there may be a cost to making a statistic
>> "combinable". Perhaps a refinement of the Accumulator concept
>> would be
>> needed.
>>
>> The idea of the Accumulator library is to implement online
>> algorithms; is
>> there a corresponding word for online algorithms for which partial
>> results
>> can be combined?
>
> Actually, I don't see any "online algorithms" whose partial results
> cannot be
> combined.
>
> IIUC, what James pointed out is exactly that the median is not an
> online
> algorithm. AFAICS, the "partial result" (in terms of inner
> representation)
> of a median accumulator consists of the (possibly sorted) complete
> input seen
> so far. (I would love to be corrected, but I cannot imagine a
> solution.)

Just to contribute to the discussion: The keyword for this sort of
algorithms is Data Streams (where your memory storage must be small,
e.g bounded or very slowly growing, no matter how many elements the
stream has), and they're now part of any theoretical computer science
conference (database conferences also sport a few of these).
Theoretical computer science tells you that indeed, it is not
mathematically possible to maintain the median of a stream of N
values without Omega(N) storage, essentially remembering all the values.

BTW, there are faster algorithms for finding the median than sorting,
using quickselect algorithm for instance (essentially
std::nth_element). It's also possible in linear storage to maintain
the median at logarithmic cost per new element.

Online algorithms is not a well-defined term in computer science, but
online or competivity analysis of algorithms refers to something
else: the online competitiveness ratio of an algorithm that processes
a stream online is the ratio between its solution, and the solution
of the best offline algorithm (which knows all the values in
advance). I.e. it measures what you lose by not knowing the future.

Examples of data stream algorithm are of course the average (you only
need to maintain sum and count), or higher moments (sum of squares,
etc.) with all the pitfalls that have been detailed with John's
torture tests earlier in this stream.

But if you don't insist on the median, you can get approximate
medians (of rank within (1/2+/-epsilon)) for a data stream cheaply,
using a random sample of size roughly 1/epsilon (a bit larger
actually, to take care of the random deviations). There are
deterministic algorithm (by Manku, Motwani, or Greenwald and Khanna.

Another simple problem is: maintain the count of *unique* items (i.e.
eliminating the redundant ones) over a stream of N elements. At
first it seems you need to maintain a hash-set, or some
representation, of all the distinct elements. But it's possible to do
this approximately as well, with very small storage.

Another fascinating sort of problems happen when you want to maintain
e.g. the average of the last M elements over a stream of N elements,
with M fixed, referred to as "sliding window" problems. Or when
elements can be cancelled from the stream.

A good but somewhat dated survey on data stream algorithms for those
who want/have time to delve deeper is given by S. Muthukrishnan:
http://athos.rutgers.edu/~muthu/stream-1-1.ps (in Postscript). Much
food for thought. Muthu used to be at AT&T as well as Rutgers and
now works at Google (hint, hint...)

Cheers,

--
Hervé Brönnimann
hervebronnimann_at_[hidden]

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk