Boost logo

Boost :

From: Hervé Brönnimann (hervebronnimann_at_[hidden])
Date: 2007-01-31 20:39:38


Matthias: Forgive me, I haven't looked at your code yet. I wrote a
few accumulators, in my days, but I don't quite understand what you
mean by "combine". One may combine online algorithms, e.g. pass the
result of a filter to a summation or average accumulator. But I
don't quite grasp what you might do by combining a median into
another algorithm. Median is not really an online algorithms: It
maintains some quantities (e.g. a sample), but only at certain times
(periodically, or at the end) the result of the computation (median)
is queried. There is no continued output. Or you could imagine
feeding the median continuously to another accumulator (average
median of stream?) but it stretches the imagination?

For sampling, which has been a focus of my research recently, there
is of course the reservoir sampling, but a few other special-purpose
deterministic sampling algorithms have been devised. On my
publication web page, (http://photon.poly.edu/~hbr/publis-
themes.html, section 5 - shameless plug, sorry)) I give a few
sampling algorithms. I could program them, if I have the time. The
so-called Biased-L2 and DRS are much better at preserving
correlations / association rules for transactional data sets.

I also meant to note in passing that it is well known in database
community that sampling is bad for estimating sizes of join, and
other quantities dependent on correlation. There are sophisticated
ways (not too hard to program, though) to estimate them: the
celebrated Alon, Gibbons, Mattias, and Szegedy trick of projecting
onto a number of random vectors (if you're interested / inclined:
The Space Complexity of Approximating the Frequency Moments. J.
Comput. Syst. Sci. 58(1): 137-147 (1999), portal.acm.org/citation.cfm?
id=237823). They also fall within the area of data streams I
mentioned earlier, and it is an extremely powerful method for keeping
track of those quantities with a small memory footprint over a very
large stream.

Not sure if everyone is interested in these kinds of things, I
realize, but I hope I piqued the interest of a few.
Cheers,

--
Hervé Brönnimann
hervebronnimann_at_[hidden]
On Jan 31, 2007, at 4:54 PM, Matthias Troyer wrote:
> There are two types of accumulators that I can think of immediately:
>
> 1. approximate median and quantile estimators such as the p^2 methods
> cannot be combined but are very important for some applications
>
> 2. accumulators for correlates samples, e.g. autocorrelation
> estimators, or jackknife-bins cannot easily be combined. Instead of
> just merging the results one has to store the results from the
> individual accumulator_sets separately and combine them only in the
> final evaluations. Please keep in mind that correlated samples are
> very common.
>
> For 1. there is simply no way, and that is just it - one might either
> have to make this cause a compile-time error or drop the median and
> quantile estimators when combining. For 2. the problem is that
> combined accumulator sets need a different data structure than
> individual ones.
>
> Matthias
>
> _______________________________________________
> Unsubscribe & other changes: http://lists.boost.org/mailman/ 
> listinfo.cgi/boost

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk