Subject: Re: [boost] [proposed][histogram]
From: Oswin Krause (Oswin.Krause_at_[hidden])
Date: 2017-04-12 11:26:10


On 2017-04-12 12:34, Bjorn Reese via Boost wrote:
> On 04/12/2017 11:37 AM, Hans Dembinski via Boost wrote:
>
>> The library implements a histogram class (a highly configurable
>> policy-based template) for C++ and Python in C++11 code. Histograms
>> are a standard tool to explore Big Data. They allow one to visualise
>> and analyse distributions of random variables. A histogram provides a
>> lossy compression of input data. GBytes of input can be put in a
>> compact form which requires only a small fraction of the original
>> memory. This makes histograms convenient for interactive data analysis
>> and further processing.
>
> Given that the compression is lossy, I am wondering how it compares
> with a distribution estimator like:
>
> https://arxiv.org/abs/1507.05073v2
>
> A common use-case when collecting numerical data is to determine the
> quantiles. Boost.Accumulators contains an estimator (extended_p_square)
> for that.
>
> The advantage of such estimators is that they execute in constant time
> and with constant memory usage, where the constant depends only on the
> required precision.
>
> PS: I am aware that this is a non-trivial question, so I do not expect
> an answer.

Hi,

Simple answer: Histograms are designed to estimate the pdf, not the
quantile function.

While it is true that a sufficiently good estimate of the pdf will give
you an estimate of the quantiles via the inverse of the cdf, the
obtainable precision depends on the size of the bins chosen for the
histogram.
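
To make the dependence on bin size concrete, here is a minimal sketch in
plain Python. It does not use the proposed library; the function names
histogram_counts and quantile_from_histogram are hypothetical, and the
equal-width binning and linear interpolation inside a bin are assumptions
chosen for illustration. When all samples fall inside the histogram
range, the estimate lands in the same bin as the empirical quantile, so
its error is at most about one bin width.

```python
import random


def histogram_counts(data, lo, hi, nbins):
    """Count samples in nbins equal-width bins over [lo, hi).

    Samples outside the range are dropped (an assumption of this sketch).
    """
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for x in data:
        if lo <= x < hi:
            # min() guards against rounding pushing x into bin nbins.
            counts[min(int((x - lo) / width), nbins - 1)] += 1
    return counts


def quantile_from_histogram(counts, lo, hi, q):
    """Estimate the q-quantile by inverting the cdf implied by the counts.

    The cdf is treated as linear inside each bin, which is where the
    precision loss comes from.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    width = (hi - lo) / len(counts)
    target = q * total
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= target:
            frac = (target - cumulative) / c  # position inside bin i
            return lo + (i + frac) * width
        cumulative += c
    return hi


# Same data, coarse versus fine binning: only the bin width changes.
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
for nbins in (4, 400):
    counts = histogram_counts(samples, -4.0, 4.0, nbins)
    print(nbins, quantile_from_histogram(counts, -4.0, 4.0, 0.9))
```

The two printed estimates of the 0.9 quantile come from the same samples
and differ only because of the bin width.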

On the other hand, if your data is multivariate or your pdf is
multimodal, quantiles are hard to use, whereas a histogram still lets
you do, for example, outlier detection.
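
A small sketch of that outlier case, again in plain Python and not based
on the proposed library. The function low_density_outliers, the
min_count threshold and the toy data are all hypothetical choices for
illustration. The data has two modes with a single point in the gap
between them; that point is exactly the median, so a rule based on
extreme quantiles would not flag it, while its histogram bin is nearly
empty.

```python
import statistics


def low_density_outliers(data, lo, hi, nbins, min_count):
    """Return samples that fall in bins holding fewer than min_count samples.

    Samples outside [lo, hi) are also reported as outliers (an assumption
    of this sketch).
    """
    width = (hi - lo) / nbins
    counts = [0] * nbins

    def bin_index(x):
        if not lo <= x < hi:
            return None
        return min(int((x - lo) / width), nbins - 1)

    for x in data:
        i = bin_index(x)
        if i is not None:
            counts[i] += 1

    outliers = []
    for x in data:
        i = bin_index(x)
        if i is None or counts[i] < min_count:
            outliers.append(x)
    return outliers


# Two modes near 1 and 9, plus one stray point in the gap at 5.
data = [1.0 + 0.01 * i for i in range(50)]
data += [9.0 + 0.01 * i for i in range(50)]
data.append(5.0)

print("median:", statistics.median(data))
print("outliers:", low_density_outliers(data, 0.0, 10.0, 10, 5))
```

The same idea extends to multivariate data by binning along each axis,
at the cost of a bin count that grows with the number of dimensions.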

Best,
Oswin

