Subject: Re: [boost] [proposed][histogram]
From: Hans Dembinski (hans.dembinski_at_[hidden])
Date: 2017-04-12 15:51:48


Hi Bjorn,

> Given that the compression is lossy, I am wondering how it compares with
> a distribution estimator like:
>
> https://arxiv.org/abs/1507.05073v2

I still have to read the reference carefully. It looks quite interesting, but I think the scope of such a density estimator is different.

Histograms are conceptually simple, and simplicity is sometimes a plus. If you really want to have an estimator of the data pdf, then other algorithms may be better. Histograms can be transformed into an estimator of the pdf, but that's not their primary use case in my experience.
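To illustrate the remark about turning a histogram into a pdf estimate, here is a minimal sketch in plain Python. It is not part of the proposed library; the function name `histogram_density` and the toy numbers are made up for the example. Each count is divided by the total number of entries and by the width of its cell, so the result integrates to one.

```python
def histogram_density(counts, edges):
    """Turn raw cell counts into a piecewise-constant pdf estimate.

    counts: number of entries per cell
    edges:  cell boundaries, len(edges) == len(counts) + 1
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalise an empty histogram")
    return [
        n / (total * (hi - lo))
        for n, lo, hi in zip(counts, edges[:-1], edges[1:])
    ]


# Toy data (hypothetical): three cells of unequal width.
counts = [2, 5, 3]
edges = [0.0, 1.0, 3.0, 4.0]
density = histogram_density(counts, edges)

# The estimate integrates to one over the histogram range.
integral = sum(d * (hi - lo) for d, lo, hi in zip(density, edges[:-1], edges[1:]))
```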

In my field, particle physics, we are usually not interested in the data pdf itself. We come up with a theoretical model pdf on our own, which depends on some parameter(s) of interest (e.g. the mass of a new particle). We then adjust this parameter until the theoretical model fits the data. This can be done by maximising the likelihood of the model given the data. If the data set is big, then it is more practical to use a histogram instead of the original data. We then maximise the likelihood of obtaining such a histogram.

For this purpose, histograms are great, because they have clear properties and the analysis is straightforward. The counts in the cells follow Poisson distributions, and the stochastic fluctuations are independent from cell to cell. Neither is true for smooth density estimators, which makes them unsuitable for model fitting.
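The fitting procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not code from the proposed library: the model is a normal distribution with known width, the parameter of interest is its mean `mu`, the counts are invented toy numbers, and the minimisation is a simple grid scan rather than a real optimiser. All function names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def expected_counts(edges, n_total, mu, sigma):
    """Model prediction: expected count in each cell for a given mu."""
    return [
        n_total * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


def poisson_nll(counts, expected):
    """Negative log-likelihood of independent Poisson counts.

    The log(n!) term is dropped because it does not depend on the model.
    """
    nll = 0.0
    for n, lam in zip(counts, expected):
        lam = max(lam, 1e-300)  # guard against log(0) far away from the data
        nll += lam - n * math.log(lam)
    return nll


def fit_mu(counts, edges, sigma, candidates):
    """Return the candidate mu with the smallest Poisson NLL (grid scan)."""
    n_total = sum(counts)
    return min(
        candidates,
        key=lambda mu: poisson_nll(counts, expected_counts(edges, n_total, mu, sigma)),
    )


# Toy histogram (invented numbers): eight unit-width cells from -3 to 5,
# with counts that are symmetric around x = 1.
edges = [float(x) for x in range(-3, 6)]
counts = [1, 21, 136, 341, 341, 136, 21, 1]

candidates = [i * 0.01 for i in range(-200, 401)]  # mu from -2.00 to 4.00
mu_hat = fit_mu(counts, edges, sigma=1.0, candidates=candidates)
print("fitted mu:", mu_hat)
```

Because the cells are treated as independent Poisson counts, the total likelihood is just a sum of per-cell terms, which is what keeps this kind of fit simple. For the symmetric toy counts, the scan should settle close to the centre of symmetry.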

> A common use-case when collecting numerical data is to determine the
> quantiles. Boost.Accumulators contains an estimator (extended_p_square)
> for that.

I had a look into Boost.Accumulators, and my impression was that the algorithms are for one-dimensional data only. The histogram library allows you to handle multi-dimensional input. This is in addition to what I wrote above about the necessity to statistically model the histogram counts.
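To make "multi-dimensional input" concrete, here is a small plain-Python sketch of a histogram with an arbitrary number of regular axes. It does not reproduce the interface of the proposed library; the `Axis` and `Histogram` classes are hypothetical, and for brevity values outside the axis range are simply dropped instead of being counted in under- and overflow cells.

```python
class Axis:
    """Regular axis: n equal-width cells covering [lo, hi)."""

    def __init__(self, n, lo, hi):
        self.n, self.lo, self.hi = n, lo, hi

    def index(self, x):
        """Return the cell index for x, or None if x is out of range."""
        if not (self.lo <= x < self.hi):
            return None
        i = int((x - self.lo) / (self.hi - self.lo) * self.n)
        return min(i, self.n - 1)  # protect against floating-point round-up


class Histogram:
    """N-dimensional histogram with counts stored in one flat list."""

    def __init__(self, *axes):
        self.axes = axes
        size = 1
        for axis in axes:
            size *= axis.n
        self.counts = [0] * size

    def fill(self, *values):
        """Count one entry; return False if it falls outside any axis."""
        if len(values) != len(self.axes):
            raise ValueError("need one value per axis")
        flat = 0
        for axis, x in zip(self.axes, values):
            i = axis.index(x)
            if i is None:
                return False
            flat = flat * axis.n + i  # row-major flat index
        self.counts[flat] += 1
        return True

    def at(self, *indices):
        """Return the count in the cell with the given per-axis indices."""
        flat = 0
        for axis, i in zip(self.axes, indices):
            flat = flat * axis.n + i
        return self.counts[flat]


# Two-dimensional example: 2 cells in x over [0, 1), 3 cells in y over [0, 3).
h = Histogram(Axis(2, 0.0, 1.0), Axis(3, 0.0, 3.0))
for point in [(0.1, 0.5), (0.1, 0.7), (0.9, 2.5), (1.5, 1.0)]:
    h.fill(*point)  # the last point is outside the x range and is dropped
```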

In summary, the histogram library is not a particularly clever density estimator, but it tries to be the most efficient and convenient implementation of a classical histogram.

Best regards,
Hans

