Subject: Re: [boost] [histogram] Variance
From: Hans Dembinski (hans.dembinski_at_[hidden])
Date: 2018-10-01 12:04:35


Hi,

> On 26. Sep 2018, at 15:16, a.hagen-zanker--- via Boost <boost_at_[hidden]> wrote:
>
> Sorry to get back to the issue of variance. I am unsure about the justification of choosing the variance based on the Poisson distribution instead of the binomial distribution.
>
> My understanding is that the Poisson distribution is based on a distribution of a number of events given a continuous domain of opportunities (say a period of time). Whereas a binomial distribution is for a number of events for a discrete number of opportunities (say coin flips).
>
> Both seem appropriate in some use cases. However, the histogram class has no sense of the passage of time, whereas it does know the number of discrete opportunities (every time operator () is called). And the typical use of histogram seems to be to distribute a given number of samples over the bins that they belong to.
>
> So, would it not be more appropriate to estimate variance based on the binomial distribution?

The choice is between the Poisson distribution and the multinomial distribution (the binomial you mention is the multinomial seen from a single bin), and it is a bit subtle.
https://en.wikipedia.org/wiki/Multinomial_distribution
Either can be correct, depending on the scenario.

Poisson is correct, for example, when you monitor for a while a random process that produces some value x at random points in time with a constant rate. You bin the outcomes, and then stop monitoring at an arbitrary point in time. This is the right way to model many physics experiments. It is also correct if you conduct a survey with a random number of participants, i.e. when you pass the survey to a large number of people without knowing beforehand how many are going to respond.

Multinomial is correct when there is a predefined, fixed number of events, each with a random exclusive outcome, and you bin those outcomes. The important point is that the number of events is fixed before the experiment is conducted. This is the main difference from the previous case, where the total number of events is not known beforehand. This would be correct if you conduct a survey with a fixed number of participants, whom you invite explicitly, and you don't start the analysis before all of them have returned the survey.
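
To make the two scenarios concrete, here is a rough simulation sketch in Python (standard library only). It is not part of the histogram library; the names poisson_sample, bin_count and simulate are made up for this illustration, and the numbers (50 events, a bin hit with probability 0.5) are arbitrary. It repeats the "experiment" many times, once with a fixed number of events and once with a Poisson-distributed number of events, and estimates the variance of the count in one bin.

```python
import math
import random
from statistics import pvariance


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's product method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def bin_count(rng, n_events, p):
    """Count how many of n_events land in a bin that each event hits with probability p."""
    return sum(rng.random() < p for _ in range(n_events))


def simulate(trials=10000, n=50, p=0.5, seed=1):
    """Return (variance with fixed total n, variance with Poisson-distributed total)."""
    rng = random.Random(seed)
    # Scenario "fixed number of events": the total n is decided beforehand.
    fixed = [bin_count(rng, n, p) for _ in range(trials)]
    # Scenario "monitor for a while": the total itself fluctuates around n.
    open_ended = [bin_count(rng, poisson_sample(rng, n), p) for _ in range(trials)]
    return pvariance(fixed), pvariance(open_ended)


if __name__ == "__main__":
    var_fixed, var_open = simulate()
    print("fixed total:  ", var_fixed, "expected about", 50 * 0.5 * 0.5)
    print("Poisson total:", var_open, "expected about", 50 * 0.5)
```

With only two equally likely outcomes the difference is large: the fixed-total variance should come out near n p (1 - p) = 12.5, the open-ended one near n p = 25.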

If you have many bins in your histogram, the difference between the two becomes negligible. The variance for a multinomial count is n p (1 - p), where n is the total number of events and p is the probability of falling into this bin. The variance for a Poissonian count is n p, if you write it in the same way. If you have many bins, then p << 1 and n p (1 - p) = n p - n p^2, so the two variances differ only by the relative amount p.
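
As a quick numerical check of this argument, the following small Python sketch evaluates both formulas for histograms with an increasing number of bins. It assumes, purely for illustration, that all bins are equally likely (p = 1 / number of bins) and that n = 1000; the function name count_variances is hypothetical.

```python
def count_variances(n, p):
    """Return (multinomial variance, Poisson variance) for one bin.

    n is the total number of events (or its expectation in the Poisson case),
    p is the probability that an event falls into this bin.
    """
    return n * p * (1.0 - p), n * p


if __name__ == "__main__":
    n = 1000
    for n_bins in (2, 10, 100, 1000):
        p = 1.0 / n_bins  # assumption: equally likely bins
        multinomial, poisson = count_variances(n, p)
        # The relative difference (poisson - multinomial) / poisson equals p.
        rel = (poisson - multinomial) / poisson
        print(n_bins, multinomial, poisson, rel)
```

The relative difference printed in the last column is just p, so it shrinks as the number of bins grows.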

Best regards,
Hans

