Subject: Re: [boost] proposed new library "histogram"
From: Hans Dembinski (hans.dembinski_at_[hidden])
Date: 2016-05-05 18:47:44


Hi Thijs,

On 5/5/16 4:36 PM, Thijs van den Berg wrote:
> Hi Hans,
>
> Interesting ideas.
> I have some algorithmic questions: I'd like to learn about the details
> behind the "just works" friendly objective so that I can decide if it will
> work for me -or not-, and under what circumstances. One reason I sometimes
> pick C++ instead of Python is because of performance, especially when I
> need to handle large datasets. In those cases the details often matter. So,
> if I was going to consider using it, it would be helpful to see performance
> metrics -e.g. compared to some naive alternative-.
I have a simple benchmark comparison against three classes in the ROOT
framework comparing the performance on 1-dimensional, 3-dimensional, and
6-dimensional data. I will add the performance results to the
documentation. I tested the benchmark on two different computers and got
qualitatively different results, so I cannot draw a general conclusion.
The speed is roughly similar. If you would like to try it yourself, you
can check out the code and activate the CMake option BUILD_CHECKS, then
run the executable "nhistogram_speed" generated in the build directory.

I could, and probably should, also do a comparison against numpy.histogram.
If you have a particular kind of benchmark in mind, let me know; perhaps
I can implement it.

It is probably impossible to beat a specialized histogram type made
exclusively for 1d-data and a particular binning strategy, because a
dynamic solution like mine has additional overhead related to the
ability to define the binning algorithm and the number of dimensions at
run-time. Maybe an expert in this community sees a way to make it faster.

That being said, the performance gap is not big, which was explicitly
one of the design goals. I think that in most cases more CPU cycles are
spent generating or reading the data that is to be filled into the
histogram than are used by the histogram itself during binning and counting.
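To give a feeling for that ratio, here is a minimal self-contained
timing sketch; the fixed-axis binning below is just a stand-in
illustration, not the proposed library's API:

    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        constexpr int nbins = 100;
        constexpr double lo = 0.0, hi = 1.0;
        std::vector<unsigned> counts(nbins, 0);

        std::mt19937 rng(1);
        std::uniform_real_distribution<double> dist(lo, hi);

        const std::size_t n = 10000000;
        auto t0 = std::chrono::steady_clock::now();
        std::vector<double> data(n);
        for (auto& x : data) x = dist(rng);  // generate the data
        auto t1 = std::chrono::steady_clock::now();
        for (double x : data) {              // fill the "histogram"
            const int bin = static_cast<int>((x - lo) / (hi - lo) * nbins);
            if (bin >= 0 && bin < nbins) ++counts[bin];
        }
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::duration<double, std::milli>;
        std::cout << "generate: " << ms(t1 - t0).count() << " ms\n"
                  << "fill:     " << ms(t2 - t1).count() << " ms\n";
    }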

> I've read that you compute variance: can that computation be
> switched on/off (e.g. I might not need it)? Also, there are various online
> (single-pass, weighted) variance algorithms: some are stable, others not.
> Which one have you implemented? Does it use std::accumulate? It would be
> nice to reassure numerically focused users about the level of quality of
> the internals.
I don't use std::accumulate, since it does not fit into my scheme.

If you do not fill the histogram with weighted events, there is no
overhead involved for the variance estimate.

If you fill the histogram with normal data, without using weights, then
the variance is computed on-the-fly when the user requests it via
histogram::variance(...). When no weights are involved, the variance
estimate per bin is taken to be equal to the count in that bin, a common
estimate based on Poisson theory. When weights were used during the
filling, the variance estimate is the sum of squared weights. Storing
the sum of squared weights for each bin requires twice the memory.
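For illustration, a sketch of that bookkeeping (the names are mine, not
the library's interface):

    // Unweighted fills: one counter per bin, and variance(bin) equals
    // count(bin), the Poisson estimate, so no extra memory is needed.
    // Weighted fills: each bin keeps the sum of weights and the sum of
    // squared weights, which doubles the per-bin memory.
    struct weighted_bin {
        double wsum = 0.0;  // sum of weights -> the "count"
        double w2sum = 0.0; // sum of squared weights -> variance estimate
        void fill(double w) { wsum += w; w2sum += w * w; }
    };

    double count(const weighted_bin& b) { return b.wsum; }
    double variance(const weighted_bin& b) { return b.w2sum; }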

I did not implement an option to switch that off, since in a statistical
analysis, the variance estimate is as important as the actual count. I
think it is safe to assume that if you have a special case with weighted
data, you also want a variance estimate.

I will put more details on these things into the Notes section of the
documentation.

> I would also like to see information about the computational and memory
> complexity about two other internal algorithms I think I saw mentioned:
>
> 1) automatic re-binning: when you modify bins, do you split a single
> bin or do you readjust *all* bin boundaries? Do you keep a sorted list
> inside each bin?
I think there is a misunderstanding. There is no automatic re-binning in
the sense that the number of bins along each axis of the histogram is
changed. The number of bins is always the same, bins are not split.

What grows is the size of the integer used to store a bin count. I start
with 1 byte for each bin, which covers counts from 0 to 255. Once you
try to fill a bin that already has 255 counts, the internal memory for
*all* bins is re-allocated and replaced by a memory block that uses 2
bytes for each bin. In addition to the memory reallocation this involves
an O(N) copy that is done in place (N is the number of bins). This
procedure is repeated if any of the bins exceeds its new storage maximum
of 65535, and so on.

Since the reallocation is done for all bins at once, this overhead does
not occur very often and does not introduce a significant performance hit.
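A simplified sketch of this grow-on-overflow scheme (my own
illustration, assuming little-endian byte order; the real implementation
differs in its details):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    class grow_storage {
        std::vector<std::uint8_t> buf_; // raw bytes, width_ bytes per bin
        std::size_t nbins_;
        unsigned width_ = 1;            // bytes per bin: 1, 2, 4, 8

        std::uint64_t get(std::size_t i) const {
            std::uint64_t v = 0;
            std::memcpy(&v, buf_.data() + i * width_, width_);
            return v;
        }
        void set(std::size_t i, std::uint64_t v) {
            std::memcpy(buf_.data() + i * width_, &v, width_);
        }
        // double the width and copy all N bins: the O(N) step
        void grow() {
            const unsigned nw = width_ * 2;
            std::vector<std::uint8_t> nbuf(nbins_ * nw);
            for (std::size_t i = 0; i < nbins_; ++i) {
                const std::uint64_t v = get(i);
                std::memcpy(nbuf.data() + i * nw, &v, nw);
            }
            buf_.swap(nbuf);
            width_ = nw;
        }

    public:
        explicit grow_storage(std::size_t n) : buf_(n), nbins_(n) {}

        void increment(std::size_t i) {
            // widen all bins before this one would overflow its width;
            // a 64-bit counter is assumed never to overflow in practice
            if (width_ < 8 && get(i) == (1ull << (8 * width_)) - 1)
                grow();
            set(i, get(i) + 1);
        }
        std::uint64_t count(std::size_t i) const { return get(i); }
    };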

I considered more elaborate storage strategies where the number of bytes
could differ for each bin, but all those that I could think of would
significantly decrease performance and might actually increase the
memory footprint for really large histograms. I can store an integer in
1 byte, but I already need 8 bytes for a pointer, even if the pointer
points nowhere.

> 2) sparse storage: I know this is a complex field where lots of
> trade-offs can be made. E.g. suppose I fill a 10-dimensional histogram
> with samples that (only) have elements on a diagonal -a potential worst
> case scenario for some methods-:
> for(int i: {1, 2, 3, 4, 5})
> h.fill([i,i,i,i,i,i,i,i,i,i])
>
> would this result in 5 sparse bins -the bins on the diagonal-, or 5^10
> bins -the outer product of ten axes, each with 5 bins-?
This histogram implements a dense storage strategy for the sake of
performance, so your example would result in 5^10 bins, not 5. Writing a
histogram with sparse storage is not my design goal.

However, there is a base class which handles the binning and the axis
types, which a sparse histogram could re-use.
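For illustration, here is how a dense layout maps a multi-dimensional
bin index to one linear offset (my own sketch, not the library's code):
all 5^10 = 9765625 bins are allocated up front, even if only the 5
diagonal ones are ever filled.

    #include <cstddef>
    #include <vector>

    // offset of a bin in the single linear block of a dense histogram
    std::size_t linear_index(const std::vector<int>& idx,
                             const std::vector<int>& nbins) {
        std::size_t stride = 1, offset = 0;
        for (std::size_t d = 0; d < idx.size(); ++d) {
            offset += idx[d] * stride;
            stride *= nbins[d];
        }
        return offset;
    }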

Best regards,
Hans

