Subject: Re: [boost] proposed new library "histogram"
From: Thijs van den Berg (thijs_at_[hidden])
Date: 2016-05-05 16:36:38
On 5 May 2016 at 00:21, Hans Dembinski <hans.dembinski_at_[hidden]> wrote:
> Hi everybody,
>
> I recently added a new library called "histogram" to the Boost Incubator.
> I would like to advertise it a little here in the hope of finding a person
> interested in reviewing it. I hope that shameless self-advertisement is not
> going against some rule of this list, but I am sure you will let me know.
>
> My background is in analysis of big data in the fields of particle physics
> and astroparticle physics. Boost is very popular among my peers, since it
> is a free, high-quality, rich, and very well maintained collection of
> libraries. There is a growing number of tools to do statistical analysis in
> Boost and I think this project would fit in nicely, and fill a gap. We work
> with histograms a lot, so that's where my interest came from.
>
> I am a senior programmer in C++ and Python with 10 years of experience.
> Guiding development through code reviews and tickets, as well as taking on
> responsibility for continuous maintenance, are natural for me. Naturally, I
> am willing to commit free time to maintain the project should it be
> accepted, and do my share of the work in this community.
>
> I put a lot of thought and effort into this project, the rationale and my
> design choices are explained in the documentation, which I wrote according
> to the advice given at the Boost Incubator website. The project is feature
> complete from my side. What it needs now is the input from the Boost
> community to round off possible edges and to make the interface rich enough
> for everybody. I am good at considering the user perspective, but I cannot
> anticipate everyone's needs.
>
> In case you got interested, here are the links:
>
> Incubator link:
>
> http://rrsd.com/blincubator.com/bi_library/histogram-2/?gform_post_id=1582
>
> github link:
>
> https://github.com/HDembinski/histogram
>
> Best regards,
>
> Hans
>
>
>
Hi Hans,
Interesting ideas.
I have some algorithmic questions. I'd like to learn about the details
behind the "just works" friendly objective so that I can decide whether it
will work for me, and under what circumstances. One reason I sometimes
pick C++ instead of Python is performance, especially when I need to
handle large datasets. In those cases the details often matter. So, if I
were going to consider using it, it would be helpful to see performance
metrics, e.g. compared to some naive alternative.
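To make "naive alternative" concrete, this is roughly the baseline I have
in mind: a fixed-width, one-dimensional counter array. It is only a sketch.
The type NaiveHistogram and its members are my own hypothetical names and
have nothing to do with the proposed library's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical baseline, not the proposed library: equal-width bins on
// [lo, hi), one integer counter per bin, no variance, no overflow bins.
struct NaiveHistogram {
    double lo, hi;
    std::vector<unsigned long> counts;

    NaiveHistogram(std::size_t nbins, double lo_, double hi_)
        : lo(lo_), hi(hi_), counts(nbins, 0) {
        assert(nbins > 0 && lo_ < hi_);
    }

    // Samples outside [lo, hi) are silently dropped.
    void fill(double x) {
        if (x < lo || x >= hi) return;
        std::size_t i =
            static_cast<std::size_t>((x - lo) / (hi - lo) * counts.size());
        if (i >= counts.size()) i = counts.size() - 1;  // rounding guard
        ++counts[i];
    }
};
```

Timing a loop of fill() calls on this against the same loop on the library
would already tell me a lot about the cost of the extra features.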
I've read that you compute variance: can that computation be switched
on and off (e.g. I might not need it)? Also, there are various online
(single-pass, weighted) variance algorithms: some are stable, others are
not. Which one have you implemented? Does it use std::accumulate? It would
be nice to reassure numerically focused users about the quality of the
internals.
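As an example of what I would count as a stable choice, here is a minimal
sketch of the incremental weighted update (the Welford-style recurrence,
extended to weights). I am not claiming this is what the library does. The
name WeightedRunningStats is hypothetical, and I assume strictly positive
weights and the population form of the variance (m2 / sum_w).

```cpp
#include <cassert>

// Hypothetical single-pass weighted mean/variance accumulator.
// It updates the mean and the weighted sum of squared deviations (m2)
// incrementally, instead of computing sum(w*x*x) - sum(w*x)^2 / sum(w),
// which can lose precision through cancellation.
struct WeightedRunningStats {
    double sum_w = 0.0;  // total weight seen so far
    double mean = 0.0;   // running weighted mean
    double m2 = 0.0;     // running weighted sum of squared deviations

    void add(double x, double w = 1.0) {
        assert(w > 0.0);  // assumption: strictly positive weights
        sum_w += w;
        const double delta = x - mean;  // deviation from the old mean
        mean += (w / sum_w) * delta;
        m2 += w * delta * (x - mean);   // uses old and new mean
    }

    // Population variance; returns 0 when nothing has been added.
    double variance() const { return sum_w > 0.0 ? m2 / sum_w : 0.0; }
};
```

By contrast, accumulating the sum and the sum of squares separately (for
instance with two std::accumulate passes or one fused loop) and subtracting
at the end is the variant I would be worried about for large offsets.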
I would also like to see information about the computational and memory
complexity of two other internal algorithms I think I saw mentioned:
1) automatic re-binning: when you modify bins, do you split a single
bin, or do you readjust *all* bin boundaries? Do you keep a sorted list
inside each bin?
2) sparse storage: I know this is a complex field where lots of trade-offs
can be made. E.g. suppose I fill a 10-dimensional histogram with samples
that (only) have elements on the diagonal, a potential worst-case scenario
for some methods:
    for (int i : {1, 2, 3, 4, 5})
        h.fill({i, i, i, i, i, i, i, i, i, i});
Would this result in 5 sparse bins (the bins on the diagonal), or 5^10 bins
(the outer product of ten axes, each with 5 bins)?
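To spell out the two outcomes I am comparing, here is a small sketch. It
assumes a map keyed by the full bin-index tuple as the sparse layout, which
is only one of many possible designs. SparseHistogram, Index, kDims and
dense_cells are hypothetical names of mine, not the library's.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>

const std::size_t kDims = 10;
typedef std::array<int, kDims> Index;  // one bin index per axis

// Hypothetical sparse layout: only bins that were actually hit are stored.
struct SparseHistogram {
    std::map<Index, std::uint64_t> bins;
    void fill(const Index& idx) { ++bins[idx]; }
};

// Number of cells a dense layout allocates up front: bins_per_axis^dims.
inline std::uint64_t dense_cells(std::uint64_t bins_per_axis,
                                 std::size_t dims) {
    std::uint64_t n = 1;
    for (std::size_t d = 0; d < dims; ++d) n *= bins_per_axis;
    return n;
}
```

Filling the five diagonal points into this SparseHistogram leaves 5 map
entries, whereas dense_cells(5, 10) is 9765625 cells. The price of the map
is O(log n) per fill and per-node overhead, which is the kind of trade-off
I would like to see documented.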
Thanks,
Thijs