Subject: Re: [boost] proposed new library "histogram"
From: Hans Dembinski (hans.dembinski_at_[hidden])
Date: 2016-05-06 15:10:35


Some performance metrics, as requested.

For more information, please have a look at the updated docs.

Test system: Intel Core i7-4500U CPU clocked at 1.8 GHz, 8 GB of DDR3 RAM

=================  =======  =======  =======  =======  =======  =======
distribution                uniform                    normal
-----------------  -------------------------  -------------------------
dimension          1D       3D       6D       1D       3D       6D
=================  =======  =======  =======  =======  =======  =======
No. of fills       12M      4M       2M       12M      4M       2M
C++: ROOT  [t/s]   0.127    0.199    0.185    0.168    0.143    0.179
C++: boost [t/s]   0.172    0.177    0.155    0.172    0.171    0.150
Py: numpy  [t/s]   0.825    0.727    0.436    0.824    0.426    0.401
Py: boost  [t/s]   0.209    0.229    0.192    0.207    0.194    0.168
=================  =======  =======  =======  =======  =======  =======

In these tests, using boost::histogram in Python is roughly 2 to 4 times
faster than using numpy.histogram.

On 05/05/2016 04:36 PM, Thijs van den Berg wrote:
> On 5 May 2016 at 00:21, Hans Dembinski <hans.dembinski_at_[hidden]> wrote:
>
>> Hi everybody,
>>
>> I recently added a new library called "histogram" to the Boost Incubator.
>> I would like to advertise it a little here in the hope of finding a person
>> interested in reviewing it. I hope that shameless self-advertisement is not
>> going against some rule of this list, but I am sure you will let me know.
>>
>> My background is in analysis of big data in the fields of particle physics
>> and astroparticle physics. Boost is very popular among my peers, since it
>> is a free, high-quality, rich, and very well maintained collection of
>> libraries. There is a growing number of tools to do statistical analysis in
>> Boost and I think this project would fit in nicely, and fill a gap. We work
>> with histograms a lot, so that's where my interest came from.
>>
>> I am a senior programmer in C++ and Python with 10 years of experience.
>> Guiding development through code reviews and tickets, as well as taking on
>> responsibility for continuous maintenance, are natural for me. Naturally, I
>> am willing to commit free time to maintain the project should it be
>> accepted, and do my share of the work in this community.
>>
>> I put a lot of thought and effort into this project; the rationale and my
>> design choices are explained in the documentation, which I wrote according
>> to the advice given at the Boost Incubator website. The project is feature
>> complete from my side. What it needs now is the input from the Boost
>> community to round off possible edges and to make the interface rich enough
>> for everybody. I am good at considering the user perspective, but I cannot
>> anticipate everyone's needs.
>>
>> In case you are interested, here are the links:
>>
>> Incubator link:
>>
>> http://rrsd.com/blincubator.com/bi_library/histogram-2/?gform_post_id=1582
>>
>> github link:
>>
>> https://github.com/HDembinski/histogram
>>
>> Best regards,
>>
>> Hans
>>
>>
>>
> Hi Hans,
>
> Interesting ideas.
> I have some algorithmic questions: I'd like to learn about the details
> behind the "just works" friendly objective so that I can decide if it will
> work for me -or not-, and under what circumstances. One reason I sometimes
> pick C++ instead of Python is because of performance, especially when I
> need to handle large datasets. In those cases the details often matter. So,
> if I was going to consider using it, it would be helpful to see performance
> metrics -e.g. compared to some naive alternative-.
>
> I've read that you compute variance: can that computation be
> switched on/off (e.g. I might not need it)? Also, there are various online
> (single pass, weighted) variance algorithms: some are stable, others not.
> Which one have you implemented? Does it use std::accumulate? It would be
> nice to reassure numerically focused users about the level of quality of the
> internals.
>
> I would also like to see information about the computational and memory
> complexity of two other internal algorithms I think I saw mentioned:
>
> 1) automatic re-binning: when you modify bins do you split a single
> bin, or do you readjust *all* bin boundaries? Do you keep a sorted list
> inside each bin?
>
> 2) sparse storage: I know this is a complex field where lots of trade-offs
> can be made. E.g. suppose I fill a 10-dimensional histogram with
> samples that (only) have elements on a diagonal -a potential worst-case
> scenario for some methods-:
>     for (int i : {1, 2, 3, 4, 5})
>         h.fill([i,i,i,i,i,i,i,i,i,i])
>
> Would this result in 5 sparse bins -the bins on the diagonal-, or 5^10 bins
> -the outer product of ten axes, each with 5 bins-?
>
> Thanks,
> Thijs
>
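
To make the quoted question about stable versus unstable single-pass weighted variance concrete, here is a minimal Python sketch. It is not taken from the histogram library and says nothing about which method the library uses; the class name WeightedRunningVariance and the helper naive_variance are hypothetical names chosen for illustration. The class applies the incremental update usually attributed to West and Welford, and the helper shows the textbook sum-of-squares formula that can lose precision when the mean is large compared to the spread.

```python
class WeightedRunningVariance:
    """Single-pass weighted mean and variance (incremental update).

    Hypothetical illustration only; not the histogram library's code.
    """

    def __init__(self):
        self.w_sum = 0.0  # sum of weights seen so far
        self.mean = 0.0   # running weighted mean
        self.s = 0.0      # running weighted sum of squared deviations

    def add(self, x, w=1.0):
        if w <= 0.0:
            raise ValueError("weight must be positive")
        self.w_sum += w
        delta = x - self.mean
        # Move the mean towards x in proportion to the new sample's weight.
        self.mean += (w / self.w_sum) * delta
        # Uses the old and the new mean, which avoids subtracting two
        # large, nearly equal numbers.
        self.s += w * delta * (x - self.mean)

    def variance(self):
        """Population variance of the weighted sample."""
        if self.w_sum == 0.0:
            raise ValueError("no samples")
        return self.s / self.w_sum


def naive_variance(xs):
    """E[x^2] - E[x]^2 for unit weights; prone to cancellation."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2


if __name__ == "__main__":
    data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]  # true variance: 22.5
    acc = WeightedRunningVariance()
    for x in data:
        acc.add(x)
    print("incremental:", acc.variance())
    print("naive:      ", naive_variance(data))
```

With a large offset such as the one in this data set, the two printed values may differ noticeably, which is the kind of behaviour the question is about.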

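The two outcomes in the quoted diagonal-fill question can be compared with a short Python sketch. It does not use the histogram library and makes no claim about what the library does; fill_sparse and dense_cell_count are hypothetical helpers, and each sample value is assumed to be usable directly as a bin index along its axis.

```python
from collections import Counter


def dense_cell_count(bins_per_axis):
    """Number of cells a dense grid allocates: the product of all axis sizes."""
    total = 1
    for n in bins_per_axis:
        total *= n
    return total


def fill_sparse(samples):
    """Count samples in a map keyed by the multi-dimensional bin index.

    Only bins that are actually hit get an entry.
    """
    counts = Counter()
    for index in samples:
        counts[tuple(index)] += 1
    return counts


if __name__ == "__main__":
    # Five samples on the diagonal of a 10-dimensional histogram.
    samples = [(i,) * 10 for i in range(1, 6)]
    sparse = fill_sparse(samples)
    print("sparse entries:", len(sparse))                # 5
    print("dense cells:   ", dense_cell_count([5] * 10))  # 9765625
```

A map-based store keeps memory proportional to the number of occupied bins but pays a lookup per fill, while a dense grid gives constant-time indexing at the price of allocating the full outer product.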
