Subject: [boost] [proposed][histogram]
From: Hans Dembinski (hans.dembinski_at_[hidden])
Date: 2017-04-12 09:37:43


Dear Boost community,

after another year of work, I hope to raise your interest once again in a much improved version of "histogram", a library proposed for inclusion in Boost.

https://github.com/HDembinski/histogram
http://blincubator.com/bi_library/histogram-2/?gform_post_id=1582

The library can be built with either b2 or CMake. It compiles successfully with several versions of clang and gcc and has 99 % test coverage (the remaining 1 % are essentially unreachable branches). Boost-style documentation is available.

The library implements a histogram class (a highly configurable policy-based template) for C++ and Python, written in C++11. Histograms are a standard tool to explore Big Data. They allow one to visualise and analyse distributions of random variables. A histogram provides a lossy compression of input data: gigabytes of input can be stored in a compact form which requires only a small fraction of the original memory. This makes histograms convenient for interactive data analysis and further processing.

A histogram implements a quantisation of a space of input values. The space is divided into non-overlapping cells. Instead of remembering the exact values of a tuple in that space, one just increases a counter in the corresponding cell. This is the lossy compression: you only remember the count in the cell, not the original values.
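To make the idea concrete, here is a minimal one-dimensional sketch of such a quantisation. It is not the interface of the proposed library: the class name SimpleHistogram1D and its members are hypothetical, and the sketch assumes equal-width cells plus one extra cell on each side for values outside the range.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration type; NOT the API of the proposed library.
// Divides [lo, hi) into `bins` equal-width cells, plus one underflow and
// one overflow cell for values outside that range.
class SimpleHistogram1D {
public:
    SimpleHistogram1D(std::size_t bins, double lo, double hi)
        : lo_(lo), hi_(hi), counts_(bins + 2, 0) {}

    // Forget the exact value; only increase the counter of its cell.
    void fill(double x) { ++counts_[index(x)]; }

    std::size_t bins() const { return counts_.size() - 2; }
    unsigned long long at(std::size_t i) const { return counts_[i + 1]; }
    unsigned long long underflow() const { return counts_.front(); }
    unsigned long long overflow() const { return counts_.back(); }

private:
    std::size_t index(double x) const {
        if (!(x >= lo_)) return 0;                // below range (NaN lands here too)
        if (x >= hi_) return counts_.size() - 1;  // above range
        std::size_t i =
            static_cast<std::size_t>((x - lo_) / (hi_ - lo_) * bins());
        if (i >= bins()) i = bins() - 1;          // guard against rounding at the edge
        return i + 1;                             // shift past the underflow cell
    }

    double lo_, hi_;
    std::vector<unsigned long long> counts_;
};
```

With four cells over [0, 1), filling 0.1 and 0.2 increases the first cell twice; the two original values can no longer be told apart afterwards.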

There are subtleties related to the counting if you want it to be safe and efficient. The library handles this for you. The standard policy implements intelligent counters which are fast, conserve memory, and are guaranteed not to overflow or get capped (capping happens when you use floating point numbers to count, as some implementations do). This is one of the two main features of the library.
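The following sketch illustrates both points under stated assumptions; it is not the library's actual storage policy. The first function shows the capping of a float counter, assuming IEEE 754 single precision. The hypothetical class GrowingCounters then shows the general idea of starting with small integers and widening only when needed. It stops at 64 bits, so unlike the library it does not rule out overflow entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assuming IEEE 754 single precision: 2^24 + 1 is not representable,
// so a float counter stops growing once it reaches 2^24.
bool float_counter_is_capped() {
    volatile float c = 16777216.0f;  // 2^24
    c = c + 1.0f;                    // rounds back to 2^24
    return c == 16777216.0f;
}

// Hypothetical illustration type; NOT the library's implementation.
// Counters start as 8-bit integers to save memory and are promoted to
// 64-bit integers the first time any of them would overflow.
class GrowingCounters {
public:
    explicit GrowingCounters(std::size_t n) : small_(n, 0) {}

    void increment(std::size_t i) {
        if (wide_.empty()) {
            if (small_[i] < std::numeric_limits<std::uint8_t>::max()) {
                ++small_[i];
                return;
            }
            // An 8-bit counter is full: copy everything into wide storage.
            wide_.assign(small_.begin(), small_.end());
            small_.clear();
            small_.shrink_to_fit();
        }
        ++wide_[i];
    }

    std::uint64_t value(std::size_t i) const {
        return wide_.empty() ? small_[i] : wide_[i];
    }

    bool is_wide() const { return !wide_.empty(); }

private:
    std::vector<std::uint8_t> small_;   // used until the first promotion
    std::vector<std::uint64_t> wide_;   // used after the first promotion
};
```

As long as every cell holds fewer than 256 counts, this sketch needs one byte per cell; the price is a branch on each increment and a one-off copy on promotion.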

The other main feature is that the library allows you to seamlessly create one- or multi-dimensional histograms using various schemes to divide the input space. A so-called axis class handles the division into cells along each dimension. There are several ways you might want to define the cells that divide the input space, and the library offers you interesting options. For example, if one of your inputs is an angle, the library provides a special axis class for periodic input values.
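As an illustration of what a periodic axis has to do, here is a small sketch that maps an angle in radians to a cell index. The function name periodic_bin is hypothetical and not part of the library; the sketch assumes equal-width cells covering [0, 2*pi), bins > 0, and a finite input.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper; NOT the library's axis class.
// Maps an angle (radians) to one of `bins` equal-width cells covering
// [0, 2*pi). Angles outside that interval wrap around instead of being
// counted as underflow or overflow. Assumes bins > 0 and a finite angle.
std::size_t periodic_bin(double angle, std::size_t bins) {
    const double two_pi = 6.283185307179586;
    double t = std::fmod(angle, two_pi);  // now in (-2*pi, 2*pi)
    if (t < 0) t += two_pi;               // now in [0, 2*pi]
    const std::size_t i = static_cast<std::size_t>(t / two_pi * bins);
    return i % bins;                      // t == 2*pi after rounding wraps to cell 0
}
```

An angle and the same angle plus a full turn land in the same cell, so there is no need for underflow or overflow cells along this dimension.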

What's new:
- Static polymorphism: The original version of the library solely used dynamic polymorphism to support different axis classes. This variant still exists, but the current version also allows you to use static polymorphism, implemented internally with boost::mpl and boost::fusion. In addition to the enhanced flexibility, this provides a speed boost, which makes this library faster than the open-source competitors I tested (CERN's ROOT framework, the GSL, and numpy.histogram).
- Arbitrary-precision counters: Counter overflow is now completely avoided by switching automatically to boost::multiprecision::cpp_int when the capacity of standard integer types is exhausted.
- Extensibility: The library is now much more extensible and configurable. All relevant behaviour can be exchanged by using another (possibly user-defined) policy. The library provides default trade-offs that should work for almost everybody, but you can choose your own trade-offs.
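The difference between the static and the dynamic variant can be sketched in one dimension as follows. This is only an illustration with plain templates and virtual functions, not the boost::mpl/boost::fusion machinery the library uses internally; all type names here (RegularAxis, StaticHistogram, AnyAxis, AxisHolder, DynamicHistogram) are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical axis: `n` equal-width cells over [lo, hi);
// out-of-range values are clamped to the first or last cell.
struct RegularAxis {
    std::size_t n;
    double lo, hi;
    std::size_t size() const { return n; }
    std::size_t index(double x) const {
        if (!(x > lo)) return 0;
        if (x >= hi) return n - 1;
        const std::size_t i =
            static_cast<std::size_t>((x - lo) / (hi - lo) * n);
        return i < n ? i : n - 1;
    }
};

// Static polymorphism: the axis type is a template parameter, so the
// call to index() is resolved at compile time and can be inlined.
template <class Axis>
class StaticHistogram {
public:
    explicit StaticHistogram(Axis a) : axis_(a), counts_(a.size(), 0) {}
    void fill(double x) { ++counts_[axis_.index(x)]; }
    unsigned long at(std::size_t i) const { return counts_[i]; }
private:
    Axis axis_;
    std::vector<unsigned long> counts_;
};

// Dynamic polymorphism: the axis type is erased behind a virtual
// interface, so it can be chosen at run time at the cost of a virtual call.
struct AnyAxis {
    virtual ~AnyAxis() {}
    virtual std::size_t size() const = 0;
    virtual std::size_t index(double x) const = 0;
};

template <class Axis>
struct AxisHolder : AnyAxis {
    explicit AxisHolder(Axis a) : axis(a) {}
    std::size_t size() const override { return axis.size(); }
    std::size_t index(double x) const override { return axis.index(x); }
    Axis axis;
};

class DynamicHistogram {
public:
    template <class Axis>
    explicit DynamicHistogram(Axis a)
        : axis_(new AxisHolder<Axis>(a)), counts_(a.size(), 0) {}
    void fill(double x) { ++counts_[axis_->index(x)]; }
    unsigned long at(std::size_t i) const { return counts_[i]; }
private:
    std::unique_ptr<AnyAxis> axis_;
    std::vector<unsigned long> counts_;
};
```

Both classes produce the same counts for the same axis and input. The static one fixes the axis type when the code is compiled, while the dynamic one can pick it at run time, which is what a Python interface typically needs.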

The new features were inspired by comments I got from the community. Now I am again at a point where I see the library in a very solid state.

Best regards,
Hans

