Boost logo

Boost :

Subject: Re: [boost] [histogram] Working with standard algorithms
From: Hans Dembinski (hans.dembinski_at_[hidden])
Date: 2017-11-20 11:51:24


Dear Bjørn,

> On 18. Nov 2017, at 17:08, Bjorn Reese via Boost <boost_at_[hidden]> wrote:
>
> I would like to see a better integration between boost.histogram and
> standard algorithms, especially those from <numeric>. Today they do not
> work together because the dereference operator of the axis iterator
> returns a pair (index and content) rather than the content.
>
> An example of where standard algorithms could be useful is calculating
> the cumulative distribution of a one-dimensional histogram, which ought
> to be a simple as:
>
> auto& axis = h.axis();
> std:partial_sum(axis.begin(), axis.end(), axis.begin());
>
> Another example is ranking a set of histograms based on the cosine
> similarity [1] using std::inner_product().

I totally see the point for partial_sum. Could you also provide a code example for how you would like to compute the cosine similarity?

I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed.

How should we define such iterators so that they generalise to multiple dimensions? For example, it would be easy to implement new iterators which just iterates over the bin content:

auto h = make_static_histogram(axis::regular(3, 0, 1));
// fill […], then
std::vector<double> cumulative;
std::partial_sum(h.begin(), h.end(), cumulative.begin(), std::back_inserter(cumulative));

Questions:
- What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over?

The current design keeps the internal order an implementation detail (I use column-major internally, because it is easier to compute the strides on the fly), the two choices are fortran-like (column-major) or C-like (row-major). Perhaps provide both with clear names so it is obvious? Or just row-major, because that's the standard in C++? These questions must have been already discussed during the design of Boost.MultiArray. I think I should provide a way to provide a MultiArray copy of the bin content, thus leveraging all the work that already went into defining MultiArray. Similarly one could provide a read-only view of the variance estimates. A read-only MultiArray view of the internal bin counts of the histogram is also possible for array_storage policy, but not for the default adaptive_storage policy.

- Writable iterators?
In your example, you use writable iterators, to change the content (and the meaning of the content) of the histogram itself. In my reply here, I use read-only iterators, because changing the content of the histogram seems to go against the semantics. A histogram is more specific than a general container of values, the content has a strong meaning that cannot be arbitrarily changed. To be precise, what should the variance estimate, returned by `histogram::variance(…)` be for a cumulative count? The variance estimate for cumulative count is a whole matrix of covariances, since the successive entries are not independent. There are other operations where it is not clearly defined how the variance estimate should be computed. I think the best way to avoid these problems is to provide read-only iterators.

Let me know what you think. I also added your message as an issue on github.

Best regards,
Hans


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk