Boost logo

Boost :

Subject: Re: [boost] Formal Review of Proposed Boost.Histogram Library Starts TODAY
From: Hans Dembinski (hans.dembinski_at_[hidden])
Date: 2018-09-21 10:23:21


*** Please ignore the previous email, I started writing an answer, but then had to stop. Somehow that email was send accidentally.***

Dear Alex,

thank you very much for the review! Let me answer some comments.

> On 20. Sep 2018, at 16:51, a.hagen-zanker--- via Boost <boost_at_[hidden]> wrote:
> Based on that I have a couple of questions / remarks
>
> *Axis_type concept (should probably be AxisType)*

Ok (Steven also commented on this).

> Why must axis_type derive from labelled_base and iterator_mixin?

Good point, the label and the functionality provided by the iterator_mixin can be made optional.

https://github.com/HDembinski/histogram/issues/93

> My understanding is that concepts are ideally specified in terms of usage and behavior, rather than implementation. So for iterator_mixin, would it not be better to just require that the following works (with the usual semantics)

Ideally yes, "Although practicality beats purity." I will come back to this below.

> auto i = axis.begin();
> auto i = axis.end();
> auto i = axis.rbegin();
> auto i = axis.rend();
>
> You need to document what iterator concept you adhere to, e.g. I would promise/require OutputIterator with additional requirement of multipass.

I need to check, but I think these iterators can be made optional.

> Likewise for labelled base can you just require:
>
> boost::string_view str = axis.label();
> axis.label(str);
> auto a = axis.get_allocator(); // I don't really see the point of this to be honest

axis.get_allocator() can be optional. Some axis types need an allocator to get memory from the heap, other don't. I didn't want to store the allocator twice, so an axis that needs an allocator uses the label allocator.

> and for base:
>
> int a = size();
> int b = shape();
> int c = uoflow();
>
> shape() and uoflow() are not very intuitive names/functions to me, how about bool has_underflow() and bool has_overflow() to get the same information.

Do you have a suggestion for how to name "shape()" instead? I am not 100 % happy myself. extend() ? total_size() ?

I see the point with uoflow(). It should return the enum instead of an int. An implementation detail is leaking here.

> As a user implementing my own axis type, I would prefer just implementing those ten member functions over deriving from the two base classes. It would allow me to make an axis-type that is independent of boost / boost::histogram, and avoid complex inheritance. And, if I wanted, I could still derive from the base classes.

The only requirement I want to keep is that users have to derive from axis::base. I don't see how this limits anyone in their freedom to implement custom axis types, and it ensures that there is a standard layout in memory for the basic numbers that an axis has to provide to the histogram host. This enables optimisations that matter for the dynamic version of the histogram, where the axis types are stored in a variant-like type. If all axis types derive from axis::base, then I can write a visitor that static_casts to axis::base to query these numbers. We checked that gcc then removes the visitation code, because it repeatedly does the same thing. As compilers become better and better, this optimisation will also be applied by other compilers.

In short, for the sake of performance, it is mandatory that all axis types derive from axis::base, while deriving from iterator_mixin and axis::labeled_base should be optional.

> Further to those, you would just have
>
> auto b = axis[i];// where b is of type axis::bin_type
> auto i = axis.index(v); // where v is type axis::value_type
>
> For these you need to document how the under- and overflow bins are numbered

Agreed, this will become part of the specs.

> For the AxisType concept definition to be complete, you must also define a BinType concept for the bin_type and a SortableValue concept (I suppose) for the value_type.

Yes to BinType concept, while the SortableValue concept is only required for some axis types. The axis::category only requires the value_type to be equal-comparable and not sortable.

> Why did you chose to make the label a part of the axis? Of course there are many cases where we like the axis to have a label, but is that not true for many other classes, for instance std::vector, std::map, etc.

I don't know what you mean by "std::vector, std::map". I agree with you that you need to be able to make a custom axis without a label if you wish.

> Making this a default seems to be against the stated idea of "you don't pay for what you don't need". It seems to me that you require me to use a pair<axis, string> where I would be happy to use an axis.

All the internal axis types have a builtin label, which is a design decision based on the experience that labels are generally useful to a large audience. Once you can write your own axis types that do not use labels, I think this point is settled, right?

> Why is it necessary for axes to be comparable? And is is not asking for trouble to also compare the axis title (considering lower/upper case typos, etc.)

Histograms should be addable, but you should only be able to add histograms with the same logical layout. It makes no sense to add two histograms which have the same number of bins, but different axis types. Or further, which have the same axis types, but different axis labels. The label is a way to make a logical distinction between two otherwise identical axis types, therefore it is included in the comparison. How exactly do you foresee trouble?

> *Not documenting the histogram class*
>
> I could not find the documentation/reference for the histogram class. I suppose this is on purpose because there are two implementations that have a common interface? In that case I would document the common interface as a concept too.

The histogram template is not included in the reference, but that is a mistake. It should be there, it is not an implementation detail (reference generator puts *unspecified*, but that's wrong). Perhaps someone with more experience with the doxygen scripts in Boost can help?

I merged the two implementations into one templated class a while ago, which is much prettier, and updated the docs. Since you found an obsolete reference to two implementations, could you please point me to where you read that?

> In the tutorial there is the following part to create histogram by copying or moving axes from a std::vector of axes.
>
> auto h1 = make_dynamic_histogram(v.begin(), v.end()); // copying axes
> auto h2 = bh::histogram<decltype(v)>(std::move<v>); // moving axes
>
> Now, h1 seems very reasonable to me, but for h2 you now need to use the undocumented histogram class. It is not clear to me what the two template arguments are for in histogram<A,S>. It strikes me as strange that A should be a vector<axis::any_std>, because having the axes stored in a vector seems to be an implementation detail irrelevant to the user.

It is not an implementation detail. Histogram supports two ways of storing axis types, via std::tuple or via std::vector<axis::any<…>>. These two options are part of the contract. Anyway, these line of thought is based on the assumption that the histogram itself is unspecified, which is not the case. Sorry about the error in the reference.

> Perhaps this could be avoided by providing factory functions, that take a Range of axes instead of pairs of iterators. e.g.
>
> template<typename AxisRange> make_dynamic_histogram_from_range(AxisRange&& axes);
> template<typename Storage, typename AxisRange> make_dynamic_histogram_from_range(Storage&& storage, AxisRange&& axes);

Optional support for ranges or std::span can be added at a later point, this is just syntactic sugar.
https://github.com/HDembinski/histogram/issues/94

> as this is Perfect Forwarding, you can automatically decide whether to move or copy. Furthermore, the user is not limited to providing axes in a std::vector

I see no need for the user to use anything other std::vector or std::tuple to store axis types. The implementation could be made more flexible in that regard, but there is simply nothing to gain by doing so, only an increase in code complexity.

> *Not having an overflow bin for Circular axes*
>
> How can this type of axes deal with NaN input?

Good point! Passing NaN to a circular axis is currently UB. There are two solutions.

1) Allow an optional overflow bin to count NaNs (off by default)
2) Fail an assert when NaN is passed

I prefer option 1.
https://github.com/HDembinski/histogram/issues/95

> *Fill Histogram*
>
> The functional style example seems overly complicated and apparently (according to your comments) depends on smartness of the compiler, would the following not be preferred?
>
> auto h2 = make_static_histogram(
> bh::axis::regular<>(8, 0, 4),
> bh::axis::regular<>(10, 0, 5));
> for(auto&& v : input_data) h2(v);
>
> I know this is not strictly functional, but you could easily provide another factory function to do this (i.e. something like "make_and_fill_histogram").

I am not sure what you mean, but I think you want me to change the example? I would not use functional style myself for the same reasons you mention, but some people requested this and so I put an example in the guide to show it is possible and discuss the caveats. I think this is a good solution to the documentation problem.

> *Access bin counts*
>
> It seems unfortunate that the methods to access the indices of a bin are methods of the iterator rather than the value. This would preclude most uses of range-based for-loops with histograms. I see that to change this could be tricky, depending on the iterator concept you wish to support. Perhaps it is an option to provide separate Ranges (Views) for histograms that iterate only over values; over values and variances, over indices and values and over indices, values and variances.
>
> You could have something like this:
>
> for(auto&& v : h) // alternatively: for(auto&& v : indices_and_values_view(h))
> {
> std::cout << v.idx(0) << ' ' << v.idx(1) << ' ' << v.value() << std::endl;
> }
>
> Instead of:
>
> for (auto it = h.begin(); it != h.end(); ++it)
> {
> std::cout << it.idx(0) << ' ' << it.idx(1) << ' ' << it->value() << std::endl;
> }

I also like range-based for-loops, and I considered this solution, but then decided against it. Providing these views adds complexity to the interface and the implementation just to enable a bit of syntactic sugar. The current iterator provides all extra information you may want to query when you iterate over a multi-dimensional histogram, but also behaves like a primitive iterator over the counter values, so that simple algorithms like std::accumulate work. STL algorithms do not work anyway with this data structure when you have to do something more complex with the content, because they were not designed to handle multi-dimensional data.

But somehow I guess users will not be happy unless they can use range-based for-loops, so I will reconsider.
https://github.com/HDembinski/histogram/issues/96

> *Functionality*
>
> The documentation says that automated bin boundaries are not an option, because there is no consensus on how to do this. While I understand you would not want to take this on, I would think a library that includes a number of methods to do this with the possibility to add your own would be very valuable and lack of consensus is no real argument. For instance, GIS packages do this to create legends for geographical maps, and I would like to have this capability at my fingertips.

Any form of automated bin boundary computation that requires multi-pass on the data will not be handled by the library.

I am currently considering simple forms of automated bin boundary computation that can be implemented in single-pass, like automatically adding bins to a regular axis when a value falls outside the current range.

> It would be useful to be able to aggregate bins with an intuitive interface, e.g. if I have the population age distribution in 1-year intervals, aggregate it to 5-year intervals. This would be a generalization of the reduce_to function.

I plan to add a support library with common algorithms. "reduce_to" will become a free function and move there, and an option to aggregate bins like you suggest will also be added, called "rebin". Perhaps these two can/should be combined.

https://github.com/HDembinski/histogram/issues/98

> I like that you included the option to combine storages. Have you also considered the option to subtract storages / samples? This would make histogram more suitable for moving window type analysis.

Subtracting histograms is currently not implemented, because I originally wanted to maintain that bin entries are always non-negative. But it should be optionally available if the storage supports negative entries.

https://github.com/HDembinski/histogram/issues/99

> *Conclusion*
>
> I vote for inclusion in Boost because it offers fundamental and very useful functionality.

Thanks!

> - What is your evaluation of the implementation?
> -- Didn't look in detail. To me it seemed more complex than necessary, but perhaps that is where the good performance comes from.

If you have specific suggestions on how to simplify things, let me know. In general, the more use and corner cases the library has to support, the more complex its implementation gets.

Best regards,
Hans


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk