Subject: Re: [boost] [histogram] should some_axis::size() return unsigned or int?
From: Hans Dembinski (hans.dembinski_at_[hidden])
Date: 2018-11-30 12:27:54


> On 30. Nov 2018, at 12:58, Alexander Grund via Boost <boost_at_[hidden]> wrote:
>
>
> Am 30.11.18 um 11:36 schrieb Hans Dembinski:
>> You are overestimating the importance of the *-flow bins, I think. Users usually ignore them when they analyse their histograms. They must be there for other reasons which are explained in the rationale and they are very useful for expert-level statistical analyses and for debugging. The beginner, however, should not notice their presence.
>>
>> In fact, the `indexed` range adaptor should probably skip them by default, and only iterate over them when that is explicitly requested.
> Sounds reasonable: A range excluding the over/underflow bins and one including it.
>> An axis is not a container. It does not hold values and it has no operator[], precisely to emphasise this difference. It has size() though. See my email to Gavin with a long explanation why I think that makes sense.
> Your code example was the following:
>
> for (unsigned i = 0; i < axis.size(); ++i) {
> auto x = h[i];
> // do something with bin
> }

`i` is the index for the histogram here (!) not for the axis itself. The histogram is also not a container, but it acts more like one. It also has a size() method and the value really includes all bins.

A histogram is technically a multi-dimensional array, but semantically it is more than that. In a histogram, each dimension of the multi-dimensional array also has an associated array of values. The axis types logically represent these arrays, which have indices and values. The axis itself is not very much like a container and I don't expect users to run STL algorithms on axis objects.
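
To illustrate the distinction between an axis and a container, here is a minimal sketch of an axis that only translates between values and indices and stores no bin contents. `toy_axis` and its members are hypothetical names for illustration, not the Boost.Histogram API, and the convention that out-of-range values map to -1 and size() is an assumption based on the [0, size()) discussion in this thread.

```cpp
#include <cassert>

// Hypothetical toy axis for illustration only; not the Boost.Histogram API.
// It maps values to bin indices and back, but holds no bin contents.
struct toy_axis {
  int n;          // number of inner bins
  double lo, hi;  // covered range [lo, hi)

  // Number of inner bins; the *-flow bins are not counted.
  int size() const { return n; }

  // Inner bins have indices 0..n-1. Assumed convention: values below lo
  // map to -1 (underflow), values at or above hi map to n (overflow).
  int index(double x) const {
    if (x < lo) return -1;
    if (x >= hi) return n;
    int i = static_cast<int>((x - lo) / (hi - lo) * n);
    return i < n ? i : n - 1;  // guard against floating-point round-up
  }

  // Lower edge of bin i: the "value" side of the axis.
  double lower(int i) const { return lo + (hi - lo) * i / n; }
};
```

With `toy_axis a{4, 0.0, 10.0}`, the call `a.index(2.6)` yields 1 and `a.lower(2)` yields 5.0, while `a.size()` stays 4 no matter which values are looked up.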

> So it looks like a container, although size and []-operator are in different instances (which feels weird, but ok)

I hope it is clearer now.

>>> Other idea: If those bins are so special that they don't fit into the [0, size()) range, why not use a different function for getting them, which is not the index operator? high_bin()/low_bin() come to mind.
>> See explanation to Gavin why this is worse.
> Combining this with "Users usually ignore them[...] the `indexed` range adaptor should probably skip them by default" I do see the need for extra functions here too. Your argument against "high_bin()/low_bin()" was: Iteration must be split. But your above comment already suggests, that there are iterators which can cover the whole range. Could they solve this split-iteration-problem?

Yes, but we are diverging from my original question now. Users are recommended to use the range adaptors and iterators provided by the library. For the adaptors and iterators, all cases can be gracefully implemented.

But I know my potential users very well. They will also use integers as indices to loop over the histogram, because people in the target community are often beginner programmers and using an integer index feels natural to them. The question is how to optimise the design for this use case so that it does not clash with similar (mis)usage of STL containers.

>>> But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which starts at -inf and size()-1 to be the last one spanning to inf? This would allow a histogram of size 1 which has a single bin holding all values.
>> And why would you want such an axis? It would be pointless and make the histogram operate slower.
>
> I was not saying this should be done. It would just be consistent. There are 2 dimensions:
> - open ranged bins yes/no
> - number of bins
> In my mind enabling open ranged bins does not ADD bins but makes the first and last go to +-inf:
>
> axis(4,0,10,"",uoflow_type::on) -> [-inf,0), [0,5), [5,10), [10, inf]
> axis(4,0,10,"",uoflow_type::off) -> [0,2.5), [2.5,5), [5,7.5), [7.5,10)

No, not a good idea. If we do that, then toggling *-flow bins on/off changes the whole program you have written so far: it will do something completely different.

A user should be able to code the analysis and then decide: "Ah crap, these extra bins cost too much memory and in my special case they are always empty anyway, because my values never go out of the range that I specified. So let's just turn them off". Doing that optimization should not change the logic of the program you wrote so far.
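
A minimal sketch of this argument, under stated assumptions: `toy_hist` is a hypothetical 1D histogram, not the Boost.Histogram API, in which the *-flow bins are extra cells stored around the inner bins. Switching them off only drops the two extra cells; the index and content of every inner bin stay the same.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy 1D histogram for illustration; not the Boost.Histogram API.
struct toy_hist {
  int n;                   // number of inner bins
  double lo, hi;           // covered range [lo, hi)
  bool flow;               // are under-/overflow cells present?
  std::vector<int> cells;  // n cells, plus 2 extra cells if flow is on

  toy_hist(int n_, double lo_, double hi_, bool flow_)
      : n(n_), lo(lo_), hi(hi_), flow(flow_), cells(flow_ ? n_ + 2 : n_, 0) {}

  // Same index rule in both modes: inner bins are 0..n-1,
  // -1 means underflow, n means overflow (assumed convention).
  int index(double x) const {
    if (x < lo) return -1;
    if (x >= hi) return n;
    int i = static_cast<int>((x - lo) / (hi - lo) * n);
    return i < n ? i : n - 1;  // guard against floating-point round-up
  }

  void fill(double x) {
    int i = index(x);
    if (!flow && (i < 0 || i >= n)) return;  // no flow cells: drop the value
    ++cells[i + (flow ? 1 : 0)];
  }

  // Valid for i in [0, n); also for i == -1 and i == n when flow is on.
  int at(int i) const { return cells[i + (flow ? 1 : 0)]; }
};
```

Filling `toy_hist(4, 0, 10, true)` and `toy_hist(4, 0, 10, false)` with the same values gives identical counts for indices 0 to 3; only the cells at -1 and 4 exist in the first and not in the second.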

> Of course this might be confusing so default should be "off" as "users usually ignore them" so they are advanced things one does not generally need, right?

I can see that you did not read the rationale carefully yet.

Best regards,
Hans

