# Boost

From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2007-08-10 09:55:12

On 8/9/07, Eric Niebler <eric_at_[hidden]> wrote:
> >>> Nonetheless, it would be best if it were possible to
> >>> specify that a sample exists at offset X, where X is double, int, or
> >>> units::seconds, without worrying about any other details, including
> >>> discretization. That is, discretization seems useful to me only for
> >>> regularly-spaced time series, and seems like noise for
> >>> arbitrarily-spaced time series.
> >>
> >> Discretizations are useful for coarse- and fine-graining operations that
> >> resample the data at different intervals. This can be useful even for
> >> time series that are initially arbitrarily-spaced.
> >>
> >> Sometimes you don't care to resample your data at a different
> >> discretization, or call the integrate() algorithm. In those cases, the
> >> discretization parameter can be completely ignored. It does tend to
> >> clutter up the docs, but no more than, say, the allocator parameter
> >> clutters up std::vector's docs.
> >
> > Is discretization then properly a property of the series itself?
>
>
> You can think of discretization as an "SI unit" for series offsets. The
> analogy isn't perfect because the discretization is more than just type
> information -- it's a multiplicative factor that is logically applied to
> offsets. More below.

It's the second part I have a problem with. More below.

> > If
> > the offsets of each sample are not related to the discretization, why
> > have both in the same container? I find this very confusing.
>
> Think of a dense series D with integral offsets and values representing
> a quantity polled at 5ms intervals. D[0] represents the value of the
> quantity at T=0ms, D[1] represents the value at 5ms, etc.... In this
> case, the discretization is 5ms.
>
> In series types that are not dense, having a non-unit discretization is
> not as compelling. But it's useful for consistency. If I replace a dense
> series with a sparse series, I don't want to change how I index it. And
> if I am given two series -- one dense and one sparse -- as long as their
> discretizations are the same, I can traverse the two in parallel,
> confident that their offsets represent the same position in the series.
>
>
> > To
> > accommodate the algorithms you mention above, would it be possible to
> > simply say that I want to resample using a scale factor instead? What
> > I'm getting at here is that discretization and offset seem to have a
> > very muddy relationship. Doing everything in terms of offset seems
> > clearer to me, and I don't yet see how this simplification loses
> > anything useful.
>
> Would you agree that, although you can do arithmetic on untyped numbers,
> it's often a bad idea? Yes, you can resample an untyped series with an
> untyped scale factor. But guarding against Mars-lander type units
> mash-ups is just one use for discretizations. See above.
>
> If discretization really doesn't matter for your application, you can
> just not specify it. It will default to int(1). Or you can use the
> lower-level range_run_storage classes. I have suggested elsewhere that
> they can be pulled out into their own sub-library. They lack any notion
> of discretization.

You misunderstand me. I'm all for DiscretizationType (the template
parameter -- yay, type safety), and I'm all for discretization (the
value and its associated functions) for dense series. What I'm against
is using discretization (the value) for non-dense series. (Note that
we have run into the Discretization/discretization naming ambiguity
again here.) I find it a confusing value to keep around, especially
since it can simply be ignored, as you pointed out in a previous
email. A data value that is accessible and put forward as a
first-class element of the design, yet can be ignored entirely,
suggests that a better design is possible. Here's what I suggest:

1. The Discretization template parameter becomes OffsetType, which is
   IMO a more accurate name and follows the RangeRun concepts. I will
   use that name below.
2. The "OffsetType discretization" ctor parameter, the
   "OffsetType discretization()" accessor, and the
   "void discretization(OffsetType d)" mutator apply only to the
   dense_series<> type.
3. The user should be able to access any data in any series type
   exclusively by using offsets. This means that dense_series<>
   seamlessly handles the mapping between (possibly floating point)
   offset and sample index that I requested initially.
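
To make this concrete, here is a rough sketch of the interface I am
proposing. This is hypothetical code, not the library's current API;
the class shapes and member names are only illustrative:

```cpp
// Hypothetical interface sketch -- not the library's actual API.
// The Discretization template parameter is renamed OffsetType, and
// discretization (the value) survives only on dense_series<>.

template <typename Value, typename OffsetType = int>
struct dense_series
{
    // Only the dense container carries a sampling interval.
    explicit dense_series(OffsetType discretization = OffsetType(1));

    OffsetType discretization() const;    // accessor
    void discretization(OffsetType d);    // mutator

    // Indexing is by offset; the container maps a (possibly
    // floating point) offset to the underlying sample index.
    Value& operator[](OffsetType offset);
};

template <typename Value, typename OffsetType = int>
struct piecewise_constant_series
{
    // No discretization ctor parameter, accessor, or mutator here;
    // samples are addressed purely by offset.
    Value& operator[](OffsetType offset);
};
```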

This has these advantages:

- The notion of discretization is not introduced into types for which
  it has questionable meaning, e.g. piecewise_constant_series<>.
- Offsets can be used exclusively, both on input (as before) and on
  output (as before, but now including dense_series<> with floating
  point offsets); see the example below.
- Floating point and integral offsets are treated much more uniformly.
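
For example, under this scheme a dense and a non-dense series would be
read the same way (again, hypothetical code; the constructor taking a
sampling interval is an assumption of the sketch above):

```cpp
// Hypothetical usage of the proposed interface.
dense_series<double, double> dense(0.005);    // sampled every 5ms
piecewise_constant_series<double, double> sparse;

// Both series are read purely by offset; dense_series maps the
// offset 0.010 to its internal sample index (here, index 2).
double a = dense[0.010];
double b = sparse[0.010];
```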

Is this reasonable? Have I perhaps missed something fundamental?

> >>> - It might be instructive to both the Boost.TimeSeries developers and
> >>> some of its potential users if certain common signal-processing
> >>> algorithms were implemented with the library, even if just in the
> >>> documentation. For example, how might one implement a sliding-window
> >>> normalizer over densely populated, millisecond resolution data? What
> >>> if this normalization used more than two time series to do its work?
> >>> It may well be possible with the current framework, but a) it's not
> >>> really clear how to do it based on the documentation and b) the
> >>> documentation almost seems to have a bias against that kind of
> >>> processing.
> >> I wonder why you say that. The library provides a 2-series transform()
> >> algorithm that is for just this purpose.
> >
> > That's why I asked about "more than two time series". Such
> > convolutions of multiple time series can be done in one pass, and
> > Boost.TimeSeries does this admirably for N=2, but rewriting
> > transform() for N>2 is a lot for most users to bite off.
>
>
> Oh, yes. It gets pretty hairy. If we restrict ourselves to non-infinite
> series (ones without pre- and post-runs), it is straightforward to
> traverse N series in parallel. Even for infinite series it's doable with
> a thin abstraction layer. The trouble comes when you want the extreme
> performance that comes from taking advantage of algorithm specialization
> for things like denseness or unit runs. Choosing an optimal parallel
> series traversal becomes a combinatorial explosion. In these cases, I
> think picking a traversal strategy that is merely Good instead of The
> Best is probably the way forward.

Does this mean that an N-series transform may be in the offing?
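
For what it's worth, even a non-specialized version would be useful.
Here is a minimal sketch of a "merely Good" N-ary transform, assuming
each series is just a sorted map from offset to value and that missing
samples read as zero. The names (simple_series, transform_n) are mine,
not the library's, and this deliberately ignores the dense/unit-run
specializations whose combinations explode for N > 2:

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

typedef std::map<double, double> simple_series;   // offset -> value

template <typename NaryFunction>
simple_series transform_n(std::vector<simple_series> const& ss,
                          NaryFunction f)
{
    // Union of all offsets appearing in any input series.
    std::set<double> offsets;
    for (std::size_t i = 0; i != ss.size(); ++i)
        for (simple_series::const_iterator run = ss[i].begin();
             run != ss[i].end(); ++run)
            offsets.insert(run->first);

    // Evaluate f once per offset on the vector of sample values.
    simple_series result;
    for (std::set<double>::const_iterator o = offsets.begin();
         o != offsets.end(); ++o)
    {
        std::vector<double> args(ss.size());
        for (std::size_t i = 0; i != ss.size(); ++i)
        {
            simple_series::const_iterator pos = ss[i].find(*o);
            args[i] = (pos == ss[i].end() ? 0.0 : pos->second);
        }
        result[*o] = f(args);
    }
    return result;
}
```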

> >> As for the rolling window calculations, I have code that does that, and
> >> sent it around on this list just a few weeks ago. I hope to add the
> >> rolling average algorithm soon. It uses a circular buffer, and would
> >> make a good example for the docs.
> >
> > I agree. This would be a great addition to the docs.
>
> Not just for the docs. It should be a reusable algorithm in the library.

Good to hear.
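
In the meantime, the shape of such a rolling-window calculation is
easy to sketch. This is only an illustration of the idea (a windowed
mean over dense samples), not the code Eric posted; the deque here
stands in for his circular buffer:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

std::vector<double> rolling_mean(std::vector<double> const& samples,
                                 std::size_t window)
{
    std::deque<double> buf;   // stands in for a circular buffer
    double sum = 0.0;
    std::vector<double> result;
    result.reserve(samples.size());
    for (std::size_t i = 0; i != samples.size(); ++i)
    {
        buf.push_back(samples[i]);
        sum += samples[i];
        if (buf.size() > window)   // evict the oldest sample
        {
            sum -= buf.front();
            buf.pop_front();
        }
        result.push_back(sum / buf.size());
    }
    return result;
}
```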

> >>> As it stands, no. If there were clearly-defined relationships between
> >>> samples and their extents and offsets; better support for large and/or
> >>> piecewise-mutable time series; a rolling-window algorithm; and better
> >>> customizability of coarse_grain() and fine_grain(), I would probably
> >>> change my vote.
> >>
> >> I'm still not clear on what you mean by "clearly-defined relationships
> >> between samples and their extents and offsets." The rest is all fair.
> >> Rolling-window is already implemented, but not yet included.
> >
> > I was alluding to my issue with the relationships among
> > discretization, offset, and run that I mentioned earlier.
>
> Is it any clearer now?

It was always clear to me; I just disagreed with the design. I hope my
objections are clearer now.

Zach Laine