
Boost : 
From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2007-08-10 09:55:12
On 8/9/07, Eric Niebler <eric_at_[hidden]> wrote:
> >>> Nonetheless, it would be best if it were possible to
> >>> specify that a sample exists at offset X, where X is double, int, or
> >>> units::seconds, without worrying about any other details, including
> >>> discretization. That is, discretization seems useful to me only for
> >>> regularly-spaced time series, and seems like noise for
> >>> arbitrarily-spaced time series.
> >>
> >> Discretizations are useful for coarse- and fine-graining operations that
> >> resample the data at different intervals. This can be useful even for
> >> time series that are initially arbitrarily-spaced.
> >>
> >> Sometimes you don't care to resample your data at a different
> >> discretization, or call the integrate() algorithm. In those cases, the
> >> discretization parameter can be completely ignored. It does tend to
> >> clutter up the docs, but no more than, say, the allocator parameter
> >> clutters up std::vector's docs.
> >
> > Is discretization then properly a property of the series itself?
>
>
> You can think of discretization as an "SI unit" for series offsets. The
> analogy isn't perfect because the discretization is more than just type
> information -- it's a multiplicative factor that is logically applied to
> offsets. More below.
It's the second part I have a problem with. More below.
> > If
> > the offsets of each sample are not related to the discretization, why
> > have both in the same container? I find this very confusing.
>
> Think of a dense series D with integral offsets and values representing
> a quantity polled at 5ms intervals. D[0] represents the value of the
> quantity at T=0ms, D[1] represents the value at 5ms, etc.... In this
> case, the discretization is 5ms.
>
> In series types that are not dense, having a non-unit discretization is
> not as compelling. But it's useful for consistency. If I replace a dense
> series with a sparse series, I don't want to change how I index it. And
> if I am given two series -- one dense and one sparse -- as long as their
> discretizations are the same, I can traverse the two in parallel,
> confident that their offsets represent the same position in the series.
>
>
> > To
> > accommodate the algorithms you mention above, would it be possible to
> > simply say that I want to resample using a scale factor instead? What
> > I'm getting at here is that discretization and offset seem to have a
> > very muddy relationship. Doing everything in terms of offset seems
> > clearer to me, and I don't yet see how this simplification loses
> > anything useful.
>
> Would you agree that, although you can do arithmetic on untyped numbers,
> it's often a bad idea? Yes, you can resample an untyped series with an
> untyped scale factor. But guarding against Mars-lander-type units
> mashups is just one use for discretizations. See above.
>
> If discretization really doesn't matter for your application, you can
> just not specify it. It will default to int(1). Or you can use the
> lower-level range_run_storage classes. I have suggested elsewhere that
> they can be pulled out into their own sub-library. They lack any notion
> of discretization.
You misunderstand me. I'm all for DiscretizationType (the template
parameter -- yay, type-safety), and I'm all for discretization (the
value and associated functions) for dense series. What I'm against is
using discretization (the value) for non-dense series. (Note that we
have run into the Discretization/discretization naming ambiguity again
here.) I find it a confusing value to keep around, especially since
it can be simply ignored, as you pointed out in a previous email. A
data value that you can access and that is put forward as a
first-class element of the design -- but that is also ignored --
suggests a better design is possible. Here's what I suggest:
The Discretization template parameter becomes OffsetType, which is IMO
a more accurate name, and follows the RangeRun concepts. I will be
using that name below.
The "OffsetType discretization" ctor parameter, the "OffsetType
discretization()" accessors, and "void discretization(OffsetType d)"
mutators should only be applied to the dense_series<> type.
The user should be able to access any data in any type exclusively by
using offsets. This means that dense_series<> seamlessly handles the
mapping between (possibly floating point) offset and sample index that
I requested initially.
This has these advantages:
- The notion of discretization is not introduced into types for which it
  has questionable meaning, e.g. piecewise_constant_series<>.
- Offsets can be used exclusively, both on input (as before) and output
  (as before, but now including dense_series<> with floating point
  offsets).
- Floating point and integral offsets are now treated much more uniformly.
Is this reasonable? Have I perhaps missed something fundamental?
> >>> - It might be instructive to both the Boost.TimeSeries developers and
> >>> some of its potential users if certain common signal-processing
> >>> algorithms were implemented with the library, even if just in the
> >>> documentation. For example, how might one implement a sliding-window
> >>> normalizer over densely populated, millisecond-resolution data? What
> >>> if this normalization used more than two time series to do its work?
> >>> It may well be possible with the current framework, but a) it's not
> >>> really clear how to do it based on the documentation and b) the
> >>> documentation almost seems to have a bias against that kind of
> >>> processing.
> >> I wonder why you say that. The library provides a 2series transform()
> >> algorithm that is for just this purpose.
> >
> > That's why I asked about "more than two time series". Such
> > convolutions of multiple time series can be done in one pass, and
> > Boost.TimeSeries does this admirably for N=2, but rewriting
> > transform() for N>2 is a lot for most users to bite off.
>
>
> Oh, yes. It gets pretty hairy. If we restrict ourselves to non-infinite
> series (ones without pre- and post-runs) it is straightforward to
> traverse N series in parallel. Even for infinite series it's doable with
> a thin abstraction layer. The trouble comes when you want the extreme
> performance that comes from taking advantage of algorithm specialization
> for things like denseness or unit runs. Choosing an optimal parallel
> series traversal becomes a combinatorial explosion. In these cases, I
> think picking a traversal strategy that is merely Good instead of The
> Best is probably the way forward.
Does this mean that an N-series transform may be in the offing?
> >> As for the rolling window calculations, I have code that does that, and
> >> sent it around on this list just a few weeks ago. I hope to add the
> >> rolling average algorithm soon. It uses a circular buffer, and would
> >> make a good example for the docs.
> >
> > I agree. This would be a great addition to the docs.
>
> Not just for the docs. It should be a reusable algorithm in the library.
Good to hear.
> >>> As it stands, no. If there were clearly-defined relationships between
> >>> samples and their extents and offsets; better support for large and/or
> >>> piecewise-mutable time series; a rolling-window algorithm; and better
> >>> customizability of coarse_grain() and fine_grain(), I would probably
> >>> change my vote.
> >>
> >> I'm still not clear on what you mean by "clearly-defined relationships
> >> between samples and their extents and offsets." The rest is all fair.
> >> Rolling-window is already implemented, but not yet included.
> >
> > I was alluding to my issue with the relationships among
> > discretization, offset, and run that I mentioned earlier.
>
> Is it any clearer now?
It was always clear to me; I just disagreed with the design. I hope my
objections are clearer now.
Zach Laine
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk