From: Eric Niebler (eric_at_[hidden])
Date: 2007-08-09 13:56:11


Zach Laine wrote:
> On 8/8/07, Eric Niebler <eric_at_[hidden]> wrote:
>>> I'm not really sure why dense_series<> does not allow floating point
>>> offsets. I understand that it is supposed to resemble std::vector.
>>> However, there is implicitly a relationship between the indices of the
>>> underlying vector and the times represented by the indices. It is
>>> possible (and quite convenient) for dense_series<> to hold a
>>> discretization_type that represents the start offset, and an interval,
>>> and perform the mapping for me. The lack of this mapping means that
>>> for dense time series of arbitrary discretization (say 0.5), I have to
>>> multiply all my times by 0.5 in user code when using the time series.
>>> I feel the library should take care of this for me; the fact that the
>>> underlying storage is vector-like should not force me to treat
>>> dense_series<> discretizations differently from all other series'
>>> discretizations.
>>
>> How does it force you to treat discretizations differently? Whether your
>> discretization is 1 or 0.5 or whether your underlying storage is dense
>> or sparse, it doesn't affect how you index into the series, does it? I'm
>> afraid I've missed your point.
>
> I wrote "discretization" when I meant to say "offset". Perhaps it's
> even that I'm merely confused, but this is actually making my point --
> I see a need for clarifying the relationships among discretization,
> offset, and run.

OK, I understand. Yes, the docs can certainly be improved.

> My whole point was that (as I understand it) in
> order to represent an offset of 3.14 I need to keep an extrinsic value
> somewhere that tells me how to convert between 3.14 and the integer
> offset used in dense_series<>' runs. Is that accurate?

Yes, that's right, because as it stands dense_series<> is the only
series type that doesn't allow FP offsets. That was a conservative
design choice: at the time I wasn't sure what a floating-point offset
should mean for a dense series. It's clear now that I need to rethink
FP offsets in general.
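Concretely, the workaround today looks something like this -- the
sample_interval constant and to_index helper below are just names I'm
making up for illustration, not anything the library provides:

    // Workaround: dense_series<> only takes integral offsets, so the
    // floating-point time has to be converted to an index by hand.
    double const sample_interval = 0.5;   // seconds between samples

    // illustrative helper -- maps a time in seconds to an integral offset
    int to_index(double time_in_seconds)
    {
        return static_cast<int>(time_in_seconds / sample_interval);
    }

    // e.g. the sample at t = 3.14s lives at integral offset 6:
    // series[to_index(3.14)] = value;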

> If so, isn't
> this at odds with the other series types, which let me specify double
> values for offsets directly?

Correct.

>>> Nonetheless, it would be best if it were possible to
>>> specify that a sample exists at offset X, where X is double, int, or
>>> units::seconds, without worrying about any other details, including
>>> discretization. That is, discretization seems useful to me only for
>>> regularly-spaced time series, and seems like noise for
>>> arbitrarily-spaced time series.
>>
>> Discretizations are useful for coarse- and fine-graining operations that
>> resample the data at different intervals. This can be useful even for
>> time series that are initially arbitrarily-spaced.
>>
>> Sometimes you don't care to resample your data at a different
>> discretization, or call the integrate() algorithm. In those cases, the
>> discretization parameter can be completely ignored. It does tend to
>> clutter up the docs, but no more than, say, the allocator parameter
>> clutters up std::vector's docs.
>
> Is discretization then properly a property of the series itself?

You can think of discretization as an "SI unit" for series offsets. The
analogy isn't perfect because the discretization is more than just type
information -- it's a multiplicative factor that is logically applied to
offsets. More below.

> If
> the offsets of each sample are not related to the discretization, why
> have both in the same container? I find this very confusing.

Think of a dense series D with integral offsets and values representing
a quantity polled at 5ms intervals. D[0] represents the value of the
quantity at T=0ms, D[1] represents the value at 5ms, etc.... In this
case, the discretization is 5ms.
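In other words, the time a sample refers to is just its offset
multiplied by the discretization. As a sketch (the function name is
mine, not the library's):

    // Sketch: recovering the time a sample refers to from its offset.
    // For the 5ms-polled series D above, time_at(0) == 0, time_at(1) == 5.
    int time_at(int offset, int discretization /* e.g. 5 (ms) */)
    {
        return offset * discretization;
    }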

In series types that are not dense, having a non-unit discretization is
not as compelling. But it's useful for consistency. If I replace a dense
series with a sparse series, I don't want to change how I index it. And
if I am given two series -- one dense and one sparse -- as long as their
discretizations are the same, I can traverse the two in parallel,
confident that their offsets represent the same position in the series.

> To
> accommodate the algorithms you mention above, would it be possible to
> simply say that I want to resample using a scale factor instead? What
> I'm getting at here is that discretization and offset seem to have a
> very muddy relationship. Doing everything in terms of offset seems
> clearer to me, and I don't yet see how this simplification loses
> anything useful.

Would you agree that, although you can do arithmetic on untyped numbers,
it's often a bad idea? Yes, you can resample an untyped series with an
untyped scale factor. But guarding against Mars-lander type units
mash-ups is just one use for discretizations. See above.

If discretization really doesn't matter for your application, you can
just not specify it. It will default to int(1). Or you can use the
lower-level range_run_storage classes. I have suggested elsewhere that
they can be pulled out into their own sub-library. They lack any notion
of discretization.

I'm still convinced discretizations are useful for a time series
library. This whole discussion should be cleaned up and put in the docs.

<snip>

>>> Some specific signal-processing usability concerns:
>>> - For many signal processing tasks, the time series used is too large
>>> to fit in memory. The solution is usually to use a circular buffer or
>>> similar structure to keep around just the part you need at the moment.
>>> The Boost.TimeSeries series types seem unable to accommodate this
>>> mode of operation.
>>
>> Not "unable to accommodate" -- making a circular buffer model the time
>> series concept would be fairly straightforward, and then all the
>> existing algorithms would work for it. But no, there is no such type in
>> the library at present.
>
> I'm glad to hear that this would be straightforward to do, and I think
> it's a must-have for signal processing folks.

If I read into your request a bit, I think a series implemented as a
circular buffer isn't actually the right approach for you. Rather, I
envision a single-pass series type -- one that spits out data points and
offsets as they are pulled from some stream. The circular buffer
rightfully belongs in a time series algorithm, because only the
algorithm knows how much history it needs to do its job. This is how the
rolling average algorithm I wrote works. After all, the series may be a
terabyte, but only N samples are needed at any time to compute the
rolling average.
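Roughly, the shape of it is something like the following. This is just
a sketch of the idea, not the code I posted -- in particular, the use
of boost::circular_buffer and the push() interface here are
illustrative assumptions:

    #include <cstddef>
    #include <boost/circular_buffer.hpp>

    // Sketch of a rolling average over a single-pass stream of samples:
    // only the last 'window' values are retained, in a circular buffer.
    class rolling_average
    {
    public:
        explicit rolling_average(std::size_t window)
          : buf_(window), sum_(0.0)
        {}

        // feed the next sample; returns the average over the current window
        double push(double sample)
        {
            if (buf_.full())
                sum_ -= buf_.front();   // the oldest value drops out of the window
            buf_.push_back(sample);
            sum_ += sample;
            return sum_ / buf_.size();
        }

    private:
        boost::circular_buffer<double> buf_;
        double sum_;
    };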

I think the circular-buffer-in-an-algorithm approach is very important.
I could imagine generalizing the rolling average algorithm so that it's
useful for all kinds of computation.

>>> - It might be instructive to both the Boost.TimeSeries developers and
>>> some of its potential users if certain common signal-processing
>>> algorithms were implemented with the library, even if just in the
>>> documentation. For example, how might one implement a sliding-window
>>> normalizer over densely populated, millisecond resolution data? What
>>> if this normalization used more than two time series to do its work?
>>> It may well be possible with the current framework, but a) it's not
>>> really clear how to do it based on the documentation and b) the
>>> documentation almost seems to have a bias against that kind of
>>> processing.
>> I wonder why you say that. The library provides a 2-series transform()
>> algorithm that is for just this purpose.
>
> That's why I asked about "more than two time series". Such
> convolutions of multiple time series can be done in one pass, and
> Boost.TimeSeries does this admirably for N=2, but rewriting
> transform() for N>2 is a lot for most users to bite off.

Oh, yes. It gets pretty hairy. If we restrict ourselves to non-infinite
series (ones without pre- and post-runs) it is straightforward to
traverse N series in parallel. Even for infinite series it's doable with
a thin abstraction layer. The trouble comes when you want the extreme
performance that comes from taking advantage of algorithm specialization
for things like denseness or unit runs. Choosing an optimal parallel
series traversal becomes a combinatorial explosion. In these cases, I
think picking a traversal strategy that is merely Good instead of The
Best is probably the way forward.
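For finite series, the "merely Good" version is essentially an N-way
merge on offsets. A sketch, assuming each series is nothing more than a
sorted vector of (offset, value) pairs -- i.e. ignoring the library's
actual storage types and any of the specializations mentioned above:

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <utility>
    #include <vector>

    // One series = a sorted sequence of (offset, value) pairs.
    typedef std::vector<std::pair<int, double> > series_t;

    // Visit every offset at which at least one of the N series has a
    // sample. The visitor would typically look up the co-located values.
    template <class Visitor>
    void for_each_parallel(std::vector<series_t> const & all, Visitor visit)
    {
        std::vector<std::size_t> pos(all.size(), 0);
        for (;;)
        {
            // smallest offset not yet consumed in any series
            int next = std::numeric_limits<int>::max();
            for (std::size_t i = 0; i != all.size(); ++i)
                if (pos[i] != all[i].size())
                    next = std::min(next, all[i][pos[i]].first);
            if (next == std::numeric_limits<int>::max())
                break;   // all series exhausted

            visit(next);

            // advance every cursor sitting at this offset
            for (std::size_t i = 0; i != all.size(); ++i)
                if (pos[i] != all[i].size() && all[i][pos[i]].first == next)
                    ++pos[i];
        }
    }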

>> As for the rolling window calculations, I have code that does that, and
>> sent it around on this list just a few weeks ago. I hope to add the
>> rolling average algorithm soon. It uses a circular buffer, and would
>> make a good example for the docs.
>
> I agree. This would be a great addition to the docs.

Not just for the docs. It should be a reusable algorithm in the library.

>>> As it stands, no. If there were clearly-defined relationships between
>>> samples and their extents and offsets; better support for large and/or
>>> piecewise-mutable time series; a rolling-window algorithm; and better
>>> customizability of coarse_grain() and fine_grain(), I would probably
>>> change my vote.
>>
>> I'm still not clear on what you mean by "clearly-defined relationships
>> between samples and their extents and offsets." The rest is all fair.
>> Rolling-window is already implemented, but not yet included.
>
> I was alluding to my issue with the relationships among
> discretization, offset, and run that I mentioned earlier.

Is it any clearer now?

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com
