|
Boost : |
From: Eric Niebler (eric_at_[hidden])
Date: 2007-08-08 21:02:59
Thanks for taking the time to review the library.
Zach Laine wrote:
> I want to say at the outset that I work with a lot of people that do
> signal processing, and so that is the main use case my colleagues and
> I would have for using Boost.TimeSeries. That has definitely colored
> my view of Boost.TimeSeries, for what that's worth.
>
> * What is your evaluation of the design?
>
> Why is a series with only one nonzero element called a delta series?
> I would expect a delta series to be the elementwise differences
> between two series or some such. In fact, a little Googling revealed
> many references to delta series that were the differences of two
> series, but none like delta_series<>. Is this just my ignorance of
> the financial uses of time series?
The name comes from Dirac delta function:
http://en.wikipedia.org/wiki/Dirac_delta_function
> I'd like to see an overload of the coarse_grain() function allowing
> the specification of a DownSampler. I have a need for coarse
> resampling that preserves the largest value in the original set of
> finer samples, and I'm sure many others would like to use their
> favorite convolution-of-oversamples technique to move from high to low
> sample frequency. In fact, the ability to specify the method used
> when moving up or down in sample frequency should be a first-class
> feature of the coarse_grain() and fine_grain() functions. The default
> behavior is a gross simplification for many users' purposes. Please
> see my note elsewhere on the lack of UpSampler concept docs, making
> writing a custom one more difficult than it needs to be.
Agreed, there should be a general and extensible interface for up- and
down-sampling.
> The TimeSeries concept docs state that range_run_storage::pre_value( s
> ) returns "The value associated with the pre_run, if there is one."
> Does it return TimeSeries< S >::zero_type() otherwise? Is there any
> way for the user to tell the difference between a return value that
> happens to be equal to zero_type() and one that was never set? These
> same questions apply to post_value.
From the InfiniteRangeRunStorage concept requirements about
pre_value(), the docs say, "Returns the value of the pre-run. If the
pre-run is empty, the return value of pre_value() is undefined."
http://boost-sandbox.sourceforge.net/libs/time_series/doc/html/InfiniteRangeRunStorage.html
Basically, you call pre_run() to see if the pre-run is empty. If so, you
don't call pre_value().
> I'm not really sure why dense_series<> does not allow floating point
> offsets. I understand that it is supposed to resemble std::vector.
> However, there is implicitly a relationship between the indices of the
> underlying vector and the times represented by the indices. It is
> possible (and quite convenient) for dense_series<> to hold a
> discretization_type that represents the start offset, and an interval,
> and perform the mapping for me. The lack of this mapping means that
> for dense time series of arbitrary discretization (say 0.5), I have to
> multiply all my times by 0.5 in user code when using the time series.
> I feel the library should take care of this for me; the fact that the
> underlying storage is vector-like should not force me to treat
> dense_series<> discretizations differently from all other series'
> discretizations.
How does it force you to treat discretizations differently? Whether your
discretization is 1 or 0.5 or whether your underlying storage is dense
or sparse, it doesn't affect how you index into the series, does it? I'm
afraid I've missed your point.
As to whether a dense series can be indexed with a floating point value,
and whether some on-the-fly time-dilating transformation can be applied
to the index, those are all very interesting questions. There is a
scaled view, a clipped view and a shifted view, but no view which
modified indices by a multiplicative factor.
> I find it a little confusing that discretization is both a template
> parameter and a ctor parameter. It would perhaps be a bit more clear
> if template-parameter Discretization were DiscretizationType or
> something.
OK.
> In general, I think that this library would be far more
> usable if there were more clarification of the relationships among
> discretization, offsets, and runs/intervals. What I'm getting at here
> is that the current structure makes it really unclear what
> DiscretizationType=double, discretization=2.1, offset=3.2 means for a
> run. Where is the run? At 3.2, or at 2.1 * 3.2? The fact that I
> have to think about this is probably simply a failure of
> documentation.
Possibly. Documentation always seems to be the hardest part.
> Nonetheless, it would be best if it were possible to
> specify that a sample exists at offset X, where X is double, int, or
> units::seconds, without worrying about any other details, including
> discretization. That is, discretization seems useful to me only for
> regularly-spaced time series, and seems like noise for
> arbitrarily-spaced time series.
Discretizations are useful for coarse- and fine-graining operations that
resample the data at different intervals. This can be useful even for
time series that are initially arbitrarily-spaced.
Sometimes you don't care to resampmle your data at a different
discretization, or call the integrate() algorithm. In those cases, the
discretization parameter can be completely ignored. It does tend to
clutter up the docs, but no more than, say, the allocator parameter
clutters up std::vector's docs.
> In addition, a sample should be
> representable as a point like 3.14 or a run like [3.14, 4.2).
A zero-width point, like [3.14, 3.14)? What that would mean in the
context of the time_series library is admittedly still an outstanding
design issue.
> * What is your evaluation of the implementation?
>
> I didn't look too closely at the implementation, due to lack of time,
> but I would like to know why the functors in boost::range_run_storage
> are const references to singletons in an anonymous namespace. What is
> that supposed to accomplish that other techniques do not? One of the
> great things about boost is learning new advanced techniques and
> discovering best practices for sticky coding situations. Without the
> why, this technique is lost on me, and probably others. Note that I'm
> not raising any issue with this implementation choice -- I'm just
> looking to understand why this was done.
It's to avoid violations of the one-definition rule. If they were just
global constants, then each translation unit would have their own copy
of the globals (try it and see), and templates that refer to them would
end up with different references when instantiated in different
translation units. It's a nit-picky and rather unimportant
implementation detail, but probably worth a mention in the (not there)
rationale section.
> * What is your evaluation of the documentation?
>
> The docs could use a gentler intro. What I mean by this is that the
> docs appear to be somewhat inverted. Moving from foundations to
> things that build upon those foundations, and from the general to the
> specific would be better. For example, delta_series<> has the
> following description on the first page of the user manual: "A series
> which has exactly one run of unit length." I have no precise idea at
> this point what a run is, and moreover following the link gives a much
> more comprehensible definition in the reference section:
> "boost::time_series::delta_series -- A Mutable_TimeSeries that has a
> distinct value at some unique offset, and zero elsewhere". After
> reading the Range-Run Abstraction section, I think it would be
> appropriate to place that before mentions of runs, specific time
> series models, etc. It was only on the second pass that I fully
> understood the material presented up front in Series Containers.
This is very good feedback, thanks.
> In general, the documentation is skewed towards financial use, but
> doesn't need to be -- the library is useful for other purposes as
> well. For instance, when the predefined resolution types are
> presented, it seems that these are somehow necessary, or that the lack
> of a "seconds" or "milliseconds" resolution typedef might be a
> concern. Further investigation indicates that in fact the
> Discretization can be virtually any numeric type (is that correct?).
Correct, and the consensus is that these should simply be removed from
the library as not useful and actually misleading.
> Placing these at the beginning of the documentation made several of my
> signal processing colleagues think this library would only be good for
> financial users. I suggest a more general emphasis, and a specific
> subsection that says something like "this was originally written for
> financial use, and here are some features that facilitate such use
> ...". For example, here is what I wrote on my first pass through the
> docs; I realize what the situation is now, but here is how the docs
> led me astray a bit:
> "Discretization is only done on days on up. What about seconds,
> minutes, hours, milliseconds (the one I need) discretization? Are
> series with different discretizations usable together without explicit
> user conversion? Also, if mpl::int_<360> is used as a conversion
> factor for yearly, it's wrong for many users' purposes. If it's an
> arbitrary number, how about using mpl::int_<0>, mpl::int_<1>, etc.,
> and adding some other time increments?"
Another suggestion has been to make boost.units usable as the
discretization parameter, and I think that's a promising direction.
> In period_sums() description in the manual section, "summed" is
> misspelled "summer".
Thanks.
> I wanted to figure out whether I could supply my own sampler functor
> to fine_grain(), but it is unclear from the documentation what an
> UpSampler template parameter is supposed to look like. Looking at the
> fine_grain() reference is no help, due to lack of concept
> documentation. Following the samplers::sampler_base<> link was no
> help either, since it appears not to be documented other than listing
> its API. I finally found out what I need to do to write a sampler
> only by looking at the piecewise_upsample implementation.
> The coarse_grain() function says that it requires the coarser grain to
> be a multiple of the finer grain. I assume that should read "integer
> multiple"; is that right?
Right. I could provide more general up-sample and down-sample interfaces.
> After finding the above doc issues, I did a quick survey of the
> functions in the manual's Algorithms section. Most of them appear to
> have complete concept docs, but there are some deficiencies:
> Partial concept docs:
> integrate (missing Series requirements)
> Low-to-no concept docs:
> coarse_grain (missing all requirements except that the discretizations
> need to be integer multiples of one another, but doesn't use the word
> integer)
> fine_grain (missing all requirements except that the discretizations
> need to be integer multiples of one another, but doesn't use the word
> integer)
> invert_heaviside (no requirements at all)
Noted.
> The rest of the algorithm detailed docs have concept requirements, but
> it would be much easier to use them if the concepts were links to the
> relevant concept docs; as it is now, I have to do some bit of
> searching to find each one listed. This applies generally to all
> references to concepts throughout the docs -- even in the concepts
> docs, I find names of concepts that I must then look up by going back
> to the TOC, since they are not links.
Yeah, it's a limitation of our BoostBook took chain. Doxygen actually
emits this documentation with cross-links, but out doxygen2boostbook XSL
transform actually ignores them. Very frustrating.
> The fact that "The only series type that does not support
> floating-point offsets is dense_series<>" should not just be mentioned
> in the manual, but also in the Discretization requirements in the
> dense_series<> reference docs.
OK.
> There appears not to be a rationale section. Even listing the usage
> cases alluded to in the Acknowledgements section may help illuminate
> the choices that were made. Specific things I'd like to see in a
> rationale:
> - Why was commit() chosen, instead of simpler self-contained insertions?
I just answered this one in response to a msg by Steven Watanabe.
> - Why were ordered_inserters chosen instead of more STL-like insertion
> via iterators and member functions?
Also answered in that msg.
> - Why do ordered_inserters wipe out the previous contents of a time
> series, instead of appending/inserting?
Because I picked a bad name. Should be called ordered_builder or
ordered_assign, or something.
> - Why can't elements and/or ranges of elements be removed from or
> added to an existing time series?
They can. See range_run_storage::set_at(). To remove the elements, you
can set_at() with zeros. I should mention this in the user docs though,
because you're not the only person who thought they couldn't do it.
> - Why is the TimeSeries concept interface split between
> boost::sequence namespace and boost::range_run_storage namespace
> accessors?
Hmm. The idea is that the Sequence concepts can be reused -- there's
nothing Time_series specific to it. Ditto for RangeRunStorage. Other
libraries can use them and not have to drag in a bunch of time
series-specific cruft. That's why they're in their own directories, and
in their own namespaces.
> * What is your evaluation of the potential usefulness of the library?
>
> I think it is potentially quite useful. However, I think its
> usefulness is not primarily as a financial time series library, but as
> I mentioned earlier, its current docs make it sound as if it is mainly
> only useful for that. In addition, I am forced to ask how a time
> series library is more useful for signal processing than a std::vector
> and an extrinsic discretization value.
It's for when you want to many options for the in-memory representation
of a series, and efficient and reusable algorithms that work equally
well on all those different representations.
> The answer I came up with is
> that Boost.TimeSeries is really only advantageous when you have
> arbitrary spacing between elements, or when you want to use two
> representations of time series in an algorithm. That is, using
> Boost.TimeSeries' two-series for_each() is almost certainly better
> than a custom -- and probably complicated -- loop everywhere I need to
> operate on two time series. However, these cases are relatively rare
> in signal processing; it is much more common to simply loop over all
> the samples and do some operation on each element. This can be
> accomplished just as well with std::for_each or std::transform.
If std::vector and std::for_each meet your needs, then yes I agree
Time_series is overkill for you. That's not the case for everyone.
> The
> question then becomes, "Does using Boost.TimeSeries introduce
> clarifying abstractions, or conceptual noise?". The concensus among
> my colleagues is that the latter is the case.
Sorry you feel that way.
> Some specific signal-processing usability concerns:
> - For many signal processing tasks, the time series used is too large
> to fit in memory. The solution is usually to use a circular buffer or
> similar structure to keep around just the part you need at the moment.
> The Boost.TimeSeries series types seem unable to accommodate this
> mode of operation.
Not "unable to accommodate" -- making a circular buffer model the time
series concept would be fairly straightforward, and then all the
existing algorithms would work for it. But no, there is no such type in
the library at present.
> - Two of the most potentially useful bits of Boost.TimeSeries for
> certain kinds of signal processing are the coarse_grain() and
> fine_grain(). These do not allow in the case of coarse grain, and
> make difficult in the case of fine grain, the use of an arbitrary
> functor to do downsampling/upsampling.
True.
> - It might be instructive to both the Boost.TimeSeries developers and
> some of its potential users if certain common signal-processing
> algorithms were implemented with the library, even if just in the
> documentation. For example, how might one implement a sliding-window
> normalizer over densely populated, millisecond resolution data? What
> if this normalization used more than two time series to do it's work?
> It may well be possible with the current framework, but a) it's not
> really clear how to do it based on the documentation and b) the
> documenation almost seems to have a bias against that kind of
> processing.
I wonder why you say that. The library provides a 2-series transform()
algorithm that is for just this purpose.
As for the rolling window calculations, I have code that does that, and
sent it around on this list just a few weeks ago. I hope to add the
rolling average algorithm soon. It uses a circular buffer, and would
make a good example for the docs.
> If problems like that were looked at by the developers, it may well
> inform their design by helping them finding new generalizations and so
> forth. Likewise, if methods for using Boost.TimeSeries to do this kind
> of work were present in the documentation, the library would have
> immediate appeal to a whole new range of folks in the scientific
> computing field.
>
> * Did you try to use the library? With what compiler? Did you
> have any problems?
>
> No.
>
> * How much effort did you put into your evaluation? A glance? A
> quick reading? In-depth study?
>
> In-depth reading of the docs, API, and some of the implementation. I
> ran out of time to evaluate much code before the review ended.
>
> * Are you knowledgeable about the problem domain?
>
> Somewhat. I work at a place that does a lot of signal processing
> using time series, and though I work with these time series from time
> to time, it's not my core area.
>
> * Do you think the library should be accepted as a Boost library?
>
> As it stands, no. If there were clearly-defined relationships between
> samples and their extents and offsets; better support for large and/or
> piecewise-mutable time series; a rolling-window algorithm; and better
> customizability of coarse_grain() and fine_grain(), I would probably
> change my vote.
I'm still not clear on what you mean by "clearly-defined relationships
between samples and their extents and offsets." The rest is all fair.
Rolling-window is already implemented, but not yet included.
Thanks for taking the time.
-- Eric Niebler Boost Consulting www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk