Boost :

Date view	Thread view	Subject view	Author view

From: Eric Niebler (eric_at_[hidden])
Date: 2007-08-07 14:06:00

Next message: Greer, Joe: "Re: [boost] Development and Release Practices"
Previous message: Paul A Bristow: "Re: [boost] Development and Release Practices"
In reply to: Stjepan Rajko: "Re: [boost] Time Series review - 7/30-8/8"
Next in thread: Eric Niebler: "Re: [boost] Time Series review - 7/30-8/8"
Reply: Eric Niebler: "Re: [boost] Time Series review - 7/30-8/8"
Reply: Steven Watanabe: "Re: [boost] Time Series review - 7/30-8/8"
Reply: Matthias Troyer: "Re: [boost] Time Series review - 7/30-8/8"

Stjepan Rajko wrote:
> On 7/31/07, John Phillips <phillips_at_[hidden]> wrote:
>> My apologies for the delay in this posting, but the review period for
>> the Time Series library submitted by Eric Neibler runs from Monday, July
>> 30 until Wednesday, August 8. From the documentation:
>>
>
> I have started to review the library, and at this point I am confused
> :-) So I thought I would share my thoughts in case someone can
> de-confuse me and I can produce a more useful review.
>
> To preface, I would be *very* interested in having a time series
> library in Boost. I use time series frequently in my work, although I
> wouldn't call myself an expert.
>
> That said, it took me a while to understand the language of the
> documentation. The whole "runs, discretization, coarseness" business
> didn't readily make sense to me. The lightbulb finally went off when
> I saw:
>
> "The integral of a series is calculated by multiplying the value of
> each run in the series by the run's length, summing all the results,
> and multiplying the sum by the series' discretization."
>
> Aha! Now I knew exactly what run and discretization meant - it is as
> if you are modeling a piecewise constant function with something like
> a scalable "x" axis.
>
> I went on to experiments with some dense, sparse, and piecewise
> constant series to see how they behave. The first thing that struck
> me as odd was:
>
> dense_series<double, daily> dense(start = 0, stop = 2);
> sparse_series< double, daily, double > sparse;
>
> sparse = dense; // Compilation error! (GCC 4.0.1/darwin)

When assigning one series type to another, or using two series in
arithmetic expressions, the offset types must be compatible. For the
above, the offset type (not specified) is "std::ptrdiff_t". For the
sparse series, it is a double.

It wasn't clear to me what the right behavior should be, so to be
conservative, I disallowed mixing series with different offset types.

> However, if I change the discretization type to int, all is good. Is
> there a reason?
>
> Before assigning to sparse, I populated the dense series with:
>
> ordered_inserter< dense_series< double > > in_d( dense );
> in_d(10,0)(11,1)(12,2).commit();
>
> After assigning to sparse, I was a little surprised at what sparse was
> spitting back out via the [] operator. Basically, sparse[0] == 10,
> sparse[1] == 11, sparse[2] == 12, and seemingly 0 everywhere else, but
> the docs say:
>
> sparse_series<>
> A sparse series where runs have unit length.
>
> So how come I get sparse[0.1] == sparse[-0.1] == 0? Shouldn't there
> be a unit length interval around 0 where sparse is 10? I think you
> may have commented on this recently, but I found it rather unexpected,
> given what the docs say (otherwise, perfectly good behavior).

The docs are admittedly unclear on this point, and part of the reason
for the confusion is that time series with floating-point offsets were
added after this part of the docs were written.

Sparse and delta are more properly thought of as "point" data
structures, where the coordinate of each datum can be represented with a
single offset. So a sparse series with a 42 at 3.14 has a 42 *exactly*
there and nowhere else. Does that help?

> I moved on to:
>
> piecewise_constant_series< double, int, double > pc;
> pc = dense;

If that compiles, it's a bug. The offset types should be required to match.

> And now more weirdness... pc[x] == 10 for x \in [0, 1] (inclusive)!!!
> i.e., pc[0] == pc[1] == 10, but pc[1.01] == 11.

Yuk, that's awful. It seems this is a good reason to disallow mixing
series with floating-point and integral offsets in the same expression.
See below for more on this.

> The first reason why this strikes me as odd is, if I start with a
> discrete time series (as dense_series models very nicely), and put it
> into something that expands it to a piecewise constant function, the
> choice of translating "value 10 at time 0" to "value 10 at interval
> [0, 1]" is not obvious... if I wanted approximation, I would be more
> likely to assign to a particular pc[x] the value of the closest point
> in dense - i.e. I'd be more likely to set pc[0.9]==11 because 0.9 is
> closer to 1 than to 0. That strategy would make it difficult to deal
> with the infinite pre-run and post-run, though, and you could argue
> that assignment is not about this kind of approximation but about
> something else - so you may want to specify in the docs what the
> semantics of conversion through assignment is.
>
> The second reason why I find this odd is, I would at least expect that
> for an explicitly specified point such as dense[1], pc[1] would have
> the same value after assignment.
>
> Anyway, at this point I stopped experimenting as it seemed to me that
> there was something fundamental that I wasn't getting.

I'm sorry you ran into trouble with floating-point offsets. To be
honest, they were added to the time_series library as a bit of an
after-thought, and they are admittedly not a seamless addition. With
integral offsets, runs are in fact half-open. Floating-point runs are
not -- it's not clear what that would mean. (E.g., a delta series D has
a 42 at [3.14,3.14) -- that's a zero-width half open range! What should
D[3.14] return?) Some of the algorithms won't work with floating-point
offsets. If you feel this part of the library needs more thought, that's
a fair assessment. I'm certain that had you used integral offsets your
experience would have been less frustrating.

I think with better docs and more strict type checking, integral and
floating point offsets can both be supported without confusion.

>>From reading the docs, I do have some other comments:
>
> "It may seem odd that even the unit series have a value parameter.
> This is because 1 is not always convertible to the series' value type.
> Consider a delta_unit_series< std::vector< int > >. You may decide
> that for this series, the unit value should be std::vector< int >(3,
> 1). For unit series with scalar value types such as int, double and
> std::complex<>, the value parameter is ignored."
>
> Question: if I was dealing with something that needs to have the "1"
> specified, couldn't I just use delta_series instead of
> delta_unit_series?

You certainly could, but there are extra optimization opportunities if
all runs are known at compile time to have a "unit" value. The library
may short-cut multiplication, for instance, if it knows that one operand
will always be 1.

> About discretization: your prenamed discretizations start with with
> daily as mpl::int<1>. That's great if that's how the integrals need
> to be calculated. However, in a lot of cases one would want to have
> "secondly" as mpl::int<1>. Could you provide another set of prenamed
> discretizations that provide that option? Although, if someone used
> both of them in the same app, then compile-time checking would
> consider them equivalent, no? Could the units lib be used for this
> instead?

"Daily" is its own type, not a typedef for mpl::int_<1>.

struct daily : mpl::int_<1> {};

So if your your application you define your own discretizations like:

struct secondly : mpl::int_<1> {};

compile-time time checking would still consider them different. It's the
type that matters for the compile-time checks, not the value. The value
is only used by the integrate() algorithm.

I really don't want to add an alternate set of named discretizations.
One could argue that I shouldn't even be providing any. I haven't looked
into using the units lib for this, but it seems like a reasonable idea.

> About the docs: time series can be expressed very clearly visually -
> I think that the docs would benefit greatly from some visual examples.

True.

> I guess that's it for now. Sorry about my confusion :-) I can try to
> give a little more feedback if I start to understand the fundamentals
> of the library better. Back to bed for me... I actually tried to go
> to bed earlier but couldn't sleep because I was thinking about this
> lib. So I woke up to write this. Weird.

Sorry to keep you up! ;-)

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

Next message: Greer, Joe: "Re: [boost] Development and Release Practices"
Previous message: Paul A Bristow: "Re: [boost] Development and Release Practices"
In reply to: Stjepan Rajko: "Re: [boost] Time Series review - 7/30-8/8"
Next in thread: Eric Niebler: "Re: [boost] Time Series review - 7/30-8/8"
Reply: Eric Niebler: "Re: [boost] Time Series review - 7/30-8/8"
Reply: Steven Watanabe: "Re: [boost] Time Series review - 7/30-8/8"
Reply: Matthias Troyer: "Re: [boost] Time Series review - 7/30-8/8"

Date view	Thread view	Subject view	Author view

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk