From: Darryl Green (darryl.green_at_[hidden])
Date: 2006-12-14 05:36:35


I tried to send this yesterday, but there was some problem with my mailer.
I see the idea of nulls has been discussed a bit by Matt Hurd and James
Jones.

Eric Niebler wrote:
> Hi, Matt.
>
> Matt Hurd wrote:
>
>> On 12/12/06, Jeff Garland <jeff_at_[hidden]> wrote:
>>
>>> Eric Niebler wrote:
>>>
>>>> I'm pleased to announce the availability of a new library for
>>>> computing with time series (http://en.wikipedia.org/wiki/Time_series).
>>>> From the documentation:
>>
>>
>> Looks very nicely thought out.
>>
>> I'm continually re-inventing series containers and would like to
>> stop ;-)
>>
>> I can see how the discretization type could help many types of
>> applications though it doesn't suit most of the styles I'm used to
>> dealing with.
>>
[snip]

>
>> it is obviously not limited to this. Holidays in different markets
>> come into play in the sequences even though the discretization would
>> be the same. To solve this I like the concept of "clocks" from
>> intensional programming. Basically, if two series use the same clock
>> then indexed offsets into the sequence make sense, otherwise a
>> matching procedure has to be used of which the most typical is:
>> matched time = most recent time <= reference time
>> Some input clock has to be the reference time, which is also used for
>> the output. It is not the only way; for example, sometimes only
>> what I call correlated matching makes sense, that is, the time exists
>> in both (or all if there are more than two) inputs.
>
>
>
>
> So in this case, a "clock" is a discretization *and* a set of
"holiday" offsets for which there is no data?
>
>
>
>> This way you get the benefit of direct sequencing when clocks are the
>> same and fast lookups when they are not. Fast lookups are based on
>> the discretization. Like looking up a name in the phone book, if it
>> is a V you go near the back. Calculate the density (num points/
>> period) and make an educated guess as to the location and binary
>> search from there. This scheme mixes microsecond data and annual data
>> quite freely.
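
The lookup Matt describes sounds like a plain interpolation search to me:
guess a position from the average density, then binary search near the
guess. Something like the following is what I picture - a rough sketch with
my own names, not anything from Matt or from the library:

#include <algorithm>
#include <cstddef>
#include <vector>

// Guess where 't' should sit in a sorted vector of times by assuming the
// samples are roughly evenly spread, then binary-search a window around
// the guess. Falls back to the whole range if the guess is badly off.
std::size_t locate(std::vector<double> const& times, double t,
                   std::size_t window = 64)
{
    if (times.empty() || t <= times.front()) return 0;
    if (t >= times.back()) return times.size() - 1;

    double const span = times.back() - times.front();
    std::size_t const guess = static_cast<std::size_t>(
        (t - times.front()) / span * (times.size() - 1));

    std::size_t lo = guess > window ? guess - window : 0;
    std::size_t hi = std::min(guess + window, times.size() - 1);

    // Widen to the full range if the target is outside the guessed bracket.
    if (t < times[lo] || t > times[hi]) { lo = 0; hi = times.size() - 1; }

    return std::lower_bound(times.begin() + lo, times.begin() + hi + 1, t)
           - times.begin();
}

The window size is an arbitrary guess on my part; the point is just that a
decent density estimate keeps the search cheap even when microsecond and
annual data are mixed.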
>
>
>
>
> I'm having a hard time trying to imagine what an interface that uses
> clocks instead of discretizations would look like. Could you mock up an
> example (pseudo-code is fine)? English is a poor substitute for code.
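
I'll hazard a partial answer, since I've wanted something similar: the
matching step at least is just "most recent time <= reference time" over a
sorted vector of times, i.e. one upper_bound call. A rough sketch only -
the names are mine, not a proposed interface:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical "clock": a sorted sequence of sample times.
typedef std::vector<double> clock_type;

// Index of the most recent sample time <= 'reference', or -1 if every
// sample in the clock is later than the reference.
std::ptrdiff_t match_index(clock_type const& clk, double reference)
{
    // upper_bound gives the first time strictly greater than 'reference';
    // the element just before it is the most recent time <= reference.
    clock_type::const_iterator it =
        std::upper_bound(clk.begin(), clk.end(), reference);
    if (it == clk.begin())
        return -1;
    return (it - clk.begin()) - 1;
}

When two series share a clock you skip this entirely and reuse the same
index; the search is only needed when the clocks differ.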

I'm not sure what this application domain calls for in these
circumstances, but surely this is non-uniform sampling, which is a
fairly different beast algorithmically compared to uniform sampling. I
can see that in some cases the missing "holidays" should just be ignored
(those instants don't really exist) but in others (especially if
considering international markets) I would imagine "the market never
sleeps" and events occurring at any time could be significant.

Would an efficient (time, value) pair -> uniform sampling conversion
(implemented as an adapter) be enough? Or would it need to preserve the
discrete sample/time information somehow? In the latter case it would
seem that a full non-uniform sampling model is needed. Would a
unique representation for a "null" (missing) sample help?
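
To make that concrete, the sort of adapter I have in mind would look
roughly like this (purely my own sketch, using NaN as the "null" marker -
nothing from the library):

#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Resample a time-sorted (time, value) sequence onto a uniform grid that
// starts at 'start', has spacing 'step' and 'n' slots. Grid slots with no
// sample are filled with quiet NaN as the "null" (missing) marker.
std::vector<double>
to_uniform(std::vector<std::pair<double, double> > const& samples,
           double start, double step, std::size_t n)
{
    std::vector<double> out(n, std::numeric_limits<double>::quiet_NaN());

    for (std::size_t i = 0; i < samples.size(); ++i)
    {
        // Map the sample time to the nearest grid slot.
        double const pos = (samples[i].first - start) / step + 0.5;
        if (pos < 0.0 || pos >= static_cast<double>(n))
            continue;                         // falls outside the grid
        out[static_cast<std::size_t>(pos)] = samples[i].second;
        // NB: if several samples land in the same slot, the last one wins.
    }
    return out;
}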

>
>> Algorithms may chew up degrees of freedom and shorten series, but the
>> clocks will remain the same. For example, a simple moving average
>> over 10 days will not be relevant on the first 9 points. You've
>> chewed up 9 points and your output may reflect this. This is just a
>> simple case. Windowing functions can chew up forward and backwards.
>> Some algorithms may have accuracy requirements that may have minimum
>> input requirements. A simple case is determining the number of points
>> you need to get a certain accuracy for an exponential moving average
>> which deals with a weighted sum of infinitely many points.

I understand this bit.
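
For the exponential moving average case the minimum-input question even has
a closed form: with smoothing factor alpha, the combined weight of all
samples older than N points is (1 - alpha)^N, so N >= log(eps) / log(1 - alpha)
for a tolerated truncation error eps. A tiny illustration (my own, not from
the library):

#include <cmath>

// Minimum number of points needed so that the total weight of the EMA
// terms we drop is at most 'tolerance'. With smoothing factor alpha, the
// combined weight of everything older than N points is (1 - alpha)^N.
inline int ema_points_needed(double alpha, double tolerance)
{
    return static_cast<int>(std::ceil(std::log(tolerance) /
                                      std::log(1.0 - alpha)));
}

// Example: alpha = 0.1 and tolerance = 0.01 (1%) gives about 44 points.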

>>
>> Where this puppy ends up being quite different is that you want times,
>> real times, associated with the series. The obvious thing to do is
>> tuple them, but this messes up passing blocks of data around
>> efficiently to things that only want to deal with sequences and don't
>> care about the time, but sometimes timed tuples make more efficient
>> sense. So you need flexible mappings and alternative types of
>> containers per situation.

I'm less clear on this, though it does sound interesting.
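
If I follow, this is the usual trade-off between keeping (time, value)
tuples together and keeping a plain value sequence with a parallel vector
of times. In rough C++ terms (names mine, not the library's):

#include <utility>
#include <vector>

// "Tupled" layout: the time travels with each value, which is convenient
// but means value-only algorithms have to step over the times.
typedef std::vector<std::pair<double, double> > timed_series;

// "Split" layout: a contiguous block of values that can be handed to
// sequence-only algorithms as-is, plus a parallel clock for the
// algorithms that care about real times.
struct split_series
{
    std::vector<double> values;
    std::vector<double> times;   // same length as 'values'
};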

>
> I think the current framework can handle this situation quite
> naturally. The offsets need not be integral multiples of the
> discretization. The offsets can be floating point, which can be used to
> represent exact times. Or as Jeff Garland suggested, the offset could in
> theory be a posix ptime (not tested). That way you wouldn't have to pass
> around a separate vector representing the times, or make your data tuple
> with times.
>

It was only when I read this that I took another look at the docs and
realized that your library is using a non-uniform sampling model. This
is interesting - but could it make some algorithms hard to write and/or
much less efficient? I have only a little experience with non-uniform
sampling, but I guess this comes back to just how broad a domain you are
trying to cover with this library. Is it intended to be usable where
performance is very important, or is it trading off performance for
generality of representation? Of course, if your data is really sparse
and inherently sampled in a non-uniform way, efficiency may well be
increased by non-uniform sampling approaches, not reduced. I'd be
interested in knowing just where you think the library's
strengths/applications are vs more general purpose maths libs. That's not
to say I don't think it has any :-) I'm just not sure of the intended
scope/emphasis. I guess the uniform sampling case is well catered for by
standard vector and matrix libs, and the non-uniform aspect is the main
distinguishing feature?

Regards
Darryl.

