
Boost : 
Subject: Re: [boost] [Itl] Review
From: Joachim Faulhaber (afojgo_at_[hidden])
Date: 20100305 19:11:54
Hi Phil,
thank you again for your review.
2010/2/23 Phil Endecott <spam_from_boost_dev_at_[hidden]>:
[...]
> Joachim has clearly put a lot of work into what is a substantial and
> welldocumented library. Unfortunately, it seems to me that it is built on
> a weak foundation.
After I have taken some more time to think about an interval_tree
> http://en.wikipedia.org/wiki/Interval_tree
implementation for my library, I arrived at a preliminary assessment:
(1) An interval_tree implementation is not suitable for most
instantiations of interval containers provided by the ITL.
(2) An interval_tree implementation is an important optimization for a
specific instantiation:
split_interval_map<D, Collection<T> >
(3) For most other instantiations of interval containers an
interval_tree implementation will lead to problems, that are not
encountered using the current implementation and will show even worse
performance in certain cases.
(4) So an interval_tree implementation is a desirable optimization for
a particular case, split interval map of collections, but probably not
suitable as a general "foundation" of the library.
(5) The foundation of the ITL is its formal specification on sets of
axioms that clarifies semantics, an abstract interface and a generic
design, that allows for a variety of implementations and
optimizations.
(6) Since an interval_tree implementation would rather be a partial
optimization than the correction of a fundamental flaw, I don't think
it is appropriate to bring the ITL library to a halt for this.
There are no data structures that are superior as such. Each has its
pros and its cons and the user has to take her decision of choosing
the appropriate one, dependent on the characteristics of her specific
problem domain and use case.
The current implementation of inserting interval value pairs into
interval_maps with computation of aggregates on overlap can be viewed
as performing an "eager evaluation". After addition of interval value
pairs, we have always two results available:
(1) The new interval borders that result in the interval_map from
addition of the interval and
(2) The aggregated associated values for the overlapping segments.
Using an interval_tree as implementation, both results are unknown
after insertions and only implicitly available. They can be computed
on demand. This is a "lazy evaluation" strategy and it's the basis of
the better space performance, when collections of objects are involved
that can not be aggregated in a compact way.
One problem here is, that the segmentation of the interval_map, (its
separation into elementary intervals that the aggregation results are
associated to), is only implicit in the data structure. Moreover the
interval_tree, in contrast to the "eager implementation" does not
"recognize", that neighbouring intervals can be joined, because of
equality of the aggregated values associated to the elementary
segments.
In the worst case it is possible that the interval_tree stores N
interval value pairs for an interval_map that could be compressed to
only one interval value pair, since all of the aggregates on overlap
led to identical associated value.
(1) This can happen, if you work with an interval_map<D, set<T> > and
T has a small finite number of values only. If you have an application
with large interval and heavy overlaps the O(N^2) space problem
occurs. But if, due to the limited base set (values of T) a lot of
neighbouring sets may be identical, the eager evaluating interval_map
can get pretty small while the lazy one grows on and on.
(2) Another use case: If your interval_map<D, int> tracks the maximum
of values that are a property of some objects that changes over time.
Let's say you have a party and you track the height of the tallest
guest, similar to the documented example:
http://www.joachimfaulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/examples/partys_tallest_guests.html
// I've heard Obama wants to introduce the metric system ;)
([19:0023:10], 1.65m) // First guest arriving in time, 19:00 is
// 1.95 meters tall.
([19:0022:00], 1.70m) // Second guest
... // May be some more under 2.13m at [19:00..)
([19:0024:00), 2.13m) // Now comes Dirk Nowitzki
... // N more guests that are smaller than Dirk,
// coming later and leaving earlier.
The eager evaluating interval_map grows to the point when Dirk comes.
Then it shrinks to size 1 and stays size 1, when smaller sized guests
enter the party later.
A lazy evaluating interval_map founded on interval_tree, would,
correct me if I'm wrong, grow on and on and would compute queries in
O(log(N)) time instead of O(c) time for the eager interval_map.
Most obvious is the case of an interval_set<T>. The previously
discussed "worst case" (large intervals, that are all overlapping) is
the best case for the eager evaluating implementation here. There will
always be only a single interval node in the implementing std::set. So
space and time of all operations are constant. An interval_tree would
store all N intervals.
May be there are nifty optimisations of the lazy data structure for
that. Yet, I have the impression that the interval_trees are surely
not the foundation for *every* use case within the scope of this
library.
Considering this, I think it would not be appropriate to use the lazy
evaluating interval_trees, as a *general* implementation for interval
maps. And also it would not be a strong foundation for *every* use
case. Rather it would be a more optimal implementation for an
important class of instantiations of split_interval_maps.
Personally I think, questions of implementation, although very
important, are not the foundation of a library. More important are
concepts, design and correctness. Those are the fields that I have
invested a lot in, and I am fairly sure, that based on this
foundation, a lot of useful and exciting optimizations can be and will
be done in the future evolution of the library.
I have particularly invested great effort in the investigation of the
semantics of interval sets and interval maps creating sets of axioms
that formally specify *what* the objects of my library will do and not
*how* they will do it. This is the part, that I feel is the foundation
of my library and I think it is a strong foundation. It provides not
only a formal specification of its behavior but, in the extended part,
a automated test tool to validate any given implementation against
this specification.
Thank you again for focusing on the important issues of complexity and
implementation.
Best regards,
Joachim
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk