Subject: Re: [boost] [Itl] Review
From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2010-02-23 19:52:31
Joachim Faulhaber wrote:
> Hi Phil,
> 2010/2/23 Phil Endecott <spam_from_boost_dev_at_[hidden]>:
>> My review of the proposed Interval Template Library follows. This
>> is based only on studying the documentation, and previous discussion
>> of the library on the Boost mailing list (e.g.
>> http://lists.boost.org/Archives/boost/2009/03/149396.php ).
>> Joachim has clearly put a lot of work into what is a substantial and
>> well-documented library. Unfortunately, it seems to me that it is
>> built on a weak foundation.
>> One of the most well-known problems involving intervals is to
>> determine which of a given set of intervals contains a particular
>> point. How to store a set of intervals in such a way as to make
>> this problem efficient has been extensively studied. If an
>> "interval tree" (See e.g. Wikipedia:
>> http://en.wikipedia.org/wiki/Interval_tree) is used, then the
>> problem can be answered in logarithmic time and linear space. (More
>> precisely: if there are N intervals in the container and M overlap
>> the point of interest, finding them should take O(M + log N) time
>> and the data structure should take O(N) space.)
>> I was hoping to find that this library was an implementation of an
>> interval tree. But it is not: it is built on top of normal
>> standard-library containers. The implementation used can find the
>> intervals that overlap a point in O(M + log N) time, but to do so it
>> needs worst-case O(N^2) space and I think it needs O(N^2 log N) time
>> to build that structure.
> If you insert N intervals in an interval_map the number of intervals
> (nodes) in the map is limited to 2*N-1. So space is O(N). A single
> addition of an interval into an interval_map has worst case time
> performance of O(N). So the worst case time for the addition of N
> interval value pairs is O(N^2). Which is not as bad as you write.
Let us say that the interval map is a map from interval to the set of ids of the original input intervals, perhaps their indices in an array. In the worst case every interval overlaps every other interval and all end points are unique. The set stored for each of the O(n) elements in the map then holds O(n) ids, so the data structure is O(n^2) in total size. Because each set of ids is O(n), the insertion of each id is O(log n), and the total runtime complexity to build the data structure is O(n^2 log n). The interval map could then be used to look up the set of intervals that intersect a specific value in O(m + log n) time. I think this is the specific usage of the interval map that Phil had in mind.

You can perform set operations on interval maps, but even then I wouldn't want to use the tree-insertion-based algorithm if the two maps were of similar size, since the scanning algorithm is O(n). Nor would I store interval maps as trees if I wanted to do interval set operations on them, since vector storage is much more compact and the O(n) algorithm runs much faster on it.
There is a similar problem with polygon sets as with interval sets, in that people generally want to construct them one polygon at a time, but want optimal runtime complexity for producing the final merged set. I defer the merge until the user requests it implicitly by calling an API that requires merged data. You could do the same thing with an interval set by storing a vector of intervals and then sorting and scanning it. An O(log n) lookup is also faster in a sorted vector than in a tree. The only use case for the tree-based interval map or set is when the user wants to interleave queries with insertion and removal. That use case has its place, but it shouldn't come at the expense of efficiency of other use cases. The most common use case is an interval map used as a dictionary that is rarely if ever changed after it is initially constructed. The user needs to understand their use case and have the option of choosing the appropriate data structures and algorithms for their needs.
Now I hope we are all on the same page about these issues and understand that they are solvable in the long term.