|
Boost : |
Subject: Re: [boost] Formal Review: Boost.Polygon
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-09-06 13:11:31
I am pleased to be able to review this important library. Boost would definitely
benefit from the addition of a geometry library of the right sort.
I am certainly not a geometry expert, but I have written a number of
things in the
last few years that use geometry in GUIs and for geographic data. I
have written this
review as a potential user by evaluating how well it would work in a
couple of example
applications.
My experience over all has been a little disappointing. I think this
is mainly because
I had very high hopes for the library. The reality has not quite met
my expectations.
As I will explain in detail below, my complaints are mainly things like excessive
warnings and odd misfeatures in the interface. These are all things
that could be
fixed, and they perhaps only indicate that the library has arrived a little
prematurely. Based only on these issues my verdict would be that that
the library
could be accepted after some revisions. But we must also look at the
bigger picture,
i.e. the existence of other competing libraries "in the wings". I will
discuss this
after describing my practical evaluation.
Points starting ** in the following are issues that I think should be
fixed for the
library to be acceptable.
Application 1
-------------
I have an application that can display placename search result markers
on a world map
(see http://chezphil.org/tmp/six_thousand_lakes.jpeg). When no markers
are shown the
screen can be updated at about 20 frames per second; when many markers
are shown
the frame rate drops to the point where usability is impaired. In this
case the vast
majority of the markers overlap each other, and this could be exploited
to reduce the
amount of work that needs to be done by the graphics hardware and hence
boost the
frame rate.
The first step was to add the "magic" required by the library to
support my rectangle
type. The examples don't seem to include a custom rectangle, and the decomposition
of a rectangle into a pair of intervals rather than the more common
pair of points or
point and size, made this a bit tricky. More serious were the
following problems:
** The library does not seem to use any sort of explicit concept
checking; if types do
not satisfy their concepts random errors from deep inside the library
are emitted. The
verbosity and incomprehensibility of errors from complex C++ code is,
in my opinion,
the greatest single failing of the language today. In the face of
these ridiculous
messages, some users will simply abandon trying to use the library (or
the whole
language, and go back to C) or limit themselves to only what can be
done by copying
examples. Fixing this needs effort from the language designers,
compiler writers, and
libraries. I am not certain that concept checking can totally cure
this problem,
but it can only help.
** The library spews out enormous quantities of warnings (gcc 4.1.2), I
think mainly
related to unused parameters. This has to be fixed (and I don't
believe fixing it is
hard).
** The library failed to work with std::pair being used as the interval
type, despite
this being mentioned as a possibility in the documentation. The
problem seems to be
that the std namespace is brought into ADL and std::set is found rather
than the
library's own set function. I have not investigated how hard this
would be to fix;
instead I wrote my own pair template.
Having completed the "magic", I started to adapt my marker drawing
code. My first
attempt was a minimally-invasive version that would check if each
marker covered any
new area and not plot it otherwise:
if (!contains(markers,b)) {
markers |= b;
plot(b);
}
Unfortunately this failed since contains() does not seem to be defined
for a
polygon_90_set and a rectangle. This surprised me somewhat. A
valuable addition to
the documentation would be a matrix indicating which operations are
supported between
which types. Somebody will need every combination of operations and
types, and it
would be useful to see quickly which are provided, which are planned
for a future
version of the library, and which cannot be implemented due to some
restriction of the
library design.
I could have tried to implement contains(polygon_90_set,rectangle) in
terms of e.g.
intersection but I was concerned that that would have much worse
complexity than an
optimised contains() algorithm could have. Which raises another issue:
** Algorithms should indicate their complexity guarantees, and (where
different) their
expected complexity for common inputs.
So I moved on to a more invasive modification where I accumulate all of
the rectangles
and use get_rectangles() at the end to extract a minimal set of areas
to plot:
for (...) {
markers |= b;
}
get_rectangles(rects,markers);
foreach(rect in rects) {
plot(rect);
}
This works, and reduces the number of rectangles that are plotted from
about 6,000 to
about 900. Unfortunately get_rectangles() takes about 100 ms to run,
which is
slightly longer than plotting the redundant markers would have taken,
so overall the
frame rate is not improved.
In that code, I was surprised to see that get_rectangles takes an
output container by
reference. In this case, I have no need to allocate memory and store
these rectangles
at all. I would have prefered some way to get the rectangles as they
were generated
and pass them directly to my plot function, e.g.
for_each_rectangle(markers, &plot);
An output iterator would be a more conventional (but in this case more
verbose) way to
achieve this.
** Unless there is some good reason for using output reference
parameters, algorithms
should use output iterators.
I was also a little confused by the difference - if any - between
"markers |= b" and
"markers.insert(b)". I believe that I know what the former does, but I
worry that
unioning-in each rectangle in turn is inefficient and that it would be
better to
construct a set of rectangles and self-union in one go. Perhaps this
is what insert()
does - yet there is no self_union function, and in practice I seem to
get the same
results. Maybe get_rectangles() has done the self_union. Some
clarification of this
in the documentation would be useful.
Application 2
-------------
I have data for the borders of N. American states and provinces. My
original plan was
to try to reduce these state outline polygons to a set of polylines; my
viewer does
this as a pre-processing step to halve the work it does when plotting
these borders.
But the library really is limited to polygons (hence the name), with no
support for
even representing polylines let alone doing anything with them. It
would be useful for
the documentation to include a "roadmap" section indicating whether
this is being
considered for a future version. It might be worthwhile to define
polyline and other
concepts now, if they are relatively uncontroversial, to avoid multiple implementations
later.
So I decided instead to use this data to implement a "what state is
this point in?"
function. This boils down to reading in the state border polygons and
testing points
using contains(polygon,point).
One issue that I noted while reading in the polygons is that the
provided polygon_data
type is immutable: to read from a file, I need either an exotic
iterator that can be
passed to its constructor, or a temporary. I'm unsure why polygon_data
can't be more
like a mutable std::container so that I can simply read points from a
file and
push_back() in a loop. Trying to use a std::vector<point_data>
required more traits
specialisation that I expected; can't this be pre-defined?
** The polygon_data type should be mutable (i.e. support things like
push_back()), or a
mutable version should be provided, or std::vector/list<point_data>
should work with
minimal traits specialisation.
I also found that polygon_data is specialised by the coordinate type,
not the point
type, so anyone using their own point type will have to provide their
own polygon type.
Is there some reason why I can't have polygon_data<my_point>? Again,
making it easier
to use std::container<my_point> would help.
I also found little mention of winding direction in the documentation.
I haven't
checked whether the data I have winds consistently; what does
unknown_winding actually
do?
As I guessed, contains(polygon,point) is very slow if done naively: the library's
polygon concept (and the provided default implementation) have no way
to determine
whether a point lies within a polygon in better than O(N) time.
Testing first with the
bounding box makes it O(1) in the common case. So I tried to write a
polygon type that
would do this, something like this (but not exactly this!):
struct my_polygon: polygon_data {
rectangle bbox;
template <typename ITER>
my_polygon(ITER begin, ITER end):
polygon_data(begin,end),
bbox(extents(*this))
{}
};
bool contains(const my_polygon& poly, point_t pt) {
return contains(poly.bbox,pt) && contains(poly,pt);
}
Hmm, well that looks OK until the very last "contains"... no doubt
someone will
immediately spot the "right" way to do this. I ended up making
contains() a member of
my_polygon, and this worked reasonably well. Note that the extents()
function does not
work as written above: it takes a reference to a rectangle as an out parameter.
center() does something similar. This seems wrong to me; returning a
point or
rectangle would be more conventional I think.
** extents() and center() should return their results, rather than
using an output
reference.
An important question is what this function should do when given a
coordinate that lies
on the boundary between two (or more) states. One can argue for
various different
behaviours. My preference would be to never return "no state" and to
be deterministic,
but otherwise I don't care. The contains() function has a parameter "consider_touch"
that makes it possible to implement different policies; this seems
sufficient in most
cases, though I'm unsure how you would consider the left and bottom
included and the
right and top excluded in a 90-degree shape (which might be needed for
a GUI).
I have a fixed point type that I use for GPS coordinates (32 bits is
good for lat/lng
GPS positions, but 24 bits isn't) and using it with this library worked
well. The
Coordinate Concept page doesn't say much about the type requirements
("expected to be
integral") and spelling out in more detail what is needed would be
useful. In practice
I must have already implemented everything that it needed.
I would have liked to pursue this example further and see just how fast
the "which
state?" function can be made. For example, simplified versions of the
state borders
could be stored for fast lookup for points not near the borders. But I
ran out of
time. If anyone would like to experiment with this, the data can be
found at
http://chezphil.org/tmp/states.tgz .
Other general notes
-------------------
Operator Overloading : I am not a huge fan of "gratuitous" operator
overloading. In
this case, I find |= for union acceptable, particularly since I think I
can disable
this by not using the boost::polygon::operators namespace. I would
prefer to avoid
the duplicate operators (i.e. + == |).
However, I can't see any way to perform these boolean operations other
than using the
operators, and I think this is an ommission. If I were writing code
that I expected
someone else unfamiliar with the library to understand I would like the important
calls to be conspicuous.
** The library should provide functions to perform the boolean
operations, as well as
operators, for the benefit of those who want to make their code more verbose.
I dislike the use of operators with integers for the grow/shrink
operations. My
concern is that this meaning is non-obvious, and since these operators
are also
defined for boolean operations, there is potential for error:
polygon_set a, b, i;
...
for (int i = 0; i<10; ++i) {
...
a -= b+i; // oops, forgot i is now an int
}
Isotropic Style
---------------
I recall Luke describing his isotropic style on the list a long time
ago, and the
documentation would benefit from an explanation along those lines. One
"FAQ" that it
would need to address is the question of the performance of a run-time
selection of X
vs. Y compared to a compile-time selection in more conventional code.
I'm unsure if the use of caps for e.g. HORIZONTAL and VERTICAL enum
members is Boost
style. They look like macros to me.
I'm not very happy with the use of compass directions as direction
labels, since I
have code in which these directions are definitely not what the axes
actually mean.
Other labels including LEFT and RIGHT would have the same problem in
other contexts.
GUI APIs are inconsistent about whether y increases up or down the
screen, so terms
like "top left" are also problematic. I think about the only safe
terms are
"increasing X axis" etc. I tend to use x0y0 ... x1y1 for corners.
Documentation
-------------
I find the documentation satisfactory, though it could benefit from
some style
improvements (e.g. the font for the example code).
I suggest that an introductory tutorial should be added. This should
fully explain
everything that a new user who has no legacy code needs to know (i.e.
it ignores the
traits and uses the provided _data types.) Then, a second tutorial
(more like the
current material) could explain how users with their own types can use
them with the
library.
Answers to basic questions like "is it header-only?" and "what other
Boost libraries
does it depend on?" should also be added somewhere prominent in the introduction.
That concludes my practical examination of the library.
We then have the question of rival geometry libraries to consider.
Barend has presented a few previews of his library and I have looked at
these, though
not in much detail. My feeling was that Barend had a long way to go
before he would
have something useful, and that even then he may never produce
something that I would
actually be able to use. For example, he has decided to adopt the
conventions of the
OGC; this is not entirely unreasonable, but it means that semantics of
things like
whether border points are inside or outside a polygon are
non-negotiable. Barend has
also recently reminded us of the spatial indexes GSOC project and says
that it is now
included in his library, despite it being the worse code I have ever
seen presented to
Boost.
Despite all that, Barend's recent comments suggest that he has made significant
progress and that he may have something useful to show us in a few weeks.
Barend has also raised questions about the worse-case performance of Luke's
line intersection algorithm, and it seems that Luke accepts that it is
not optimal.
Achieving the best possible algorithmic complexity is something that I
would consider
very important or essential for a Boost library. At the very least,
Luke needs to
document the complexity of his current algorithm and explain why he
cannot do any
better. The views of experts should be sought if this cannot be
resolved by the
review manager.
In view of all this, I suggest that this library should be rejected for
now. This
will tell Barend that he still has an opportunity to present his
library for review,
and that it will be considered on a level playing field. If Barend's
library is
reviewed and found to be more complete, more performant and at least as
usable as this
library, then it should be accepted. On the other hand, if Barend's
library is found
to be deficient in some way (or is not submitted for review), then Luke
will have an
opportunity to resubmit an updated version of this library, which I
anticipate should
be accepted.
Regards, Phil.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk