
From: dietmar_kuehl_at_[hidden]
Date: 2001-09-27 03:09:21


Hi,

yesterday we had some discussion concerning XML parser interfaces
for the Boost library. It was suggested to stick to the SAX and
DOM "standards". Although I have an idea what SAX parsers look like
in general, I haven't found any document specifying SAX as a
standard! To my understanding, SAX and DOM are basically interfaces
specified using OMG IDL and supposed to be realized according to
the corresponding language mappings.

Since OMG's C++ mapping for IDL is, IMO, suboptimal and we don't
really want to bind the parser to CORBA anyway, I'd suggest at
least taking some liberties when doing the mapping. In general, I
would realize something which is "event driven" although with
reversed control: This is what we call an "iterator" in C++. The
SAX approach is a "pushing" approach: You start the thing and it
bombs you with events until it is done. Iterators use a "pulling"
approach. You get handles for the sequence and when you feel like
it, you obtain the next value. Implementing a pushing interface on
top of a pulling interface is, obviously, trivial. The other way
around is basically impossible (unless you store the data you get
pushed in a sequence and move over this one).
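
To illustrate why pushing on top of pulling is the easy direction, here
is a minimal sketch (C++11 or later). All names in it (Event,
EventCursor, Handler, push_all, Recorder) are made up for illustration
and are not part of SAX or of any proposed Boost interface; a pre-built
vector stands in for a real parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event type; the names are illustrative, not taken from SAX.
enum class Kind { StartElement, Text, EndElement };

struct Event {
    Kind kind;
    std::string data;  // element name or character data
};

// "Pulling" interface: the caller decides when to ask for the next event.
// A pre-built vector stands in for a real parser here.
class EventCursor {
public:
    explicit EventCursor(std::vector<Event> events) : events_(std::move(events)) {}
    bool done() const { return pos_ == events_.size(); }
    const Event& current() const { assert(!done()); return events_[pos_]; }
    void advance() { assert(!done()); ++pos_; }
private:
    std::vector<Event> events_;
    std::size_t pos_ = 0;
};

// "Pushing" interface in the style of a SAX handler.
struct Handler {
    virtual ~Handler() = default;
    virtual void start_element(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
    virtual void end_element(const std::string& name) = 0;
};

// Push on top of pull: a single loop that drains the cursor.
void push_all(EventCursor& cursor, Handler& handler) {
    for (; !cursor.done(); cursor.advance()) {
        const Event& e = cursor.current();
        switch (e.kind) {
            case Kind::StartElement: handler.start_element(e.data); break;
            case Kind::Text:         handler.characters(e.data);    break;
            case Kind::EndElement:   handler.end_element(e.data);   break;
        }
    }
}

// Example handler that simply records what it was told.
struct Recorder : Handler {
    std::string log;
    void start_element(const std::string& n) override { log += "<" + n + ">"; }
    void characters(const std::string& t) override { log += t; }
    void end_element(const std::string& n) override { log += "</" + n + ">"; }
};
```

Constructing an EventCursor over a few events and calling push_all with
a Recorder replays them into the handler in order. Going the other way
would require the handler to buffer every event it receives before a
cursor could hand them out one at a time.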

My approach to an XML parser is to use a tokenizer to chop the
XML sequence into digestible parts. On top of this tokenizer sits
another iterator verifying well-formedness. Optionally,
yet another iterator checking validity is used where both of these
iterators implement the same concept (something like "XML object
iterator"; an XML object can be an entity, its attributes, contents,
etc. and the iterator will tell what it is currently sitting on
using an appropriate accessor). Implementing either a SAX or a DOM
interface using such an iterator is pretty simple. The only question
is whether the iterator level is too low to do validation without
accommodating data which is duplicated in higher level interfaces
and thus results in unnecessary overheads.
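
The layering described above might look roughly like the following
sketch (C++11 or later). It is not the parser announced in this mail:
Token, Tokenizer, WellFormedCursor and is_well_formed are hypothetical
names, and the tokenizer is deliberately tiny (no attributes, comments,
entities or empty-element tags). The point is only that both layers
expose the same next() operation, so a validating layer could be
stacked on top in the same way.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class TokenKind { StartTag, EndTag, Text, End };

struct Token {
    TokenKind kind;
    std::string value;  // tag name or character data
};

// Lowest layer: chops the input into tags and runs of text.
class Tokenizer {
public:
    explicit Tokenizer(std::string input) : input_(std::move(input)) {}

    Token next() {
        if (pos_ >= input_.size()) return {TokenKind::End, ""};
        if (input_[pos_] != '<') {
            // Character data runs up to the next '<' or the end of input.
            std::size_t lt = input_.find('<', pos_);
            if (lt == std::string::npos) lt = input_.size();
            Token t{TokenKind::Text, input_.substr(pos_, lt - pos_)};
            pos_ = lt;
            return t;
        }
        std::size_t gt = input_.find('>', pos_);
        if (gt == std::string::npos) throw std::runtime_error("unterminated tag");
        bool closing = pos_ + 1 < gt && input_[pos_ + 1] == '/';
        std::size_t name_begin = pos_ + (closing ? 2 : 1);
        Token t{closing ? TokenKind::EndTag : TokenKind::StartTag,
                input_.substr(name_begin, gt - name_begin)};
        pos_ = gt + 1;
        return t;
    }
private:
    std::string input_;
    std::size_t pos_ = 0;
};

// Second layer: same next() interface, but checks that tags nest properly.
class WellFormedCursor {
public:
    explicit WellFormedCursor(Tokenizer source) : source_(std::move(source)) {}

    Token next() {
        Token t = source_.next();
        switch (t.kind) {
            case TokenKind::StartTag:
                open_.push_back(t.value);
                break;
            case TokenKind::EndTag:
                if (open_.empty() || open_.back() != t.value)
                    throw std::runtime_error("mismatched end tag: " + t.value);
                open_.pop_back();
                break;
            case TokenKind::End:
                if (!open_.empty())
                    throw std::runtime_error("unclosed element: " + open_.back());
                break;
            case TokenKind::Text:
                break;
        }
        return t;
    }
private:
    Tokenizer source_;
    std::vector<std::string> open_;  // names of the currently open elements
};

// Helper for trying it out: drains a cursor and reports whether it threw.
bool is_well_formed(const std::string& xml) {
    WellFormedCursor cursor{Tokenizer(xml)};
    try {
        while (cursor.next().kind != TokenKind::End) {}
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Note that the checking layer only keeps the names of the open elements,
which hints at the overhead question raised above: a validating layer
would need to keep more state, some of which a DOM built on top would
store again.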

... and why do I want to reinvent the wheel rather than using
Xerces, which is working after all, in the first place? Well, I was
successful in using it but personally I consider it a pain. It uses
idioms imported from some alien world which don't work too well in
this alien world and work even less in C++. I used it quite a while
ago but if I remember correctly, there were classes for lots of
different things which aren't handled differently at all. The overall
interface to do simple things was, IMO, too complex: I want a simple
interface to do simple things. This saves me the complex interface
for complex things.

Since others stated that they will shift priority to do the XML
stuff soon, this is what I'm doing, too: I hope to get at least a
rough cut of my XML parser flying on the weekend which is in a form
suitable for a broader audience.

Somebody mentioned the possible desire of an XML writer. This is
something I would realize on top of an interface/concept used to
traverse trees (or even graphs: these can always be seen in form
of trees by defining a start node and a traversal rule): DOM is just
a tree, and writing it is traversing this tree with actions on the
nodes. In general, writing an XML document is just writing a certain
tree structure which is, however, not necessarily given as a DOM
tree. BTW, if there is an efficient approach to tree traversals,
this could very well become the foundation of an XPath component:
XPath is nothing else than a form of regular expressions over a
[slightly generalized (*)] tree structure and XPath has, IMO,
application beyond XML! However, the tree traversal needed for XPath
has additional requirements over the tree traversal needed to write
an XML file: For XPath you need the possibility to go "up" from an
arbitrary node while an XML writer only needs to go "down" or up to
a node sitting on the path from the root to the current node,
something which is conveniently handled by a stack.

(*) The generalization over typical tree structures is the
    attributes axis.
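
As a rough illustration of the writer described above, the following
sketch (C++17, since Node holds a std::vector of itself) walks a tree
and keeps only the path from the root to the current node on an
explicit stack. Node and write_xml are hypothetical names, Node is not
a DOM type, and attributes and character escaping are left out.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tree node, used only for this illustration.
struct Node {
    std::string name;
    std::string text;            // character data written before the children
    std::vector<Node> children;
};

// Writes the tree as XML using an explicit stack instead of recursion.
// The stack holds exactly the nodes on the path from the root to the
// current node, so "going up" never needs a parent pointer.
std::string write_xml(const Node& root) {
    struct Frame {
        const Node* node;
        std::size_t next_child;  // index of the next child to visit
    };
    std::ostringstream out;
    std::vector<Frame> path;

    out << '<' << root.name << '>' << root.text;
    path.push_back({&root, 0});
    while (!path.empty()) {
        Frame& top = path.back();
        if (top.next_child < top.node->children.size()) {
            // Go "down": open the next child and make it the current node.
            const Node& child = top.node->children[top.next_child++];
            out << '<' << child.name << '>' << child.text;
            path.push_back({&child, 0});
        } else {
            // All children written: close the element and go "up".
            out << "</" << top.node->name << '>';
            path.pop_back();
        }
    }
    return out.str();
}
```

Any tree that can report a node's name and children could be written
this way. An XPath evaluator could not rely on such a stack alone,
because it may have to move up from a node it did not reach by walking
down from the root.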

--
<mailto:dietmar_kuehl_at_[hidden]> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>
