From: Daryle Walker (darylew_at_[hidden])
Date: 2001-09-27 18:12:39
on 9/27/01 8:09 AM, dietmar_kuehl_at_[hidden] wrote:
> yesterday we had some discussion concerning XML parser interfaces
> for the Boost library. It was suggested to stick to the SAX and
> DOM "standards". Although I have an idea what SAX parsers look like
> in general, I haven't found any document specifying SAX as a
> standard! To my understanding, SAX and DOM are basically interfaces
> specified using OMG IDL and supposed to be realized according to
> corresponding language mappings.
I don't think SAX is a standard; it's just an interface designed by someone
that a lot of other people liked. I think DOM has some people at W3C
working on it.
> Since OMG's C++ mapping for IDL is, IMO, suboptimal and we don't
> really want to bind the parser to CORBA anyway, I'd suggest at
> least taking some liberties when doing the mapping. In general, I
> would realize something which is "event driven" although with
> reversed control: This is what we call an "iterator" in C++. The
> SAX approach is a "pushing" approach: You start the thing and it
> bombs you with events until it is done. Iterators use a "pulling"
> approach. You get handles for the sequence and when you feel like
> it, you obtain the next value. Implementing a pushing interface on
> top of a pulling interface is, obviously, trivial. The other way
> around is basically impossible (unless you store the data you get
> pushed in a sequence and move over this one).
The pulling model does sound better than a push. I think a generator is a
better description than an (input) iterator.
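
To make the pull-versus-push point concrete, here is a minimal C++ sketch
(assuming a C++17 compiler for std::optional). Everything in it is made up
for illustration: Event, EventSource, push_all and Recorder are hypothetical
names, not part of SAX, DOM, or any Boost library, and the "parser" simply
replays a canned list of events instead of tokenizing real XML. It only
shows that a SAX-like push loop is a few lines once a pull interface exists.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event type; the names are illustrative only.
enum class Kind { StartElement, Text, EndElement };

struct Event {
    Kind kind;
    std::string data;  // element name or character data
};

// Pull side: the caller asks for the next event whenever it wants one.
// Assumption: events are pre-built; a real source would tokenize input.
class EventSource {
public:
    explicit EventSource(std::vector<Event> events)
        : events_(std::move(events)) {}

    // Returns the next event, or std::nullopt once the input is exhausted.
    std::optional<Event> next() {
        if (pos_ == events_.size()) return std::nullopt;
        return events_[pos_++];
    }

private:
    std::vector<Event> events_;
    std::size_t pos_ = 0;
};

// Push side built on top of pull: drain the source and call the handler
// for each event, which is roughly what a SAX-style driver does.
template <class Handler>
void push_all(EventSource& source, Handler& handler) {
    while (auto ev = source.next()) {
        switch (ev->kind) {
            case Kind::StartElement: handler.start_element(ev->data); break;
            case Kind::Text:         handler.characters(ev->data);    break;
            case Kind::EndElement:   handler.end_element(ev->data);   break;
        }
    }
}

// Example handler that records the callbacks it receives.
struct Recorder {
    std::string log;
    void start_element(const std::string& name) { log += "<" + name + ">"; }
    void characters(const std::string& text)    { log += text; }
    void end_element(const std::string& name)   { log += "</" + name + ">"; }
};
```

A caller that wants the pull style just calls next() in its own loop and can
stop at any point; a caller that wants callbacks hands a handler to
push_all(). Going the other way (pull on top of push) would need the pushed
events to be buffered somewhere first, as noted in the quoted text.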
> My approach to an XML parser is to use a tokenizer to chop the
> XML sequence into digestible parts. On top of this tokenizer sits
> another iterator verifying well-formedness. Optionally,
> yet another iterator checking validity is used where both of these
> iterators implement the same concept (something like "XML object
> iterator"; an XML object can be an entity, its attributes, contents,
> etc. and the iterator will tell what it is currently sitting on
> using an appropriate accessor). Implementing either a SAX or a DOM
> interface using such an iterator is pretty simple. The only question
> is whether the iterator level is too low to do validation without
> accommodating data which is duplicated in higher-level interfaces
> and thus results in unnecessary overhead.
> ... and why do I want to reinvent the wheel rather than using
> Xerces, which is working after all, in the first place? Well, I was
> successful in using it but personally I consider it a pain. It uses
> idioms imported from some alien world which don't work too well in
> this alien world and work even less in C++. I used it quite a while
> ago but if I remember correctly, there were classes for lots of
> different things which aren't handled differently at all. The overall
> interface to do simple things was, IMO, too complex: I want a simple
> interface to do simple things. This saves me the complex interface
> for complex things.
> Since others stated that they will shift priority to do the XML
> stuff soon, this is what I'm doing, too: I hope to get at least a
> rough cut of my XML parser flying on the weekend which is in a form
> suitable for a broader audience.
> Somebody mentioned the possible desire of an XML writer. This is
> something I would realize on top of an interface/concept used to
> traverse trees (or even graphs: these can always be seen in form
> of trees by defining a start node and a traversal rule): DOM is just
> a tree, and writing it is traversing this tree with actions on the
> nodes. In general, writing an XML document is just writing a certain
> tree structure which is, however, not necessarily given as a DOM
> tree. BTW, if there is an efficient approach to tree traversals,
> this could very well become the foundation of an XPath component:
> XPath is nothing else than a form of regular expressions over a
> [slightly generalized (*)] tree structure and XPath has, IMO,
> application beyond XML! However, the tree traversal needed for XPath
> has additional requirements over the tree traversal needed to write
> an XML file: For XPath you need the possibility to go "up" from an
> arbitrary node while an XML writer only needs to go "down" or up to
> a node sitting on the path from the root to the current node,
> something which is conveniently handled by a stack.
> (*) The generalization over typical tree structures is the
> attributes axis.
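
The stack-based writer described in the quoted text could look roughly like
the C++ sketch below. It is only an illustration under stated assumptions:
Node and write_xml are hypothetical names (not a DOM binding or a Boost
interface), attributes are left out, text is emitted before child elements,
and no escaping of special characters is done. The stack holds exactly the
nodes on the path from the root to the current node, so the writer never
needs an "up" link in the tree itself.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree node; relies on std::vector accepting an incomplete
// element type at this point, which C++17 guarantees.
struct Node {
    std::string name;
    std::string text;            // written right after the start tag
    std::vector<Node> children;
};

// Writes the tree as XML using an explicit stack instead of recursion.
// Assumption: names and text need no escaping.
std::string write_xml(const Node& root) {
    std::string out;
    // Each entry: a node on the root-to-current path and the index of
    // the next child of that node still to be visited.
    std::vector<std::pair<const Node*, std::size_t>> path;

    out += "<" + root.name + ">" + root.text;
    path.emplace_back(&root, std::size_t{0});

    while (!path.empty()) {
        const Node* node = path.back().first;
        const std::size_t index = path.back().second;
        if (index < node->children.size()) {
            // Go "down": remember where to resume, then open the child.
            path.back().second = index + 1;
            const Node& child = node->children[index];
            out += "<" + child.name + ">" + child.text;
            path.emplace_back(&child, std::size_t{0});
        } else {
            // All children done: close the element and go back "up"
            // to the parent, which is the next entry on the stack.
            out += "</" + node->name + ">";
            path.pop_back();
        }
    }
    return out;
}
```

An XPath-style traversal would need more than this, since it has to move up
from an arbitrary node rather than only along the path kept on the stack.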
-- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT mac DOT com