From: Matt Gruenke (mgruenke_at_[hidden])
Date: 2006-10-29 03:35:28


Sebastian Redl wrote:

>The interface in question is the reader interface, also known as pull
>interface. Like SAX, the pull interface is an event-based interface.
>
>

This confused me. I've always heard event-driven or callback-based
interfaces described as "push", since the user's code gets invoked by an
external event source. Do I correctly understand that you're talking
about a SAX-like interface (in that it processes the document in-order,
and limits visibility to one node at a time) that's "pull" (i.e. user
code calls the parser) instead of "push" (i.e. parser calls user
provided methods)?
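
To make the distinction concrete, here is a minimal sketch of the two styles as I understand them. All of the names (Event, PushHandler, parsePush, PullReader) are made up for illustration, and the "parser" just walks a pre-built list of events rather than parsing any XML.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical event record; a real parser would carry far more detail.
enum class EventType { StartTag, Characters, EndTag };
struct Event {
    EventType type;
    std::string text;  // tag name or character data
};

// "Push": the parser drives, and calls back into user-provided methods.
struct PushHandler {
    virtual ~PushHandler() {}
    virtual void onEvent(const Event& e) = 0;
};

void parsePush(const std::vector<Event>& doc, PushHandler& handler) {
    for (std::size_t i = 0; i < doc.size(); ++i)
        handler.onEvent(doc[i]);  // parser calls user code
}

// "Pull": the user drives, and asks the parser for the next event.
class PullReader {
public:
    explicit PullReader(const std::vector<Event>& doc) : doc_(doc), pos_(0) {}
    // Returns false once the document is exhausted.
    bool next(Event& out) {
        if (pos_ >= doc_.size()) return false;
        out = doc_[pos_++];
        return true;
    }
private:
    const std::vector<Event>& doc_;
    std::size_t pos_;
};

// The same task (counting start tags) written against each style.
struct TagCounter : PushHandler {
    int count;
    TagCounter() : count(0) {}
    void onEvent(const Event& e) override {
        if (e.type == EventType::StartTag) ++count;
    }
};

int countTagsPush(const std::vector<Event>& doc) {
    TagCounter counter;
    parsePush(doc, counter);
    return counter.count;
}

int countTagsPull(const std::vector<Event>& doc) {
    PullReader reader(doc);
    Event e;
    int count = 0;
    while (reader.next(e))  // user code calls the parser
        if (e.type == EventType::StartTag) ++count;
    return count;
}
```

Both versions see the document in order, one event at a time; the only difference is who owns the loop.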

>There are two types of reader interfaces currently in use that I've
>found.
>

You mean two types of "pull" interfaces, right?

>1) The Monolithic Interface
>
>
>All methods are
>always available on the object; calling one that is not appropriate for
>the current event (e.g. getTagName() for a Characters event) returns a
>null value or signals an error.
>
>
>Contra: You cannot store raw events: calling next() overwrites the
>current data. The parser contains a lot of state. This interface does
>not protect you in any way from calling inappropriate methods.
>
>

I don't see any fundamental reason why such an interface can't support
inheritance.

If I were using an interface that let me iterate over the document, I'd
at least want to be able to decide whether to get the next sibling node
or the next child node. A pull interface like this could not only
support copying the current node, but could even copy entire subtrees
(e.g. copyUntilNextSibling()), though this operation would require
dynamic memory allocation.

>2) The Inheritance Interface
>
>
>Contra: Event objects need to be allocated on the heap.
>

Why does an inheritance-based parser need to store objects on the heap?
If the memory is owned by the parser, it can pre-allocate a temporary
object for each type of node. Based on the type of node, it fills the
temporary object of the appropriate type, and returns a const ref to
that object.

If the caller wants a copy, the copy would only have to be
heap-allocated if copied via some virtual function in the node
base-class. For the concrete classes (obtained via dynamic-casting),
copy constructors and assignment operators would work just fine.
Another option (as you point out) is returning a shared_ptr, though this
would slightly complicate the parser's management of its temporary objects.
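
Roughly what I have in mind, as a minimal sketch. The class names (Node, Element, Characters, Reader) and the readElement()/readCharacters() entry points are hypothetical; a real parser would pick the node type from the input rather than being told which one to produce.

```cpp
#include <cassert>
#include <string>

// Hypothetical node hierarchy; names are illustrative only.
struct Node {
    virtual ~Node() {}
};
struct Element : Node {
    std::string tagName;
};
struct Characters : Node {
    std::string text;
};

// Toy "parser" that owns one reusable temporary per node type.
// The node objects themselves are never heap-allocated per event
// (the strings inside them may still allocate).
class Reader {
public:
    // Pretend to parse an element: fill the pre-allocated Element
    // and hand back a const reference to it.
    const Node& readElement(const std::string& name) {
        element_.tagName = name;
        return element_;
    }
    // Same idea for character data.
    const Node& readCharacters(const std::string& text) {
        chars_.text = text;
        return chars_;
    }
private:
    Element element_;
    Characters chars_;
};

// Caller-side copy: downcast to the concrete type, then use ordinary
// assignment. The copy lives wherever the caller puts it.
bool copyIfElement(const Node& n, Element& out) {
    if (const Element* e = dynamic_cast<const Element*>(&n)) {
        out = *e;  // plain copy assignment, no virtual clone()
        return true;
    }
    return false;
}
```

The reference returned by the reader is only valid until the next read of the same node type, so a caller who wants to keep the data has to copy it, as copyIfElement() does.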

>It does not
>return a reference to the event object, though, but instead a
>boost::variant of all possible events.
>

That could be big, depending on how much text you buffer. Not only
would it waste memory, but memcpy'ing around all of that could waste
some of the performance savings gained by avoiding heap allocation.
Maybe RVO eliminates some or all of the performance penalty, but it's
probably unwise to depend so much on RVO.

Of course, passing in the result might be the solution - at least to the
performance issues. A parser that allows the user to pass in the result
would also facilitate copying subtrees, if your node type has addChild()
and addSibling() methods.
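
As a rough sketch of the pass-in-the-result idea applied to subtrees: the caller owns a node and the reader fills it. TreeNode, Token and SubtreeReader are invented for illustration, only addChild() is shown, and the token stream is assumed to be well-formed (no checking that end tags match).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical caller-owned node type with an addChild() method.
struct TreeNode {
    std::string name;
    std::vector<TreeNode> children;
    TreeNode& addChild(const std::string& childName) {
        children.push_back(TreeNode());
        children.back().name = childName;
        return children.back();
    }
};

// Stand-in for real parser input: a flat list of start/end tags.
struct Token {
    bool open;         // true = start tag, false = end tag
    std::string name;
};

class SubtreeReader {
public:
    explicit SubtreeReader(const std::vector<Token>& tokens)
        : tokens_(tokens), pos_(0) {}

    // Reads one complete element (start tag through its end tag) into
    // 'out', which the caller owns. Returns false if the next token is
    // not a start tag. Assumes well-formed input.
    bool readSubtree(TreeNode& out) {
        if (pos_ >= tokens_.size() || !tokens_[pos_].open) return false;
        out.name = tokens_[pos_++].name;
        out.children.clear();
        // Each nested start tag becomes a child, filled in recursively.
        while (pos_ < tokens_.size() && tokens_[pos_].open) {
            TreeNode& child = out.addChild(std::string());
            readSubtree(child);
        }
        if (pos_ < tokens_.size()) ++pos_;  // consume this element's end tag
        return true;
    }
private:
    const std::vector<Token>& tokens_;
    std::size_t pos_;
};
```

Because the result is passed in, nothing large is returned by value, and the caller decides where the copied subtree lives.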

>Independently of the type of interface chosen, another issue is
>important: the scope of the interface. Should it report all XML events,
>including those coming from DTD parsing?
>

Why re-invent more than necessary? Use DOM and/or pick some other,
existing object models (unless you have specific issues which they don't
address).

>Should errors be reported as error events, or as
>exceptions? Should this, too, be a user choice?
>

I think the biggest reason to avoid exceptions would be the
performance impact. I don't know whether the difference would be
significant, in the case of XML parsing. However, to get the full
performance benefit, I think you'd need to use empty exception
specifications - in which case the choice would have to be made at
compile-time (at the latest).

Perhaps there are some other benefits to using an iostreams-style
error-handling model, where the parser is treated like a stream.
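
A minimal sketch of what a stream-like model might look like, assuming a sticky failure flag that is testable in a condition, with exceptions as an opt-in (loosely modelled on std::ios::exceptions()). StreamLikeReader and its toy rule that '!' is a syntax error are purely illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical reader that reports errors the way std::istream does:
// a sticky failure flag, testable via operator bool, exceptions opt-in.
class StreamLikeReader {
public:
    explicit StreamLikeReader(const std::string& input)
        : input_(input), pos_(0), failed_(false), throwOnFail_(false) {}

    // Opt in to (or out of) throwing when a failure occurs.
    void exceptions(bool enable) { throwOnFail_ = enable; }

    bool fail() const { return failed_; }
    explicit operator bool() const { return !failed_; }

    // Toy "parse" step: reads one character and treats '!' as a syntax
    // error. Reading past the end of input also sets the failure flag.
    StreamLikeReader& next(char& out) {
        if (failed_) return *this;  // sticky, like a failed stream
        if (pos_ >= input_.size() || input_[pos_] == '!') {
            failed_ = true;
            if (throwOnFail_) throw std::runtime_error("parse error");
            return *this;
        }
        out = input_[pos_++];
        return *this;
    }
private:
    std::string input_;
    std::size_t pos_;
    bool failed_;
    bool throwOnFail_;
};

// Usage mirrors the familiar "while (stream >> x)" idiom.
std::string readAll(StreamLikeReader& reader) {
    std::string result;
    char c;
    while (reader.next(c)) result += c;
    return result;
}
```

With this shape, the choice between error states and exceptions becomes a run-time setting on the reader rather than two separate interfaces.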

>How about warnings:
>exceptions are inappropriate for them. Should it be possible to disable
>them completely?
>
>

What's a warning? A document is either well-formed, or it's not. The
only possible distinction that comes to mind is perhaps treating bad
syntax as errors and validation failures as warnings. However, you
could basically get the same effect by providing a switch to disable
validation. That way, rather than just ignore warnings, users who don't
care about validation failures could disable validation and maybe also
save some runtime overhead.

Matt

