From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2006-10-28 12:22:20


Hi,

Once again I'm turning to the list for discussion about a design issue
in the XML library. This time I hope to avoid any discussion about the
implementation of the library and focus on the interface only.

The interface in question is the reader interface, also known as pull
interface. Like SAX, the pull interface is an event-based interface.
There are a few event types (roughly, StartElement, EndElement,
Characters, and a few more for other XML features), all of which come
with some additional data: the element name, the character data, etc.

There are two types of reader interfaces currently in use that I've
found. I've come up with a third. I'd like to know which one people on
this list would prefer, and where they see the strengths and weaknesses
of each. The names I've given them are my own.

1) The Monolithic Interface
Examples: .NET XmlReader, libxml2 xmlTextReader (modeled after the .NET
one), Java Common API for XML Pull Parsing (XmlPull) (not to be confused
with JSR 173 "StAX")

In the monolithic interface, the XML parser acts as a cursor over the
event stream. You call next() and it points to the next event in the
stream. From there, you can query its type (usually some integral
constants) and call some methods to retrieve the data. All methods are
always available on the object; calling one that is not appropriate for
the current event (e.g. getTagName() for a Characters event) returns a
null value or signals an error.

Pro: Event objects do not need to be allocated. The parser itself
contains the entire state and can, for example, be passed down through a
group of recursive functions.
Contra: You cannot store raw events: calling next() overwrites the
current data. The parser contains a lot of state. This interface does
not protect you in any way from calling inappropriate methods.
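
To make the shape of this interface concrete, here is a minimal C++
sketch. All names (MonolithicReader, EventType, tag_name(), text()) are
made up for illustration and do not come from any of the libraries
listed above. Instead of parsing XML, the reader simply replays a
prepared list of events, and I chose exceptions over null values for
inappropriate calls.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not an existing API.
enum class EventType { None, StartElement, EndElement, Characters };

class MonolithicReader {
public:
    // A real reader would tokenize XML input. This sketch replays a
    // prepared list of (type, payload) pairs to stay self-contained.
    using Raw = std::pair<EventType, std::string>;

    explicit MonolithicReader(std::vector<Raw> events)
        : events_(std::move(events)) {}

    // Advances the cursor. The previous event's data is overwritten.
    bool next() {
        if (pos_ >= events_.size()) return false;
        current_ = events_[pos_++];
        return true;
    }

    EventType type() const { return current_.first; }

    // Every accessor is always callable; misuse is only caught at run time.
    const std::string& tag_name() const {
        if (type() != EventType::StartElement && type() != EventType::EndElement)
            throw std::logic_error("tag_name() called on a non-element event");
        return current_.second;
    }

    const std::string& text() const {
        if (type() != EventType::Characters)
            throw std::logic_error("text() called on a non-Characters event");
        return current_.second;
    }

private:
    std::vector<Raw> events_;
    std::size_t pos_ = 0;
    Raw current_{EventType::None, ""};
};

// Concatenates all character data by driving the cursor to the end.
std::string collect_text(MonolithicReader& reader) {
    std::string out;
    while (reader.next())
        if (reader.type() == EventType::Characters) out += reader.text();
    return out;
}

// Usage example: the event stream of <a>hi</a>.
void run_monolithic_example() {
    std::vector<MonolithicReader::Raw> events = {
        {EventType::StartElement, "a"},
        {EventType::Characters, "hi"},
        {EventType::EndElement, "a"}};
    MonolithicReader reader(events);
    assert(collect_text(reader) == "hi");
}
```

Note that nothing stops a caller from writing reader.text() while the
cursor sits on a StartElement; the compiler accepts it and the error
only shows up when that line runs.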

2) The Inheritance Interface
Examples: JSR 173 "StAX"

In the inheritance interface, the event types are modeled as a group of
classes that all inherit from an Event base class. The parser acts as an
iterator, Java style; calling next() returns a reference/pointer to the
event object for this event. You use RTTI or a similar mechanism to find
the type of the event, then cast the reference to the appropriate
subclass. The subclasses then provide access to the data that is
actually available for this event type.

Pro: Cannot call methods that are inappropriate for the event. Event
objects are independent of the parser and can be stored as they are.
This is especially interesting if you have an event-based output system
that uses the same event type: in this case, you can store the events,
shuffle them, edit them, then pass them on to the writer. (A proper
output analog to the stateful monolithic parser is harder to design.)
The parser contains less state, as it does not need to store the data of
the current event.
Contra: Event objects need to be allocated on the heap. In a non-GC
language like C++, this is even more of a problem than in Java, as you
have to either use a smart pointer or make the user responsible for
deleting the object. The scenario of a group of functions mentioned
above is limited in that, if several functions want to process the
same event, they need to be passed the current event along with the parser.
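
A rough C++ rendering of this style might look like the following. The
class names (Event, StartElement, EndElement, Characters,
InheritanceReader) are my own placeholders, not StAX's actual types. I
assume std::unique_ptr as the smart pointer and dynamic_cast as the RTTI
mechanism, and the reader again replays prepared events rather than
parsing anything.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event hierarchy; the virtual destructor makes the base
// polymorphic so that dynamic_cast works.
struct Event {
    virtual ~Event() = default;
};
struct StartElement : Event {
    explicit StartElement(std::string n) : name(std::move(n)) {}
    std::string name;
};
struct EndElement : Event {
    explicit EndElement(std::string n) : name(std::move(n)) {}
    std::string name;
};
struct Characters : Event {
    explicit Characters(std::string t) : text(std::move(t)) {}
    std::string text;
};

class InheritanceReader {
public:
    explicit InheritanceReader(std::vector<std::unique_ptr<Event>> events)
        : events_(std::move(events)) {}

    // Hands out the next event, or a null pointer at end of input.
    // Ownership moves to the caller, so the event outlives the reader.
    std::unique_ptr<Event> next() {
        if (pos_ >= events_.size()) return nullptr;
        return std::move(events_[pos_++]);
    }

private:
    std::vector<std::unique_ptr<Event>> events_;
    std::size_t pos_ = 0;
};

// Builds a reader over the event stream of <a>hi</a>.
// Each event is a separate heap allocation.
InheritanceReader make_sample_reader() {
    std::vector<std::unique_ptr<Event>> events;
    events.push_back(std::make_unique<StartElement>("a"));
    events.push_back(std::make_unique<Characters>("hi"));
    events.push_back(std::make_unique<EndElement>("a"));
    return InheritanceReader(std::move(events));
}

// Concatenates all character data; the cast selects the subclass.
std::string collect_text(InheritanceReader& reader) {
    std::string out;
    while (std::unique_ptr<Event> ev = reader.next()) {
        if (auto* chars = dynamic_cast<Characters*>(ev.get()))
            out += chars->text;
    }
    return out;
}

// Usage example: events can be kept after the reader has moved on.
void run_inheritance_example() {
    InheritanceReader reader = make_sample_reader();
    std::unique_ptr<Event> first = reader.next();
    assert(collect_text(reader) == "hi");
    assert(dynamic_cast<StartElement*>(first.get()) != nullptr);
}
```

Once the cast has succeeded, only the members of that subclass are
reachable, so asking a Characters event for a tag name does not compile.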

3) The Variant Interface
Examples: None. I believe I came up with this entirely on my own.

The variant interface seeks to combine the strengths of the other two
interfaces. It uses a non-monolithic interface, that is, the parser acts
like an iterator and the data is not stored within it. It does not
return a reference to the event object, though, but instead a
boost::variant of all possible events. This way, heap allocation of the
event object is avoided, together with all the trouble coming with that.
The event type can be determined either by calling variant::which, or
with a variant visitor (type-safe!), or with a special get_base()
function that works like get() but can retrieve a reference to a common
base of all the variant types. (This is possible, although an
implementation does not exist in Boost.)

Pro: Cannot call methods that are inappropriate. The visitor system
allows type-safe usage. (Of course, it also loses you part of the
advantage of a pull interface over a push interface.) Does not need heap
allocation if the event classes are properly designed (i.e. they do not
trigger the case where the variant allocates heap memory). Events can be
stored, copied (another advantage over the inheritance interface, which
would require a clone() method for that), and manipulated at will. They
can be pre-allocated and passed by reference to next(), which saves even
the stack allocation.
Contra: The issue with a group of functions, described for the
inheritance interface, still applies.
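
As a sketch of what I have in mind, the code below uses C++17's
std::variant as a stand-in for boost::variant, purely to keep the
example free of external dependencies: index() plays the role of
which(), and std::visit the role of a variant visitor. The event types,
VariantReader, and the EventBase with its line member are hypothetical,
and get_base() is only one possible way to write the function described
above. The std::string members may still allocate on their own; the
point is only that the event objects themselves live inside the variant.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical event types sharing a small common base.
struct EventBase {
    explicit EventBase(std::size_t l = 0) : line(l) {}
    std::size_t line;  // assumed: source line on which the event occurred
};
struct StartElement : EventBase {
    explicit StartElement(std::size_t l = 0, std::string n = {})
        : EventBase(l), name(std::move(n)) {}
    std::string name;
};
struct EndElement : EventBase {
    explicit EndElement(std::size_t l = 0, std::string n = {})
        : EventBase(l), name(std::move(n)) {}
    std::string name;
};
struct Characters : EventBase {
    explicit Characters(std::size_t l = 0, std::string t = {})
        : EventBase(l), text(std::move(t)) {}
    std::string text;
};

// std::variant stands in for boost::variant here.
using XmlEvent = std::variant<StartElement, EndElement, Characters>;

class VariantReader {
public:
    explicit VariantReader(std::vector<XmlEvent> events)
        : events_(std::move(events)) {}

    // Copies the next event into caller-provided storage.
    // Returns false at end of input and leaves 'out' untouched.
    bool next(XmlEvent& out) {
        if (pos_ >= events_.size()) return false;
        out = events_[pos_++];
        return true;
    }

private:
    std::vector<XmlEvent> events_;
    std::size_t pos_ = 0;
};

// Visitor with one overload per event type. Leaving one out is a
// compile-time error rather than a run-time one.
struct TextCollector {
    std::string text;
    void operator()(const StartElement&) {}
    void operator()(const EndElement&) {}
    void operator()(const Characters& c) { text += c.text; }
};

// One possible get_base(): a reference to the common base of whichever
// alternative is currently stored.
const EventBase& get_base(const XmlEvent& ev) {
    return std::visit(
        [](const auto& e) -> const EventBase& { return e; }, ev);
}

// Concatenates all character data, reusing one event object throughout.
std::string collect_text(VariantReader& reader) {
    TextCollector collector;
    XmlEvent ev;  // pre-allocated once, refilled by every next() call
    while (reader.next(ev)) std::visit(collector, ev);
    return collector.text;
}
```

Storing an event is then an ordinary copy, for example XmlEvent saved =
ev;, with no clone() method and no pointer ownership to think about.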

Independently of the type of interface chosen, another issue is
important: the scope of the interface. Should it report all XML events,
including those coming from DTD parsing? Should this be a user choice,
or should there perhaps be two interfaces, one "high-level" and one
"low-level"? Should errors be reported as error events, or as
exceptions? Should this, too, be a user choice? How about warnings:
exceptions are inappropriate for them. Should it be possible to disable
them completely?

All comments are welcome.

Sebastian Redl

