From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2006-09-06 20:08:23


A few months ago I said I'd take on writing an XML library for Boost.
Well, I've finally got some time on my hands and started with a little bit
of brainstorming about the library.

I've written up my thoughts in this hopefully halfway comprehensible
document and would like to hear everyone's suggestions, opinions, advice,
requirements, etc.

Especially real-world requirements, as my own are just those of a single
person, and that isn't exactly a good basis for a general-purpose XML
library.

The current brainstorming is only for the XML reading side of things. The
writing side will come afterwards.

So here goes.

Purpose of document:
Identify important decisions in the design of a C++ XML parser library.

1) Pull-API (StAX), Push-API (SAX), Object-Model-API (DOM)?

- All of them, of course! The main question is, which one is the base API?
- DOM is out of the question (performance/memory overhead).
- Implementing a push parser on top of a pull parser is trivial:
        while(fetchEvent()) pushEvent()
- Implementing a pull parser on top of a push parser requires at least
        generator-style coroutines. This incurs a performance overhead at best,
        unusability at worst (in limited environments).
- It is therefore best to use a pull model at the lowest level, although this
        makes the parser implementation more complex.
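
To make the push-on-top-of-pull loop above concrete, here is a minimal sketch.
Event, PullParser, Handler, pushAll and Recorder are names made up for this
illustration, not a proposed API, and the "parser" merely replays a prepared
list of events instead of tokenizing real XML.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event type; a real parser would carry far more information.
struct Event {
    enum Kind { StartElement, EndElement, Text } kind;
    std::string data;  // element name or character data
};

// Stand-in for a pull parser: replays canned events so that only the
// control flow is shown.
class PullParser {
public:
    explicit PullParser(std::vector<Event> events) : events_(std::move(events)) {}

    // Fetches the next event into `out`; returns false at end of document.
    bool fetchEvent(Event& out) {
        if (pos_ == events_.size()) return false;
        out = events_[pos_++];
        return true;
    }

private:
    std::vector<Event> events_;
    std::size_t pos_ = 0;
};

// SAX-style callback interface (push model).
struct Handler {
    virtual ~Handler() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void text(const std::string& chars) = 0;
};

// The push driver: the whole adapter is one loop over the pull parser.
void pushAll(PullParser& parser, Handler& handler) {
    Event ev{};
    while (parser.fetchEvent(ev)) {
        switch (ev.kind) {
            case Event::StartElement: handler.startElement(ev.data); break;
            case Event::EndElement:   handler.endElement(ev.data);   break;
            case Event::Text:         handler.text(ev.data);         break;
        }
    }
}

// Example handler that records the callbacks it receives as a string.
struct Recorder : Handler {
    std::string log;
    void startElement(const std::string& n) override { log += "<" + n + ">"; }
    void endElement(const std::string& n) override { log += "</" + n + ">"; }
    void text(const std::string& c) override { log += c; }
};
```

The reverse direction has no such loop: a push parser owns the control flow,
so handing events out one at a time means suspending it somehow, which is
where the coroutine requirement comes from.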

2) Pull Interface
There are several models of pull interfaces in use. These are for Java and
other languages, so a C++ parser does not necessarily have to use any of them.

Existing APIs:
--- Java
- XMLPull
- StAX (JSR 173)
- Xerces XNI Pull Configuration
--- .Net
- .Net XmlReader
--- Python
- Python has very little material on pull parsers. There seem to be some
        available, but they're not popular.
--- Ruby
- Ruby has a built-in XML library with pull support. The API is not yet
        but seems to resemble .Net's XmlReader.
--- C++
- An early version of XPP has a C++ implementation. The interface is

API Styles:
From the above APIs, we can gather the following:
- Pull parsing always involves calling a method to obtain the next piece of
        document information (called "event"), then processing that piece.
- Two main models seem typical:
-- StAX has a nextEvent() method that returns a reference to an object,
        identified by a base interface XMLEvent. This reference can then be cast to
        the appropriate sub-interface. This is the polymorphism approach.
-- XMLPull and .Net's XmlReader also have a next()/read() method. However,
        they do not return an event object but instead store the information internally,
        to be queried by special methods. This model is also used by REXML, Ruby's
        parser. This is the monolith model.
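
For comparison, here is roughly how the two models might look when transcribed
to C++ (C++14 or later). This is only a sketch: PolymorphicReader,
MonolithReader and the event types are invented names, and both readers replay
canned input rather than parse anything.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class Kind { StartElement, Text };

// Canned input shared by both sketches: (kind, payload) pairs.
using Item = std::pair<Kind, std::string>;
using Canned = std::vector<Item>;

// --- Polymorphism style: every event is an object behind a base interface.
struct XmlEvent { virtual ~XmlEvent() = default; };
struct StartElementEvent : XmlEvent { std::string name; };
struct TextEvent : XmlEvent { std::string chars; };

class PolymorphicReader {
public:
    explicit PolymorphicReader(Canned in) : in_(std::move(in)) {}

    // Allocates and returns the next event; null at end of document.
    std::unique_ptr<XmlEvent> nextEvent() {
        if (pos_ == in_.size()) return nullptr;
        const Item& item = in_[pos_++];
        if (item.first == Kind::StartElement) {
            auto ev = std::make_unique<StartElementEvent>();
            ev->name = item.second;
            return ev;
        }
        auto ev = std::make_unique<TextEvent>();
        ev->chars = item.second;
        return ev;
    }

private:
    Canned in_;
    std::size_t pos_ = 0;
};

// --- Monolith style: the reader itself holds the current event's data.
class MonolithReader {
public:
    explicit MonolithReader(Canned in) : in_(std::move(in)) {}

    // Advances to the next event; returns false at end of document.
    bool next() {
        if (pos_ == in_.size()) return false;
        ++pos_;
        return true;
    }
    Kind kind() const { return current().first; }

    // Valid only while positioned on a StartElement; throws otherwise.
    const std::string& name() const {
        if (kind() != Kind::StartElement) throw std::logic_error("not on a start element");
        return current().second;
    }

private:
    const Item& current() const {
        if (pos_ == 0) throw std::logic_error("next() has not been called");
        return in_[pos_ - 1];
    }
    Canned in_;
    std::size_t pos_ = 0;
};

// Collects element names with the polymorphic reader.
std::string namesPolymorphic(const Canned& in) {
    PolymorphicReader reader(in);
    std::string out;
    while (auto ev = reader.nextEvent()) {
        // The cast selects the sub-interface; after that, access is type-safe.
        if (auto* start = dynamic_cast<StartElementEvent*>(ev.get())) out += start->name + ";";
    }
    return out;
}

// Collects element names with the monolith reader.
std::string namesMonolith(const Canned& in) {
    MonolithReader reader(in);
    std::string out;
    while (reader.next()) {
        // No cast, but calling name() on a Text event would throw.
        if (reader.kind() == Kind::StartElement) out += reader.name() + ";";
    }
    return out;
}
```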

Polymorphism pro:
- State is not held in parser object. Calling next() does not necessarily
        discard the old information.
- Once the correct interface is obtained, all methods on it are guaranteed to
        work. With the monolith model, calling the wrong method may lead to
        exceptions or error returns.

Monolith pro:
- Does not need to allocate an object for each parse event. Can in fact hold
        information in a very compact way internally.
- No casts necessary.

- What are the options that a C++ API has?
- Polymorphism-style API. Return smart pointer? Returning allocated object in
        raw pointer unacceptable. Returning pointer to static storage possible, but
        is basically monolith in disguise. Cannot pass an existing object IN to be
        filled with data.
- Monolith-style API. Means, among other things, that passing the current
        event to a function means passing a reference to the entire parser. Furthermore,
        it is not possible to pass a small object containing only the data from the
        actual event to a function, unless that object is written by the API user.
        Thus, every function would have to either have its own switch on the event
        type or assert that the passed-in object contains the right data.
- Union of events, like a Boost.Variant. This seems a good compromise between
        polymorphic and monolithic approaches:
        - State not in parser object, but separate.
        - No dynamic allocation: Variant is usually stack-based.
        - Obtain the actual object from it and use that. Check is required only
                once, other uses are statically checked.
        - Can pass in the variant as an out parameter, saving even the copy.
        Of course, the variant has downsides, mainly that you have to either cast or
        use a static visitor.
        - Other downsides?
- Other approaches?
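
A rough sketch of the variant idea follows. It uses C++17's std::variant as a
stand-in for Boost.Variant so that the example needs no external library; with
Boost.Variant, the visitor would derive from boost::static_visitor and be
applied with boost::apply_visitor instead. VariantReader, the event structs and
the helper functions are invented names, and the reader replays canned events.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct StartElement { std::string name; };
struct EndElement   { std::string name; };
struct Text         { std::string chars; };

// One value that holds exactly one of the event types. The variant object
// itself lives wherever the caller puts it, typically on the stack.
using Event = std::variant<StartElement, EndElement, Text>;

class VariantReader {
public:
    explicit VariantReader(std::vector<Event> events) : events_(std::move(events)) {}

    // Fills the caller's variant (out parameter); false at end of document.
    bool next(Event& out) {
        if (pos_ == events_.size()) return false;
        out = events_[pos_++];
        return true;
    }

private:
    std::vector<Event> events_;
    std::size_t pos_ = 0;
};

// Static visitor: one overload per event type, all checked at compile time.
struct Printer {
    std::string& out;
    void operator()(const StartElement& e) const { out += "<" + e.name + ">"; }
    void operator()(const EndElement& e) const { out += "</" + e.name + ">"; }
    void operator()(const Text& e) const { out += e.chars; }
};

// Visitor route: dispatch every event to the matching overload.
std::string render(VariantReader& reader) {
    std::string out;
    Event ev;  // reused for every event
    while (reader.next(ev)) std::visit(Printer{out}, ev);
    return out;
}

// "Cast" route: check the type once, then work with the typed object.
std::size_t countStartElements(VariantReader& reader) {
    std::size_t n = 0;
    Event ev;
    while (reader.next(ev)) {
        if (std::get_if<StartElement>(&ev)) ++n;
    }
    return n;
}
```

Note that a function such as Printer::operator()(const StartElement&) receives
only the data of the one event, not the parser, which is the point raised
against the monolith model above.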

3) Input/Output System
How does the library access underlying storage?

- Since it needs to access resources from various sources, typically
        specified as URLs, it needs a flexible and runtime-switchable input system.
- In particular, it should be possible to plug scheme resolvers in at runtime,
        so that program extensions can provide support for, say, the ftp: scheme.
- Two basic options:
        - Iterator-based approach.
        - Stream-based approach.
        - Other?
- Iterators are tricky to switch at runtime, and non-trivial to implement.
- Streams are easier to implement, especially in a polymorphic fashion, but
        they are a poor abstraction of things like memory-mapped files.
        Does that matter?
- Streams, not necessarily being random-access, require caching for
        libraries like Spirit to work. Alternative: hand-write the parser. XML is
        not, in my opinion, particularly suited to being implemented with Spirit.
- Is it even possible to have iterators model non-blocking I/O?
- Having tried a few experiments, I favour streams. Iterators are somewhat
        awkward to work with in a hand-written parser, especially as they always
        need to be passed in pairs (or as a range).
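
As an illustration of the stream-based option with resolvers that can be
plugged in at runtime, here is a minimal sketch (C++14 or later). InputSystem,
Resolver, resolveMem and the "mem:" prefix are all made up for the example;
std::istream merely stands in for whatever stream abstraction the library
would end up using.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <map>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// A resolver turns a URL into a readable stream.
using Resolver = std::function<std::unique_ptr<std::istream>(const std::string& url)>;

class InputSystem {
public:
    // Registers (or replaces) the resolver for a prefix such as "file" or "ftp".
    void registerScheme(const std::string& scheme, Resolver resolver) {
        resolvers_[scheme] = std::move(resolver);
    }

    // Opens a URL by dispatching on the text before the first ':'.
    std::unique_ptr<std::istream> open(const std::string& url) const {
        const std::string::size_type colon = url.find(':');
        if (colon == std::string::npos) throw std::invalid_argument("no prefix in: " + url);
        const auto it = resolvers_.find(url.substr(0, colon));
        if (it == resolvers_.end()) throw std::runtime_error("no resolver for: " + url);
        return it->second(url);
    }

private:
    std::map<std::string, Resolver> resolvers_;
};

// Example resolver for an invented "mem:" prefix: everything after the
// prefix is treated as the document text itself.
std::unique_ptr<std::istream> resolveMem(const std::string& url) {
    assert(url.compare(0, 4, "mem:") == 0);
    return std::make_unique<std::istringstream>(url.substr(4));
}

// What a hand-written parser would do with the stream: pull characters one
// at a time, with no begin/end pair to carry around.
std::string readAll(std::istream& in) {
    std::string out;
    char c;
    while (in.get(c)) out += c;
    return out;
}
```

An extension could call registerScheme("ftp", ...) at load time without the
parser knowing anything about FTP, which is harder to arrange when the parser
is templated on an iterator type.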

4) Integration With Other Boost Libraries
What other Boost libraries should Xml work/integrate with?

- For example, does it make sense to provide an interface to the parser
        that can
        be used for parsing streaming content? Either non-blocking, with the option
        to parse partial data and hop back on missing content, or a completely
        asynchronous implementation that dispatches SAX events through e.g. ASIO?
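
To illustrate what "parse partial data and hop back on missing content" could
look like from the caller's side, here is a deliberately simplified sketch.
IncrementalScanner and its members are invented names; it only splits input
into tags and runs of text, ignoring comments, CDATA sections and '>' inside
attribute values, and it does not involve ASIO at all.

```cpp
#include <cassert>
#include <string>

class IncrementalScanner {
public:
    enum class Status { Token, NeedMoreData };

    // Appends a chunk of input as it arrives, e.g. from a socket.
    void feed(const std::string& chunk) { buffer_ += chunk; }

    // Extracts the next complete token into `token`. If the buffered data
    // ends in the middle of a token, nothing is consumed and the caller is
    // asked to feed more data and call again.
    Status next(std::string& token) {
        if (buffer_.empty()) return Status::NeedMoreData;
        const bool isTag = buffer_[0] == '<';
        const std::string::size_type end =
            isTag ? buffer_.find('>')   // a tag ends at '>'
                  : buffer_.find('<');  // text ends before the next '<'
        if (end == std::string::npos) return Status::NeedMoreData;
        const std::string::size_type len = isTag ? end + 1 : end;
        token = buffer_.substr(0, len);
        buffer_.erase(0, len);
        return Status::Token;
    }

private:
    std::string buffer_;
};
```

The caller never blocks: it feeds whatever bytes are available, drains tokens
until NeedMoreData comes back, and then returns to its event loop.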

5) Parser Back-End / Library Organization

- Should Boost.Xml be a complete XML solution, with a parser, DOM
        and everything?
- Or should it be split into two parts, one being a parser, the other a DOM
        implementation with various construction modes?
- Or should even the core parser be split into the actual text parser and the
        event/pull/whatever interface, so that an HTML or YAML or PYX parser or even
        an algorithmic content generator can be placed behind?
- What, then, is the interface between that parser and the user interface?

6) Other Issues ???
