Subject: Re: [boost] New XML library
From: Cory Nelson (phrosty_at_[hidden])
Date: 2008-12-10 17:13:32


On Wed, Dec 10, 2008 at 1:23 PM, Phil Endecott
<spam_from_boost_dev_at_[hidden]> wrote:
> Themis Vassiliadis wrote:
>>
>> I have been working in a C++ library like Apache Digester
>> (http://commons.apache.org/digester). I'm intending to convert it
>> following boost policies described in Requirements and Guidelines.
>>
>> What are the chances of it become a Boost library ?
>
> Personally I would like to see something like RapidXML in Boost.
>
> It seems that Apache Digester provides an element matching infrastructure.
> This could be useful, as manually iterating through the parse tree that
> something like RapidXML generates can be a bit tiresome. It should probably
> be layered on top of a lower-level XML parser.
>

I have a low level iterator-based parser here:
http://svn.int64.org/viewvc/int64/xml/

The design I've been taking is something like this:

parser.hpp (xml::parser): the lowest level. Given two UTF-32
compatible forward iterators, it returns one of (ok, done, need_more,
error), a node type (element/xmldecl/etc.), and an iterator range.
This parser performs no allocations, and as such does minimal
structural checking. It does, however, offer full character validation
if you opt in via a template parameter. Really this does only
slightly more than a lexer, and is available if you need top
performance and don't need full XML compliance and validation.
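
To make that interface concrete, here is a rough sketch of what a
lexer-level call of this shape could look like. It is not the code from
the repository above: the names (sketch::next_token, token, node_type)
are made up for illustration, and it only recognizes start tags, end
tags and text, with no attributes, comments or character validation.

```cpp
#include <string>

namespace sketch {  // hypothetical names, not the actual xml::parser API

enum class status { ok, done, need_more, error };
enum class node_type { none, element_start, element_end, text };

template <typename Iter>
struct token {
    status    st;
    node_type type;
    Iter      first;  // start of the node's name or text
    Iter      last;   // one past the end of the node's name or text
    Iter      next;   // where the following call should resume
};

// Scans a single node from [first, last). Only forward-iterator
// operations are used, nothing is allocated, and no well-formedness
// checks are made beyond rejecting '<' inside a tag.
template <typename Iter>
token<Iter> next_token(Iter first, Iter last)
{
    if (first == last)
        return {status::done, node_type::none, first, first, first};

    if (*first != U'<') {
        // Character data runs up to the next '<' or the end of input.
        Iter it = first;
        while (it != last && *it != U'<') ++it;
        return {status::ok, node_type::text, first, it, it};
    }

    Iter name = first;
    ++name;  // skip '<'
    if (name == last)
        return {status::need_more, node_type::none, first, first, first};

    node_type type = node_type::element_start;
    if (*name == U'/') { type = node_type::element_end; ++name; }

    Iter it = name;
    while (it != last && *it != U'>') {
        if (*it == U'<')
            return {status::error, node_type::none, first, it, it};
        ++it;
    }
    if (it == last)  // the tag is cut off; the caller should supply more
        return {status::need_more, node_type::none, first, first, first};

    Iter after = it;
    ++after;  // skip '>'
    return {status::ok, type, name, it, after};
}

}  // namespace sketch
```

The caller loops by passing token::next back in as the new start. On
need_more, next is left at the start of the incomplete node, so nothing
is lost when more input arrives.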

reader.hpp (xml::reader): the next level. A UTF-32 push parser that
is fully XML 1.0 and 1.1 compliant, capable of validating the
document, tracking line/column numbers, entity substitution, and other
normal things you'd expect from a parser.

document.hpp (xml::document): a full in-memory document. There is a
modifiable version, and a constant version that uses an arena
allocator to stay as compact as possible.

As of now, only xml::parser is usable, with everything but DTD parsing
complete. I have been really busy these past few months and haven't
had a chance to finish it. The main goal I had when beginning this
was to have something I/O agnostic, that can drop out when it finds an
incomplete stream and be resumed later. It was really important that
it work just as well when parsing from memory, blocking I/O, or
async I/O.
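
As an illustration of the drop-out-and-resume idea, here is a minimal
caller-side sketch. The class name chunk_feeder and its interface are
hypothetical, it only extracts the contents of complete "<...>" tags,
and unlike the parser itself it allocates for its buffer. It only
shows how incomplete input can be held back until the next chunk
arrives, wherever that chunk comes from.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {  // hypothetical helper, not part of the library

class chunk_feeder {
    std::u32string pending_;  // input not yet consumed (a partial tag)

public:
    // Appends a chunk and returns the contents of every tag that is
    // now complete. A tag cut off at the end of the chunk stays
    // buffered until a later call supplies the rest.
    std::vector<std::u32string> feed(const std::u32string& chunk)
    {
        pending_ += chunk;
        std::vector<std::u32string> tags;
        std::size_t pos = 0;

        for (;;) {
            const std::size_t open = pending_.find(U'<', pos);
            if (open == std::u32string::npos) {
                pos = pending_.size();  // only text left; this sketch drops it
                break;
            }
            const std::size_t close = pending_.find(U'>', open);
            if (close == std::u32string::npos) {
                pos = open;  // incomplete tag: drop out and wait for more
                break;
            }
            tags.push_back(pending_.substr(open + 1, close - open - 1));
            pos = close + 1;
        }

        pending_.erase(0, pos);  // keep only the unconsumed tail
        return tags;
    }

    bool has_partial_tag() const { return !pending_.empty(); }
};

}  // namespace sketch
```

The same feed() call works whether the chunk came from a memory
buffer, a blocking read, or an async completion handler, since the
feeder never performs I/O itself.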

It should also be very performant, and it is: the parser is so
lightweight that UTF-8 decoding is actually a huge bottleneck in my
tests. That led me to allow the parser (via a template parameter) to
work directly with UTF-8 if you don't require full compliance.
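
For a sense of the per-character work that decoding adds in front of
a UTF-32 parser, here is a minimal, unoptimized sketch of a
validating UTF-8 decoder. The function name and status enum are
illustrative and are not taken from the library.

```cpp
#include <cstdint>
#include <string>

namespace sketch {  // illustrative only, not the library's decoder

enum class decode_status { ok, need_more, error };

// Decodes one code point from [it, last). On ok, it is advanced past
// the sequence and out holds the code point. On need_more or error,
// it and out are left untouched so the caller can retry or bail out.
template <typename Iter>
decode_status decode_utf8(Iter& it, Iter last, char32_t& out)
{
    if (it == last) return decode_status::need_more;

    Iter p = it;
    const std::uint32_t lead = static_cast<unsigned char>(*p);
    ++p;

    std::uint32_t cp = 0;   // code point being assembled
    std::uint32_t min = 0;  // smallest value allowed for this length
    int trail = 0;          // number of continuation bytes expected

    if (lead < 0x80)                { cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trail = 1; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trail = 2; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trail = 3; min = 0x10000; }
    else return decode_status::error;  // stray continuation or bad lead byte

    for (int i = 0; i < trail; ++i) {
        if (p == last) return decode_status::need_more;  // sequence cut off
        const std::uint32_t b = static_cast<unsigned char>(*p);
        if ((b & 0xC0) != 0x80) return decode_status::error;
        cp = (cp << 6) | (b & 0x3F);
        ++p;
    }

    // Reject overlong forms, surrogates, and values past U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return decode_status::error;

    out = static_cast<char32_t>(cp);
    it = p;
    return decode_status::ok;
}

}  // namespace sketch
```

Every input character goes through these branches and checks before
the parser even sees it, which is the cost that working directly on
UTF-8 code units avoids.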

-- 
Cory Nelson
