Subject: Re: [boost] [GSoC] Boost.XML
From: Stefan Seefeld (seefeld_at_[hidden])
Date: 2010-03-22 15:02:05


On 03/20/2010 07:00 PM, Phil Endecott wrote:
> Ilie Halip wrote:
>> I have a few questions about the Boost.XML project.
>>
>> First, what actually needs to be done?
>
> Shall we have another thread about what a good C++ XML library would
> look like? It's been a while since the last one...
>
> I have done a couple of projects using rapidxml, and until recently my
> feeling was that it was close to the best design. If you're not
> familiar with it, it holds the XML in memory (e.g. as a memory-mapped
> file) and does a single-pass parse that builds up a tree that points
> into the original XML for the strings. This is fast and reasonably
> memory-efficient.

How does it deal with input that needs "preprocessing", such as entity
substitution or (X)inclusion?

Also, this clearly only works with immutable input.
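
To make the entity concern concrete, here is a minimal sketch. It assumes
only the five predefined XML entities, and decode_entities is a hypothetical
helper, not part of rapidxml or any existing library. The point is that the
decoded value is a different string from the raw slice of the source, so a
pointer into the unmodified input cannot represent it; the parser has to
either rewrite the buffer or allocate.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical table of the five predefined XML entities.
struct Entity {
    std::string_view name;
    char value;
};

constexpr Entity kEntities[] = {
    {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'},
    {"&quot;", '"'}, {"&apos;", '\''}
};

// Returns a decoded copy of `raw`. Unknown references are copied through
// unchanged; a real parser would also have to handle character references
// and entities declared in the DTD.
std::string decode_entities(std::string_view raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size();) {
        bool matched = false;
        if (raw[i] == '&') {
            for (const Entity& e : kEntities) {
                if (raw.substr(i, e.name.size()) == e.name) {
                    out.push_back(e.value);
                    i += e.name.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) {
            out.push_back(raw[i++]);
        }
    }
    return out;
}
```

The result is necessarily a new std::string here. An in-place variant would
have to write into the source buffer, which is exactly what a read-only
memory-mapped file rules out.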

>
> However, recently I needed something that used less memory. I wanted
> to process a very large file without ever having all of it in memory
> (imagine e.g. loading a database). So I wrote something where the
> element and attribute iterators (etc.) are just pointers into the
> (memory-mapped) XML source. When an iterator is incremented it steps
> through the source looking (textually) for the start of the next
> element or attribute (etc.). The result is something that uses almost
> no memory and is fast for the sorts of access pattern that I needed.

But to what degree is that really XML? In addition to the above
concerns, there are other aspects that may cause the generated
infoset to differ from the XML as stored. For example, attributes with
default values (as per the DTD) should arguably still be "seen" by an
attribute iterator, while a naive iteration over the explicitly
spelled-out attributes won't see them. Etc.
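
A minimal sketch of that difference follows. All names are hypothetical, and
the defaults are passed in as a plain map standing in for what a real parser
would collect from the DTD's ATTLIST declarations. The textual scan assumes
well-formed start tags with double-quoted values.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <string_view>

using AttrMap = std::map<std::string, std::string>;

// Naive textual scan of a start tag such as <item id="1">.
// Collects only the name="value" pairs spelled out in the source.
AttrMap explicit_attributes(std::string_view tag) {
    constexpr std::string_view ws = " \t\r\n";
    constexpr auto npos = std::string_view::npos;
    AttrMap attrs;
    std::size_t pos = tag.find_first_of(ws);  // skip "<name"
    while (pos != npos) {
        const std::size_t name_begin = tag.find_first_not_of(ws, pos);
        if (name_begin == npos) break;
        const std::size_t name_end = tag.find_first_of(" \t\r\n=", name_begin);
        if (name_end == npos) break;          // reached ">" or "/>"
        const std::size_t eq = tag.find('=', name_end);
        if (eq == npos) break;
        const std::size_t open = tag.find('"', eq);
        if (open == npos) break;
        const std::size_t close = tag.find('"', open + 1);
        if (close == npos) break;
        attrs.emplace(std::string(tag.substr(name_begin, name_end - name_begin)),
                      std::string(tag.substr(open + 1, close - open - 1)));
        pos = close + 1;
    }
    return attrs;
}

// What the infoset should expose: explicit attributes plus any declared
// defaults that were not overridden in the source.
AttrMap infoset_attributes(std::string_view tag, const AttrMap& dtd_defaults) {
    AttrMap attrs = explicit_attributes(tag);
    // Range insert keeps existing keys, so explicit values win.
    attrs.insert(dtd_defaults.begin(), dtd_defaults.end());
    return attrs;
}
```

With a declared default of lang="en", the tag <item id="1"> yields one
attribute from the textual scan but two in the infoset view, so an iterator
built purely on the source text cannot produce the second one without
consulting the DTD.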

In short, I can certainly see and appreciate cases where your approach
has advantages. The same is true for lots of other approaches. However,
none of these should really claim to be an XML API if it can't
support the full spec.

>
> An interesting observation is that both a rapidxml-like method and my
> new method could have very similar interfaces, albeit with different
> complexity (c.f. std::vector vs. std::list). So it is interesting to
> consider whether something like an XPath engine could be designed in
> terms of an interface to multiple back-end "XML containers", if they
> shared the same interface.
>
> In fact, something "XPath-like" but also more "C++-like" would be the
> next step to improve the "user" code in my application. Currently I
> have too much verbose iteration looking for the elements that I want.
> It would be great to have a XPath-like DSL for finding these
> elements. (An application for Proto?)

The same applies here. XPath is a well-defined specification. While I
can definitely see that not everyone needs all its features, I think it's a
very bad idea to even consider going down a route that leads to tons
of "XPath-like" APIs, all mutually incompatible in their features and
approaches.

     Stefan

-- 
       ...ich hab' noch einen Koffer in Berlin...
