Subject: Re: [boost] [GSoC] Boost.XML
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2010-03-22 16:35:45
Hi Stefan,
First let me say that I fully understand that there are many different
applications of XML. I get the feeling that you and I have probably
encountered different subsets of them. My belief is that there are
different legitimate types of XML library to support the different
kinds of application. An open question is whether a common API, or at
least a common API-subset or collection of concepts, could support
those different libraries. My recollection from before was that I felt
a libxml2 wrapper could not be usefully-compatible with the approaches
that I preferred, but you disagreed. I don't think it would be useful
to re-visit that "bikeshed discussion" now, not least because I have
forgotten most of the details...
Stefan Seefeld wrote:
> On 03/20/2010 07:00 PM, Phil Endecott wrote:
>> I have done a couple of projects using rapidxml, and until recently my
>> feeling was that it was close to the best design. If you're not
>> familiar with it, it holds the XML in memory (e.g. as a memory-mapped
>> file) and does a single-pass parse that builds up a tree that points
>> into the original XML for the strings. This is fast and reasonably
>> memory-efficient.
>
> How does it deal with input needing "preprocessing", such as entity
> substitution, or (X)inclusion ?
"it" here meaning rapidxml. In one mode of operation, it replaces
entities (e.g. &lt;) during parsing; this obviously breaks the idea of
not using much RAM, since the mmapped file will copy-on-write pages as
this happens. In another mode it doesn't do this and leaves it as a
job for the user.
In my library I have an iterator that processes a text node decoding
entities as they are encountered. This currently only recognises the
"default" entities i.e. lt, gt, amp, quot, apos and numerics. It would
be possible to extend this to decode entities declared in the document,
if that were necessary, but it's not something I've ever needed to do.
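To make the idea concrete, here is a minimal sketch of that kind of decoding, assuming C++17 and UTF-8 output. It is written as a plain function rather than an iterator to keep it short, and the names decode_text, decode_reference and append_utf8 are invented for this illustration; they are not taken from rapidxml or from the library described here. Unrecognised references are passed through unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: append a Unicode code point to `out` as UTF-8.
inline void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode the text between '&' and ';'. Appends to `out` and returns true
// only for the five predefined entities and numeric character references.
inline bool decode_reference(std::string_view name, std::string& out) {
    if (name == "lt")   { out += '<';  return true; }
    if (name == "gt")   { out += '>';  return true; }
    if (name == "amp")  { out += '&';  return true; }
    if (name == "quot") { out += '"';  return true; }
    if (name == "apos") { out += '\''; return true; }
    if (name.size() >= 2 && name[0] == '#') {
        const bool hex = (name[1] == 'x');
        const std::string_view digits = name.substr(hex ? 2 : 1);
        if (digits.empty()) return false;
        std::uint32_t cp = 0;
        for (char c : digits) {
            std::uint32_t d;
            if (c >= '0' && c <= '9')             d = static_cast<std::uint32_t>(c - '0');
            else if (hex && c >= 'a' && c <= 'f') d = static_cast<std::uint32_t>(c - 'a' + 10);
            else if (hex && c >= 'A' && c <= 'F') d = static_cast<std::uint32_t>(c - 'A' + 10);
            else return false;
            cp = cp * (hex ? 16u : 10u) + d;
            if (cp > 0x10FFFF) return false;  // outside the Unicode range
        }
        append_utf8(out, cp);
        return true;
    }
    return false;  // e.g. an entity declared in the document: not handled
}

// Decode a text node; anything that is not a recognised reference is copied as-is.
inline std::string decode_text(std::string_view text) {
    std::string out;
    out.reserve(text.size());
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '&') {
            const std::size_t semi = text.find(';', i + 1);
            if (semi != std::string_view::npos &&
                decode_reference(text.substr(i + 1, semi - i - 1), out)) {
                i = semi + 1;
                continue;
            }
        }
        out += text[i++];
    }
    return out;
}
```

Because the decoded text goes into a separate string, the source buffer is never written to, so a memory-mapped file would not incur the copy-on-write cost mentioned above.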
I believe that a lot of XML features like entity declarations and
namespaces declared outside the root element are painful precisely
because they are tedious to implement, detrimental to performance, and
never used in real-world XML documents. My guess is that you would
disagree with that.
Neither rapidxml nor my library supports xinclude. In my case, I can
imagine adding it by modifying the element iterator such that
dereferencing an xi:include element would open the referenced document
and return its root element.
> Also, this clearly only works with immutable input.
I think rapidxml lets you modify a document; it must allocate storage
for the new strings somewhere and update its tree to point to them. My
library does not allow this. I don't think I've ever needed to modify
an XML document: I have only either read in or written out a file.
>> However, recently I needed something that used less memory. I wanted
>> to process a very large file without ever having all of it in memory
>> (imagine e.g. loading a database). So I wrote something where the
>> element and attribute iterators (etc.) are just pointers into the
>> (memory-mapped) XML source. When an iterator is incremented it steps
>> through the source looking (textually) for the start of the next
>> element or attribute (etc.). The result is something that uses almost
>> no memory and is fast for the sorts of access pattern that I needed.
>
> But to what degree is that really XML ? In addition to the above
> concerns, there are other aspects that may result in the generated
> infoset to differ from the XML storage. For example, attributes with
> default values (as per DTD), should arguably still be "seen" by an
> attribute iterator, while a naive iteration over the explicitly
> spelled-out attributes won't. Etc.
Default attribute values defined in a DTD are an excellent example of
an XML misfeature that is not used in any XML application that I care
about and that simply results in XML processors being more complex and
slower than they would otherwise need to be. (Please feel free to list
any XML applications that make use of them.)
However, I wouldn't say that these features are fundamentally
incompatible with my approach in this library. It's only necessary
that when you look up an attribute, the returned range somehow includes
pseudo-elements corresponding to the default attributes.
> In short, I can certainly see and appreciate cases where your approach
> has advantages. The same is true for lots of other approaches. However,
> none of these should really claim to be an XML API, if it doesn't allow
> to support the full spec.
OK, I won't call it an XML API if that will make you happy :-)
For the record, rapidxml doesn't even support namespaces. Nor does
pugixml. I do support namespaces but it involves work on the user-side
if you want to recognise namespace declarations below the root
element. Pugixml, like my library, will not even successfully skip
over the DOCTYPE in some cases due to its complex syntax.
>> An interesting observation is that both a rapidxml-like method and my
>> new method could have very similar interfaces, albeit with different
>> complexity (c.f. std::vector vs. std::list). So it is interesting to
>> consider whether something like an XPath engine could be designed in
>> terms of an interface to multiple back-end "XML containers", if they
>> shared the same interface.
>>
>> In fact, something "XPath-like" but also more "C++-like" would be the
>> next step to improve the "user" code in my application. Currently I
>> have too much verbose iteration looking for the elements that I want.
>> It would be great to have a XPath-like DSL for finding these
>> elements. (An application for Proto?)
>
> The same applies here. XPath is a well defined specification. While I
> can definitely see not everyone needing all its features, I think it's a
> very bad idea to even consider going down that route where you get tons
> of "XPath-like" APIs, all mutually incompatible in their features and
> approaches.
OK, I won't call it an "XPath-like" API. I'll just call it a
convenient syntax for extracting the interesting elements from an
XMLwithoutthemisfeatures document.
Regards, Phil.