Subject: Re: [boost] [GSoC] Boost.XML
From: Stefan Seefeld (seefeld_at_[hidden])
Date: 2010-03-22 17:08:02


On 03/22/2010 04:35 PM, Phil Endecott wrote:
> Hi Stefan,
>
> First let me say that I fully understand that there are many different
> applications of XML. I get the feeling that you and I have probably
> encountered different subsets of them. My belief is that there are
> different legitimate types of XML library to support the different
> kinds of application.

While I agree with that, that wasn't quite my point. Rather, I tried to
point out that you can't support only a subset of XML and still
claim to provide an XML library.

>> How does it deal with input needing "preprocessing", such as entity
>> substitution, or (X)inclusion ?
>
> "it" here meaning rapidxml. In one mode of operation, it replaces
> entities (i.e. &lt;) during parsing; this obviously breaks the idea of
> not using much RAM since the mmaped file will copy-on-write pages as
> this happens. In another mode it doesn't do this and leaves it as a
> job for the user.

If the user is exposed to it, I would argue this is not a sufficient API
to call itself "XML bindings". The spec has some rather specific
discussion on what ought to be done at parsing, and what the result
would be (e.g. http://www.w3.org/TR/xml-infoset/). I strongly object to
an "XML library" that offers something else. (To be clear: I certainly
don't object to such libraries in themselves, but please don't confuse
"XML" with "XML-like".)

>
> In my library I have an iterator that processes a text node decoding
> entities as they are encountered. This currently only recognises the
> "default" entities i.e. lt, gt, amp, quot, apos and numerics. It
> would be possible to extend this to decode entities declared in the
> document, if that were necessary, but it's not something I've ever
> needed to do.

Fine. Again, the XML spec clearly defines when and how entities ought to
be handled (http://www.w3.org/TR/REC-xml/#entproc). And to the degree
that this processing is specified, an XML library ought to honor it.
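
To make the "default entities" case concrete, here is a minimal sketch of
decoding the five predefined entities and numeric character references in
character data. It is not code from rapidxml or from Phil's library; the
names decode_text, resolve_reference, is_xml_char and append_utf8 are
hypothetical, and UTF-8 output is an assumption. Entities declared in the
document's DTD are deliberately not handled, so this alone is not the full
entity processing the spec describes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Append the UTF-8 encoding of code point `cp` to `out`.
void append_utf8(std::string& out, unsigned long cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// True if `cp` matches the Char production of XML 1.0.
bool is_xml_char(unsigned long cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Resolve the name found between '&' and ';' and append its replacement
// text to `out`. Returns false for anything other than a predefined
// entity or a well-formed numeric character reference.
bool resolve_reference(const std::string& name, std::string& out) {
    if (name == "lt")   { out += '<';  return true; }
    if (name == "gt")   { out += '>';  return true; }
    if (name == "amp")  { out += '&';  return true; }
    if (name == "quot") { out += '"';  return true; }
    if (name == "apos") { out += '\''; return true; }
    if (name.size() < 2 || name[0] != '#') return false;

    const bool hex = (name[1] == 'x');   // XML requires a lowercase 'x'
    const unsigned base = hex ? 16u : 10u;
    std::size_t i = hex ? 2 : 1;
    if (i == name.size()) return false;  // "&#x;" has no digits

    unsigned long cp = 0;
    for (; i < name.size(); ++i) {
        const char c = name[i];
        unsigned digit;
        if (c >= '0' && c <= '9')             digit = static_cast<unsigned>(c - '0');
        else if (hex && c >= 'a' && c <= 'f') digit = static_cast<unsigned>(c - 'a') + 10u;
        else if (hex && c >= 'A' && c <= 'F') digit = static_cast<unsigned>(c - 'A') + 10u;
        else return false;
        cp = cp * base + digit;
        if (cp > 0x10FFFF) return false;  // also prevents overflow
    }
    if (!is_xml_char(cp)) return false;
    append_utf8(out, cp);
    return true;
}

// Decode all references in `in` in a single pass, so the output of one
// reference is never re-scanned for further references.
std::string decode_text(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '&') { out += in[i++]; continue; }
        const std::size_t semi = in.find(';', i + 1);
        if (semi == std::string::npos)
            throw std::runtime_error("unterminated reference");
        if (!resolve_reference(in.substr(i + 1, semi - i - 1), out))
            throw std::runtime_error("unknown or malformed reference");
        i = semi + 1;
    }
    return out;
}
```

With this sketch, decode_text("a &lt; b") yields "a < b", while a reference
such as "&nbsp;" is rejected with an exception because nothing here reads
declarations from a DTD.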

>
> I believe that a lot of XML features like entity declarations and
> namespaces declared not in the root element are painful precisely
> because they are tedious to implement, detrimental to performance, and
> never used in real-world XML documents. My guess is that you would
> disagree with that.

I don't disagree, but I think that the world doesn't need yet another
library that supports some Not-Quite-XML.

>
> Neither rapidxml nor my library supports xinclude. In my case, I can
> imagine adding it by modifying the element iterator such that
> dereferencing an xi:include element would open the referenced document
> and return its root element.

That, too, is not conforming to the XML spec
(http://www.w3.org/TR/xinclude/#processing).

>
>> Also, this clearly only works with immutable input.
>
> I think rapidxml lets you modify a document; it must allocate storage
> for the new strings somewhere and update its tree to point to them.
> My library does not allow this. I don't think I've ever needed to
> modify an XML document: I have only either read in or written out a file.

Again: that's fine, and I agree it would be great for a boost.xml
library to optimize for that case. However, I don't think it should
optimize for it by disallowing the infoset to be modified.

> Default attribute values defined in a DTD are an excellent example of
> an XML misfeature not used in any XML application that I care about
> that simply result in XML processors being more complex and slower
> than they would otherwise need to be. (Please feel free to list any
> XML applications that make use of them.)

Same argument. You may not care, but others do.

>
> However, I wouldn't say that these features are fundamentally
> incompatible with my approach in this library. It's only necessary
> that when you look up an attribute, the returned range somehow
> includes pseudo-elements corresponding to the default attributes.

I certainly expect an attribute iterator to make no distinction between
explicitly specified attributes and default attributes. The XML spec has
a clear definition of an InfoSet, and of which parts of an XML file are
semantically relevant and which are not. I want boost.xml to honor those
semantics.
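
As a rough illustration of what I mean for attributes, here is a minimal
sketch, assuming the parser has already collected the attributes written in
the start-tag and the defaults declared in the DTD for that element type.
The types and the function effective_attributes are hypothetical and not a
proposed boost.xml API. The "specified" flag mirrors the [specified]
property of an attribute information item in the infoset; iteration over
the result treats both kinds of attribute alike.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// One attribute as seen by the user of the library.
struct Attribute {
    std::string value;
    bool specified;  // true if written in the start-tag, false if defaulted
};

typedef std::map<std::string, std::string> NameValueMap;
typedef std::map<std::string, Attribute> AttributeMap;

// Merge the attributes found in the start-tag with the DTD defaults.
// A default is used only when the start-tag did not give that attribute.
AttributeMap effective_attributes(const NameValueMap& in_start_tag,
                                  const NameValueMap& dtd_defaults) {
    AttributeMap result;
    for (NameValueMap::const_iterator it = in_start_tag.begin();
         it != in_start_tag.end(); ++it) {
        Attribute a = { it->second, true };
        result.insert(std::make_pair(it->first, a));
    }
    for (NameValueMap::const_iterator it = dtd_defaults.begin();
         it != dtd_defaults.end(); ++it) {
        Attribute a = { it->second, false };
        // insert() leaves an existing entry untouched, so explicitly
        // specified values take precedence over defaults.
        result.insert(std::make_pair(it->first, a));
    }
    // Every attribute from the start-tag must be present in the result.
    assert(result.size() >= in_start_tag.size());
    return result;
}
```

A caller iterating over the returned map sees one uniform range and can
ignore the flag entirely; it is there only for code that needs to know
where a value came from.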

Thanks,
         Stefan

-- 
       ...ich hab' noch einen Koffer in Berlin...
