From: Daniel Walker (daniel.j.walker_at_[hidden])
Date: 2006-04-30 15:45:02


On 4/29/06, Sebastian Redl <sebastian.redl_at_[hidden]> wrote:
> Daniel Walker wrote:
>
> >On 4/24/06, Marcin Kalicinski <kalita_at_[hidden]> wrote:
> >
> >
> >>My knowledge of XML is limited, but I think Dan Nuffer's parser will
> >>parse any valid XML. read_xml however discards all that goes beyond nodes,
> >>attributes, data and comments.
> >>
> >>
> >
> >Isn't the property_tree XML parser originally based on Dan Nuffer's?
> >Couldn't the productions/tokens from the Nuffer parser be added back
> >to read_xml() so that it could at least accept the syntax for all XML
> >files even if it doesn't implement the semantics? I think the runtime
> >overhead of the additional productions in the grammar would be
> >negligible for simple XML files that don't use the features and
> >necessary for XML files that do. It seems to me this could clarify the
> >scope of the parser. The documentation could read something like:
> >
> >"read_xml() performs non-validating parsing of the W3C recommendation
> >XML 1.1. In addition, as of version 1.3x, read_xml() parses but
> >ignores the following W3C specifications: XML Names, XInclude,
> >XLink/XPointer, XML Schema, XSLT, ..."
> >
> >... changing version numbers as appropriate. Also, it may simplify
> >maintenance as far as pulling bug-fixes/enhancements from the Nuffer
> >parser code-base to property_tree.
> >
> >
> The property tree's parser is, I believe, either a very slightly modified
> Dan Nuffer parser (just semantic actions were added, compared to the
> file I've seen), or built on the same principle: direct translation of
> the grammar spec in the XML specification. It is, with the exception of
> missing entities, a complete non-validating parser of the XML spec, as
> far as I can see, with the important exception of character set
> compatibility: the parser parses only files in the character set
> specified by the current global locale, and will completely ignore the
> character set specification of the header.

If the missing entities were added, then couldn't we just call it a
non-validating XML parser? The character set issues could be mentioned
as a caveat in the documentation.

> Another missing part may be
> the parsing of the internal DTD subset, which might be (not sure yet) a
> required thing for non-validating parsers.

I've tested this a little, and read_xml() does accept some DTDs. DTDs
are part of the XML specification, though validation is not required.

> In addition, it is an XML 1.0 parser.

The Nuffer parser is XML 1.0, right? If that's the case, why not just
re-incorporate the missing features from the Nuffer parser? Then the
documentation could say read_xml() is an XML 1.0 parser. That seems
less confusing to me.

> The Namespaces in XML, XInclude, XLink, XPointer, ... specifications are
> all built on top of XML; they are all well-formed XML. "Parsing but
> ignoring" them means nothing and can only lead to misunderstandings.

"Parsing but ignoring" may not be the best phrase. I wanted to
indicate that, though the parser recognizes the constructs used in
namespaces, includes, etc. (because, yes, they are valid XML), it
doesn't act on them. For example, it parses the xmlns attribute on an
element but doesn't generate a namespace-qualified name for that
element, and the element's children don't inherit the namespace.

Daniel Walker

