From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2006-06-24 05:52:16


Hi,

I've given a lot of thought lately to Boost.Xml and how it could be
implemented.
In the typical lex/yacc-like scanner/parser separation, XML requires a
very complicated scanner: if the parser is supposed to be streaming, the
scanner must replace every entity reference with its replacement token
stream. This is because references can appear almost anywhere, and the
XML grammar mostly does not account for them directly.
Entity reference substitution is rather complicated, as are the rules
affecting it. For example, pure token stream insertion by the scanner
doesn't work: the scanner must keep track of whether an entity is
allowed in a certain place (they may appear only /almost/ anywhere),
which depends not only on the previous tokens but also, for example, on
whether the current context is an internal or an external DTD subset. In
addition, the scanner should check several constraints of the
replacement text, such as parentheses nesting in content expressions.
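
To make the two kinds of check concrete, here is a minimal C++ sketch. It
is not a real scanner; the names DtdSubset, pe_reference_allowed and
parens_properly_nested are made up for illustration, and the rules are
simplified to the one case I know for certain: in the internal subset a
parameter-entity reference may appear only between markup declarations,
not inside one.

```cpp
#include <cassert>
#include <string>

// Which DTD subset the scanner is currently reading (hypothetical name).
enum class DtdSubset { Internal, External };

// Simplified context rule: between declarations a parameter-entity
// reference such as %name; is always accepted; inside a declaration it is
// accepted only in the external subset.
inline bool pe_reference_allowed(DtdSubset subset, bool inside_declaration) {
    if (!inside_declaration) return true;
    return subset == DtdSubset::External;
}

// Checks that '(' and ')' in a replacement text are properly nested, so a
// group opened inside the entity is also closed inside it.
inline bool parens_properly_nested(const std::string& replacement) {
    int depth = 0;
    for (char c : replacement) {
        if (c == '(') {
            ++depth;
        } else if (c == ')') {
            if (--depth < 0) return false;  // closes a group it did not open
        }
    }
    assert(depth >= 0);
    return depth == 0;
}
```

A real scanner would need many more such context flags; the point is only
that the decision cannot be made from the reference text alone.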

The original idea was to implement Boost.Xml using Spirit with the
existing XML grammar. I've given up on that. I believe it is not
possible, with any reasonable effort, to implement a fully compliant XML
parser with the merged scanner/parser system that is natural to Spirit.
The grammar would have to account for entity references in too many
places, and replacing character references at the scanner level would
cause problems with the characters representing <, > and &, among other
issues.
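
The character reference problem can be shown with a small C++ sketch. The
function name expand_char_refs_naively is made up, and it handles only
decimal references below 128; it exists purely to show what goes wrong
when the substitution is done on raw text before the grammar sees it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Naively replaces decimal character references (&#NN;) in raw text.
// Only values 1..127 are handled; everything else is copied unchanged.
inline std::string expand_char_refs_naively(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in.compare(i, 2, "&#") == 0) {
            std::size_t j = i + 2;
            unsigned value = 0;
            // Accumulate digits; stop early once the value is out of range.
            while (j < in.size() && in[j] >= '0' && in[j] <= '9' && value < 128) {
                value = value * 10 + static_cast<unsigned>(in[j] - '0');
                ++j;
            }
            const bool has_digits = j > i + 2;
            if (has_digits && j < in.size() && in[j] == ';' &&
                value > 0 && value < 128) {
                out += static_cast<char>(value);
                i = j + 1;
                continue;
            }
        }
        out += in[i];
        ++i;
    }
    assert(out.size() <= in.size());  // expansion never lengthens the text
    return out;
}
```

Given "a &#60;b&#62; c", this returns "a <b> c". The reference stood for a
literal '<' in character data, but a grammar working on the expanded
characters can no longer tell it apart from the start of a tag.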

So I'm wondering, does Spirit support in any way the separation of
scanner and parser? Is it possible to write a Spirit grammar
specification that acts on some token type instead of characters? How
much effort would that be? Has anyone done something similar previously?
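
To illustrate what I mean by a parser that acts on a token type, here is
a hand-written C++ sketch. It does not use Spirit and says nothing about
what Spirit can do; the Token struct and the tags_match function are made
up for illustration, and the check is reduced to tag nesting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type produced by a separate scanner.
struct Token {
    enum Kind { StartTag, EndTag, Text } kind;
    std::string value;  // tag name, or character data for Text
};

// Checks that start and end tags are properly nested and that names match.
// The parser sees only tokens, so any '<' inside a Text token is plain data.
inline bool tags_match(const std::vector<Token>& tokens) {
    std::vector<std::string> open;  // names of elements not yet closed
    for (const Token& t : tokens) {
        switch (t.kind) {
        case Token::StartTag:
            open.push_back(t.value);
            break;
        case Token::EndTag:
            if (open.empty() || open.back() != t.value) return false;
            open.pop_back();
            break;
        case Token::Text:
            break;
        }
    }
    return open.empty();
}
```

With this split, the scanner alone decides what counts as markup, and the
grammar never has to mention entity or character references.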

Sebastian Redl

