From: Daryle Walker (darylew_at_[hidden])
Date: 2001-09-26 15:13:43


You use some spare money to start reading up on XML, and look what
happens....

1. Regression

I suggested using XML as the output for the regression tests. We could do a
quick-and-dirty printing of XML code with the current HTML code, but I don't
think it's worth it. Eventually, we could output just XML and use an XSLT
conversion to HTML, but that's a long way off.

2. XML

For now we can assume that the XML files just contain the basic C++
characters. This is convenient for the popular modern platforms, since they
all grok ASCII (but not necessarily Latin-1 or Unicode) and the basic C++
characters are an ASCII subset. People with EBCDIC will be able to use the
initial library too, but will run into problems when we start the Unicode
transition.

We could use an event-based model, so the size of the files we can handle
isn't limited by memory. Also, a memory-tree model can be built over an
event-based model, but not the other way around.
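
To make the layering concrete, here is a minimal sketch of a memory-tree
builder sitting on top of an event interface. Every name in it (EventHandler,
Node, TreeBuilder) is made up for illustration and is not taken from any
existing library; attributes and error handling are left out.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical event interface: the parser would call these as it reads.
struct EventHandler {
    virtual ~EventHandler() = default;
    virtual void start_element(const std::string& name) = 0;
    virtual void end_element(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// Hypothetical tree node: an element name, its text, and its children.
struct Node {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Builds a memory tree purely from the events it receives.
class TreeBuilder : public EventHandler {
public:
    // Start with an unnamed document node at the bottom of the stack.
    TreeBuilder() : root_(new Node) { stack_.push_back(root_.get()); }

    void start_element(const std::string& name) override {
        std::unique_ptr<Node> child(new Node);
        child->name = name;
        Node* raw = child.get();
        stack_.back()->children.push_back(std::move(child));
        stack_.push_back(raw);  // the new element is now the current one
    }

    void end_element(const std::string&) override {
        // Never pop the document node itself.
        if (stack_.size() > 1) stack_.pop_back();
    }

    void characters(const std::string& text) override {
        stack_.back()->text += text;
    }

    const Node& document() const { return *root_; }

private:
    std::unique_ptr<Node> root_;
    std::vector<Node*> stack_;  // path from the document node to the current element
};
```

A caller that only wants events implements EventHandler directly and never
pays for the tree; a caller that wants the tree plugs in TreeBuilder.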

For the parser, we can go in steps:

+ Basic parser; this is the step that a "beginning CS student can do in a
week." All it does is read a stream, return the pieces, and check that the
file is well-formed. It checks for paired-tags, single-tags, processing tags,
the initial XML tag, attributes, character data (PCDATA and CDATA), basic
escaped entities, and white space within and between tags. (A
white-space-handling policy is specified in the validation stage, so the
basic parser can't know it and has to return all spaces to be safe.)

+ Namespace support; it is not in the basic parser since namespaces look
like tag names and special attributes. The basic parser doesn't need a
namespace cross-reference table. (Namespace stuff has a colon in the name;
what happens if a tag or attribute name has multiple colons? It's probably
bad from an XML & Namespace perspective, but it's still well-formed XML.)

+ Basic validation; using hand-coded classes or DTDs. DTDs have a syntax
that is _similar_ to XML, but still needs a different parser. So hand-coded
validation objects would be easier for a starting point. Eventually, we
could parse a DTD into a new type of validation object. Someday we will have
to include support for finding DTDs, based on their location. (If we have
them cached, we don't have to open an Internet connection.)

+ SAX & DOM; once the basic stuff is analyzed, we can start making
converters from our C++ stuff to SAX or DOM. SAX should be easier, since
it's also an event-based model. We can use the events to build DOM's
memory-tree model.

+ XPath; it describes sections of an XML file. It doesn't use XML syntax,
but some of the later stuff does use XML syntax and uses XPath descriptors
in attributes.

+ Advanced validation; use XML-Schema, which is in XML syntax, to do more
precise validation.

+ Other meta-XML; XSLT, which is in XML syntax, converts XML files into
new syntax, which may include other XML files or HTML.
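
As a rough illustration of the first step in the list above, here is a sketch
of the tag-balancing part of a basic parser. The function names are made up,
and the simplifications are large: it assumes no '>' appears inside attribute
values or processing tags, and it ignores comments, CDATA sections, entities,
and the attributes themselves.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Reads the tag name starting at `pos`, stopping at white space, '/' or '>'.
static std::string read_name(const std::string& s, std::size_t pos) {
    std::size_t end = pos;
    while (end < s.size() && s[end] != ' ' && s[end] != '\t' &&
           s[end] != '\n' && s[end] != '\r' && s[end] != '/' && s[end] != '>')
        ++end;
    return s.substr(pos, end - pos);
}

// Returns true if every paired-tag is closed in the right order.
bool tags_balanced(const std::string& xml) {
    std::vector<std::string> open;  // names of the elements still open
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        std::size_t close = xml.find('>', pos);
        if (close == std::string::npos) return false;  // unterminated tag
        if (close == pos + 1) return false;            // "<>" has no name
        char first = xml[pos + 1];
        if (first == '?') {
            // Processing tag: must end with "?>".
            if (close < pos + 3 || xml[close - 1] != '?') return false;
        } else if (first == '/') {
            // End tag: must match the most recently opened element.
            std::string name = read_name(xml, pos + 2);
            if (open.empty() || open.back() != name) return false;
            open.pop_back();
        } else if (xml[close - 1] == '/') {
            // Single-tag such as <b/>: nothing to remember.
        } else {
            // Start tag of a paired-tag.
            open.push_back(read_name(xml, pos + 1));
        }
        pos = close + 1;
    }
    return open.empty();  // anything left open means a missing end tag
}
```

A real basic parser would report each piece to the caller instead of just
returning a yes/no answer, but the stack of open names would stay the same.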

On another track, everyone seems to make XML-reading classes, but nobody
seems to make XML-writing classes (except indirectly through XSLT). Is
writing XML "so easy that anyone can wing it"? Helper classes could still
be nice (making sure paired-tags are balanced, etc.).
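
A minimal sketch of what such a writing helper might look like. XmlWriter and
demo are hypothetical names, not existing classes; attributes, name checking,
and indentation are left out. The writer remembers which elements are open, so
the end tag always matches, and anything still open is closed when the writer
goes away.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper that keeps paired-tags balanced while writing.
class XmlWriter {
public:
    explicit XmlWriter(std::ostream& out) : out_(out) {}
    XmlWriter(const XmlWriter&) = delete;
    XmlWriter& operator=(const XmlWriter&) = delete;

    // Close whatever is still open so the output stays balanced.
    ~XmlWriter() { while (!open_.empty()) end_element(); }

    void start_element(const std::string& name) {
        out_ << '<' << name << '>';
        open_.push_back(name);
    }

    // Writes character data, escaping the characters that would break markup.
    void text(const std::string& s) {
        for (char c : s) {
            switch (c) {
                case '<': out_ << "&lt;";  break;
                case '>': out_ << "&gt;";  break;
                case '&': out_ << "&amp;"; break;
                default:  out_ << c;
            }
        }
    }

    // Closes the most recently opened element; the caller never spells the name.
    void end_element() {
        if (open_.empty()) return;
        out_ << "</" << open_.back() << '>';
        open_.pop_back();
    }

private:
    std::ostream& out_;
    std::vector<std::string> open_;  // names of the elements still open
};

// Usage sketch: only <b> is closed by hand; the destructor closes <a>.
std::string demo() {
    std::ostringstream os;
    {
        XmlWriter w(os);
        w.start_element("a");
        w.start_element("b");
        w.text("1 < 2");
        w.end_element();
    }
    return os.str();  // "<a><b>1 &lt; 2</b></a>"
}
```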

3. Unicode

Hopefully, only the basic XML parser should be altered when we transition to
our Unicode library.

Taking a quick look at the Unicode web pages, version 3.1 contains 94,140
characters. If there are no gaps, we currently need a 17-bit unsigned
number to hold all values. We could use something like what's currently
being reviewed (hint, hint) in "dlw_int.zip" to find the proper built-in
unsigned type to represent Unicode characters. I had my sample library in
"dlw_uc.zip" wrap the type in a structure so we don't have to worry about
typedef clashes.
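
A sketch of the idea, not the contents of either zip file: count the bits
needed for the largest value, pick the smallest standard unsigned type with
that many bits, and wrap it in a structure. The names bits_for_value,
least_unsigned, and unicode_char are made up for illustration. Note that the
17-bit figure rests on the no-gaps assumption; the Unicode code space itself
runs to U+10FFFF, which takes 21 bits.

```cpp
#include <limits>
#include <type_traits>

// Number of bits needed to represent `v` (0 needs none).
constexpr int bits_for_value(unsigned long v) {
    return v == 0 ? 0 : 1 + bits_for_value(v >> 1);
}

// Smallest standard unsigned type with at least `Bits` value bits.
template <int Bits>
struct least_unsigned {
    static_assert(Bits > 0 &&
                  Bits <= std::numeric_limits<unsigned long long>::digits,
                  "no built-in unsigned type is wide enough");
    using type =
        typename std::conditional<
            (Bits <= std::numeric_limits<unsigned char>::digits), unsigned char,
        typename std::conditional<
            (Bits <= std::numeric_limits<unsigned short>::digits), unsigned short,
        typename std::conditional<
            (Bits <= std::numeric_limits<unsigned int>::digits), unsigned int,
        typename std::conditional<
            (Bits <= std::numeric_limits<unsigned long>::digits), unsigned long,
            unsigned long long>::type>::type>::type>::type;
};

// Wrapping the type in a structure gives it its own identity, so it cannot
// clash with a typedef of the same built-in type somewhere else.
struct unicode_char {
    // 94,140 characters numbered from 0 with no gaps: largest value is 94,139.
    typedef least_unsigned<bits_for_value(94140UL - 1)>::type value_type;
    value_type value;
};
```

Switching to the full code space would only mean changing the constant passed
to bits_for_value to 0x10FFFF; the rest stays as it is.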

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT mac DOT com
