From: Greg Colvin (gcolvin_at_[hidden])
Date: 2001-09-26 20:23:36


Before we reinvent the XML parser wheel, perhaps we should look at
the C++ version of Xerces:

http://xml.apache.org/xerces-c/index.html

From: Daryle Walker <darylew_at_[hidden]>
> You use some spare money to start reading up on XML, and look what
> happens....
>
> 1. Regression
>
> I suggested using XML as the output for the regression tests. We could do a
> quick-and-dirty printing of XML code with the current HTML code, but I don't
> think it's worth it. Eventually, we could output just XML and use an XSLT
> conversion to HTML, but that's a long way off.
>
> 2. XML
>
> For now we can assume that the XML files just contain the basic C++
> characters. This is convenient for the popular modern platforms, since they
> all grok ASCII (but not necessarily Latin-1 or Unicode) and the basic C++
> characters are an ASCII subset. People with EBCDIC will be able to use the
> initial library too, but will get problems when we start the Unicode
> transition.
>
> We could use an event-based model, so we don't have to be limited by memory.
> Also, a memory-tree model can be built over an event-based model, but not
> the other way around.
>
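As an illustration of the point that a memory-tree model can sit on top of an event-based one, here is a minimal sketch. The names EventHandler, Node, TreeBuilder and emit_sample are hypothetical and belong to no existing library; emit_sample merely stands in for a real parser by firing the events for <a><b>hi</b><c/></a>.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event interface (illustrative names, not the SAX API).
struct EventHandler {
    virtual ~EventHandler() = default;
    virtual void start_element(const std::string& name) = 0;
    virtual void text(const std::string& data) = 0;
    virtual void end_element(const std::string& name) = 0;
};

// A very small in-memory tree node.
struct Node {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Builds a tree purely by listening to events. It keeps a stack of the
// elements that are currently open; the bottom entry is a synthetic root.
class TreeBuilder : public EventHandler {
public:
    TreeBuilder() {
        root_.name = "#document";
        open_.push_back(&root_);
    }
    // open_ points into root_, so copying or moving would leave it dangling.
    TreeBuilder(const TreeBuilder&) = delete;
    TreeBuilder& operator=(const TreeBuilder&) = delete;

    void start_element(const std::string& name) override {
        auto child = std::make_unique<Node>();
        child->name = name;
        Node* raw = child.get();
        open_.back()->children.push_back(std::move(child));
        open_.push_back(raw);
    }
    void text(const std::string& data) override {
        open_.back()->text += data;
    }
    void end_element(const std::string& name) override {
        // Assumes the event source has already checked well-formedness.
        assert(open_.size() > 1 && open_.back()->name == name);
        open_.pop_back();
    }
    const Node& document() const { return root_; }

private:
    Node root_;
    std::vector<Node*> open_;
};

// Stand-in for a parser: emits the events for <a><b>hi</b><c/></a>.
inline void emit_sample(EventHandler& handler) {
    handler.start_element("a");
    handler.start_element("b");
    handler.text("hi");
    handler.end_element("b");
    handler.start_element("c");
    handler.end_element("c");
    handler.end_element("a");
}
```

The builder needs memory proportional to the document, while a handler that only counts or filters elements would not, which is why the event layer is the natural lower level.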
> For the parser, we can go in steps:
>
> + Basic parser; this is the step that a "beginning CS student can do in a
> week." All it does is read a stream, return the pieces, and check for
> well-formed files. It checks for paired-tags, single-tags, processing tags,
> the initial XML tag, attributes, character data (PCDATA and CDATA), basic
> escaped entities, and white space within and between tags. (A
> white-space-handling policy is specified in the validation stage, so the
> basic parser can't know it and has to return all spaces to be safe.)
>
> + Namespace support; it is not in the basic parser since namespaces look
> like tag names and special attributes. The basic parser doesn't need a
> namespace cross-reference table. (Namespace stuff has a colon in the name;
> what happens if a tag or attribute name has multiple colons? It's probably
> bad from an XML & Namespace perspective, but it's still well-formed XML.)
>
> + Basic validation; using hand-coded classes or DTDs. DTDs have a syntax
> that is _similar_ to XML, but still needs a different parser. So hand-coded
> validation objects would be easier for a starting point. Eventually, we
> could parse a DTD into a new type of validation object. Someday we will have to
> include support for finding DTDs, based on their location. (If we have them
> cached, we don't have to open an Internet connection.)
>
> + SAX & DOM; once the basic stuff is analyzed, we can start making
> converters from our C++ stuff to SAX or DOM. SAX should be easier, since
> it's also an event-based model. We can use the events to build DOM's
> memory-tree model.
>
> + XPath; it describes sections of an XML file. It doesn't use XML syntax,
> but some of the later stuff does use XML syntax and uses XPath descriptors
> in attributes.
>
> + Advanced validation; use XML-Schema, which is in XML syntax, to do more
> precise validation.
>
> + Other meta-XML; XSLT, which is in XML syntax, converts XML files into
> new syntax, which may include other XML files or HTML.
>
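On the multiple-colon question in the namespace step: the Namespaces in XML recommendation defines a qualified name as an optional prefix plus a local part, neither of which may contain a colon. A name such as a:b:c therefore passes a plain XML 1.0 parser but fails a namespace-aware layer. A minimal sketch of that layer's name check follows; QName and split_qname are hypothetical names, C++17 is assumed for std::optional, and the sketch does not validate the characters of either part.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical result type: prefix is empty for unprefixed names.
struct QName {
    std::string prefix;
    std::string local;
};

// Splits a raw tag or attribute name as delivered by the basic parser.
// Returns std::nullopt when the name cannot be a namespace-qualified name.
inline std::optional<QName> split_qname(const std::string& name) {
    if (name.empty()) return std::nullopt;

    const auto first = name.find(':');
    if (first == std::string::npos) {
        return QName{"", name};  // no prefix at all
    }
    if (name.find(':', first + 1) != std::string::npos) {
        return std::nullopt;     // more than one colon
    }
    if (first == 0 || first + 1 == name.size()) {
        return std::nullopt;     // empty prefix or empty local part
    }
    return QName{name.substr(0, first), name.substr(first + 1)};
}
```

Keeping this check out of the basic parser matches the layering described above: the basic parser hands over the raw name, and only the namespace layer decides whether it is acceptable.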
> On another track, everyone seems to make XML-reading classes, but nobody
> seems to make XML-writing classes (except indirectly through XSLT). Is
> writing XML "so easy that anyone can wing it"? Helper classes could still
> be nice (making sure paired-tags are balanced, etc.).
>
> Unicode
>
> Hopefully, only the basic XML parser should be altered when we transition to
> our Unicode library.
>
> Taking a quick look at the Unicode web pages, version 3.1 contains 94,140
> characters. If there are no gaps, we currently need a 17-bit unsigned
> number to hold all values. We could use something like what's currently
> being reviewed (hint, hint) in "dlw_int.zip" to find the proper built-in
> unsigned type to represent Unicode characters. I had my sample library in
> "dlw_uc.zip" wrap the type in a structure so we don't have to worry about
> typedef clashes.
>
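The 17-bit figure can be checked with a few lines of arithmetic. Note that it holds only under the stated no-gaps assumption: the Unicode code space runs up to U+10FFFF, and 3.1 assigns characters in the supplementary planes, so covering actual code points takes 21 bits. The sketch below assumes C++14 for the constexpr loop; bits_for_count and UnicodeChar are hypothetical names, and std::uint_least32_t stands in for whatever a type-selection library would pick.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Number of bits needed to represent every value in [0, count - 1].
constexpr unsigned bits_for_count(std::uint32_t count) {
    unsigned bits = 0;
    for (std::uint32_t max = (count == 0 ? 0 : count - 1); max != 0; max >>= 1) {
        ++bits;
    }
    return bits;
}

// 94,140 densely packed characters fit in 17 bits...
static_assert(bits_for_count(94140) == 17, "dense numbering needs 17 bits");
// ...but the full code space 0..0x10FFFF needs 21.
static_assert(bits_for_count(0x110000) == 21, "code space needs 21 bits");

// Wrapping the representation in a struct gives it a distinct type, so it
// cannot clash with other typedefs of the same built-in integer.
struct UnicodeChar {
    std::uint_least32_t value;
};

static_assert(sizeof(std::uint_least32_t) * CHAR_BIT >= 21,
              "representation must hold any code point");
```

Because UnicodeChar is a distinct type, overloads taking it stay separate from overloads taking plain unsigned integers.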
> --
> Daryle Walker
> Mac, Internet, and Video Game Junkie
> darylew AT mac DOT com

