From: Robert Ramey (ramey_at_[hidden])
Date: 2005-07-21 10:34:14


Jonathan Wakely wrote:
> So stop using XML. If you're not going to write well-formed XML
> (which means no illegal control characters etc.) then why bother writing
> XML? XML is verbose, inefficient and has a number of complicated
> details. Its main advantage is interoperability and the availability
> of compatible tools. If you produce non-well-formed XML then you
> can't use any existing tools, so you've invented your own markup
> language with most of the drawbacks of XML and none of the
> advantages!

I agree - that's why I don't use it.

> IMHO what you should do is produce well-formed XML.

That's what we're trying to do.

>> I would hope that some smart person can find the sentence, in the
>> paragraph, on the page, in the chapter of the relevant document
>> which can deal with this in some sort of conforming way.
>
> Either:
>
> 1) Store all strings in a hexadecimal or base64 representation. This
> allows any arbitrary sequence of bytes to be mapped to a portable
> subset of ASCII characters.

That's the way the first version worked - a lot of people were unhappy with
it.

> 2) Store strings normally, unless they contain invalid characters, in
> which case put the string in a <hex> or <base64> element and use
> hex/base64 to store the string.

A worthy suggestion.

> The advantage of 1) is consistency. The advantage of 2) is human
> readability for most strings - only unrepresentable ones are not human
> readable.

Agreed. The fundamental problem is that a std::basic_string can hold data
that cannot be represented in an XML string.
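
To make suggestion 2) concrete, here is a minimal sketch, not what the library does today. All the function names are hypothetical, and the <hex> element name is simply taken from the suggestion above. It assumes XML 1.0, where the only control bytes allowed in character data are tab, LF and CR. Bytes of 0x80 and above are passed through untouched, since whether they are legal depends on the document encoding.

```cpp
#include <string>

// Hypothetical helper: may this byte appear in XML 1.0 character data?
// Only the C0 control range is rejected (everything below 0x20 except
// tab, LF and CR). Bytes >= 0x80 are accepted here, which is an
// assumption of this sketch: their validity depends on the encoding.
inline bool xml10_safe_byte(unsigned char c) {
    return c >= 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
}

// True if every byte of s passes the check above.
inline bool xml10_safe(const std::string& s) {
    for (unsigned char c : s)
        if (!xml10_safe_byte(c)) return false;
    return true;
}

// Map arbitrary bytes to lowercase hexadecimal digits, two per byte.
inline std::string to_hex(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(s.size() * 2);
    for (unsigned char c : s) {
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// Escape the markup characters so plain text cannot be mistaken for tags.
inline std::string escape_xml(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;
        }
    }
    return out;
}

// Store the string as text when possible, otherwise inside a <hex> element.
inline std::string encode_string(const std::string& s) {
    if (xml10_safe(s)) return escape_xml(s);
    return "<hex>" + to_hex(s) + "</hex>";
}
```

Because ordinary text has its '<' escaped, a string whose content happens to be "<hex>..." cannot be confused with the fallback element. A reader would of course need the matching decode step, which is omitted here.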

> Do you turn all strings to UTF-8?

Currently it works like this:

a) std::string values are written to the XML file using the current stream
locale. Actually, I use a "null" codecvt facet to work around the fact that
the standard facet molests the input/output string.

b) std::wstring values are converted to UTF-8 using a stream codecvt facet.

The library would permit any codecvt facet to be used. (Hmm - this might be
the place to permit the user to insert his own decision about how to deal
with this problem. The more I think about this - the more I like it.)
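
For illustration, a "null" facet for narrow streams and the way a facet gets into a stream's locale could look roughly like this. It is a sketch under assumptions, not the library's actual code: null_codecvt and with_null_codecvt are hypothetical names, and only standard <locale> calls are used.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical "null" facet: tells the stream buffer that bytes pass
// through without any conversion.
class null_codecvt : public std::codecvt<char, char, std::mbstate_t> {
public:
    explicit null_codecvt(std::size_t refs = 0)
        : std::codecvt<char, char, std::mbstate_t>(refs) {}

protected:
    bool do_always_noconv() const noexcept override { return true; }
};

// Build a locale that is a copy of `base` except for the codecvt facet.
// The locale takes ownership of the facet (refs == 0), so nothing is
// deleted by hand.
inline std::locale with_null_codecvt(const std::locale& base) {
    return std::locale(base, new null_codecvt);
}
```

A stream would pick this up with something like `os.imbue(with_null_codecvt(os.getloc()))` before any output is written. A user who wants a different policy would pass a different facet to the same std::locale constructor.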

> I think there is a strong argument for not
> doing anything encoding-related to strings, just store the bytes
> exactly as they are, unless that would produce an invalid XML doc, in
> which case use hex or base64. Otherwise you impose a semantic
> meaning on the bytes in a std::string that may not be present, namely
> "this string contains text data that can be stored in an XML text
> node". C++ allows ANY bytes in a std::string and does not require
> those bytes to form a valid UTF-8 string, or a valid ASCII string, or
> any other restriction.

We're in agreement here as well. I very much want to maintain the
independence of the archive from the serialization. This means that the
serialization of data is not in any way dependent on the type of archive to
be used.

Robert Ramey


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk