From: Jonathan Wakely (cow_at_[hidden])
Date: 2005-07-21 03:11:07


On Wed, Jul 20, 2005 at 12:28:40PM -0700, Robert Ramey wrote:

> FWIW I personally would like options 1 - use &#0 anyway - basically because
> it would preserve the idea that an xml_archive can do anything any other
> archive can do and doesn't ripple XML - ness back into the library or user
> programs. But even this is not so trivial. Its not clear to me whether it
> should apply to all non-printable character. This then raises the issue of
> what is non-printable in a UTF context. Then it makes me wonder what the
> "encoding" attribute in XML is for in a UTF file. This is a perfect example
> how something that seems simple at first glance turns in to a really time
> consuming issue.

You're confusing "characters that are allowed in XML" with "encoding
used to represent characters". &copy; is a character entity which has
NOTHING to do with encoding. It represents the same character whatever
encoding your document is stored in. Similarly, &#0; and other numeric
character references to characters outside XML's allowed range are not
permitted in an XML document, irrespective of the encoding.

Character encoding is to do with how an XML file is stored on disk.
Whether you can have '\0' in an XML file is to do with the semantic
content of the XML document. These issues are unrelated.
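
To make the distinction concrete, here is a minimal sketch of the "Char"
production from the XML 1.0 spec. The name is_xml10_char is made up for
illustration and is not part of any library. It works on Unicode code
points, so the answer is the same whatever encoding the file is stored in.

```cpp
#include <cassert>

// Returns true if the Unicode code point cp may appear in an XML 1.0
// document (as a literal character or as a numeric reference).
// Hypothetical helper, not part of Boost.Serialization.
inline bool is_xml10_char(unsigned long cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Usage sketch.
inline void is_xml10_char_demo()
{
    assert(!is_xml10_char(0x0));  // NUL is never allowed, so &#0; is an error
    assert(is_xml10_char(0x9));   // tab is one of the three allowed controls
    assert(is_xml10_char(0xA9));  // U+00A9, the copyright sign
}
```

Nothing in that function mentions UTF-8 or ISO-8859-1, which is the point.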

>
> I've never warmed up to XML myself. I learned enough of the details to
> implement xml_?archive but I still never learned to like it. The only thing
> I've found it useful for is checking that load/save functions match. The
> xml_archive classes check that the end tag is found in the right place and
> in fact matches the start tag so any difference in the save / load functions
> throws an exception. So if I have an obscure problem I test using
> xml_archive.
>
> Other than the above, the only utility I can see for the xml_?archive is as
> some sort of bridge to the "outside world". That's why I set aside the
> original string representation - as a sequence of numbers - in favor of the
> current one - a text string. The mismatch between what std::string does and
> xml text data does is the source of the problem.

So stop using XML. If you're not going to write well-formed XML (which
means no &#0; or references to other disallowed control characters)
then why bother writing XML? XML is verbose, inefficient and has a
number of complicated details. Its main advantage is interoperability
and the availability of compatible tools. If you produce
non-well-formed XML then you can't use any existing tools, so you've
invented your own markup language with most of the drawbacks of XML
and none of the advantages!

IMHO what you should do is produce well-formed XML.

> I would hope that some smart person can find the sentence, in the paragraph,
> on the page, in the chapter of the relevant document which can deal with
> this is some sort of comforming way.

Either:

1) Store all strings in a hexadecimal or base64 representation. This
allows any arbitrary sequence of bytes to be mapped to a portable subset
of ASCII characters.

2) Store strings normally, unless they contain invalid characters, in
which case put the string in a <hex> or <base64> element and use
hex/base64 to store the string.

The advantage of 1) is consistency. The advantage of 2) is human
readability for most strings - only unrepresentable ones are not human
readable.
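
A rough sketch of both options, using hex rather than base64 to keep it
short. All the names here (to_hex, is_forbidden_byte, to_xml_text) and the
exact <hex> element layout are my own assumptions, not anything the
serialization library does today. The sketch only checks for forbidden
control bytes. It does not validate bytes >= 0x80 against the document's
declared encoding, and it ignores the line-end normalization an XML
parser applies to a literal CR.

```cpp
#include <cassert>
#include <string>

// Option 1: map every byte to two lowercase hex digits.
inline std::string to_hex(const std::string& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Control bytes other than tab, LF and CR cannot appear in XML 1.0 text.
inline bool is_forbidden_byte(unsigned char b)
{
    return b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
}

// Option 2: escaped plain text when possible, a <hex> element otherwise.
inline std::string to_xml_text(const std::string& s)
{
    std::string escaped;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (is_forbidden_byte(static_cast<unsigned char>(s[i])))
            return "<hex>" + to_hex(s) + "</hex>";  // fall back for the whole string
        switch (s[i]) {
            case '<': escaped += "&lt;";  break;
            case '>': escaped += "&gt;";  break;
            case '&': escaped += "&amp;"; break;
            default:  escaped += s[i];
        }
    }
    return escaped;
}

// Usage sketch.
inline void to_xml_text_demo()
{
    assert(to_xml_text("a<b") == "a&lt;b");
    assert(to_xml_text(std::string("a\0b", 3)) == "<hex>610062</hex>");
}
```

With option 1 you would call to_hex on every string unconditionally. With
option 2 the reader has to look at the element name to know whether to
decode.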

I am completely unfamiliar with the serialization library and its XML
format. Do you turn all strings to UTF-8? That seems wrong to me. If
I give you a std::string with the bytes that map to an ISO-8859-1 string
do you re-encode that as UTF-8 using e.g. iconv? What if I give you a
std::string containing bytes that map to a UTF-8 string? Do you
re-encode that? I think there is a strong argument for not doing
anything encoding-related to strings, just store the bytes exactly as
they are, unless that would produce an invalid XML doc, in which case
use hex or base64. Otherwise you impose a semantic meaning on the
bytes in a std::string that may not be present, namely "this string
contains text data that can be stored in an XML text node". C++ allows
ANY bytes in a std::string and does not require those bytes to form a
valid UTF-8 string, or a valid ASCII string, or any other restriction.
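
As a small illustration of that last point (the helper names are made up
for the example), both of these are perfectly good std::string values,
and neither is text you could drop into an XML text node as-is:

```cpp
#include <cassert>
#include <string>

// The length is passed explicitly so the embedded '\0' byte is kept.
inline std::string with_embedded_nul() { return std::string("a\0b", 3); }

// The bytes 0xFF and 0xFE never occur in well-formed UTF-8,
// yet std::string stores them without complaint.
inline std::string not_utf8() { return std::string("\xFF\xFE", 2); }

// Usage sketch.
inline void raw_bytes_demo()
{
    const std::string s = with_embedded_nul();
    assert(s.size() == 3);                       // all three bytes are stored
    assert(s[1] == '\0');
    assert(std::string(s.c_str()).size() == 1);  // a C-string copy stops at the NUL

    const std::string t = not_utf8();
    assert(static_cast<unsigned char>(t[0]) == 0xFF);
}
```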

jon

-- 
"What I tell you three times is true"
	- The Hunting of the Snark
