From: Robert Ramey (ramey_at_[hidden])
Date: 2005-07-20 14:28:40
This is a great email. It illustrates why I tend to drag my feet on things
like this. This is not going to be addressed right away so feel free to
investigate and discuss it.
FWIW I personally would like option 1 - use &#0; anyway - basically because
it would preserve the idea that an xml_archive can do anything any other
archive can do and doesn't ripple XML-ness back into the library or user
programs. But even this is not so trivial. It's not clear to me whether it
should apply to all non-printable characters. This then raises the issue of
what is non-printable in a UTF context. Then it makes me wonder what the
"encoding" attribute in XML is for in a UTF file. This is a perfect example
of how something that seems simple at first glance turns into a really
time-consuming issue.
I've never warmed up to XML myself. I learned enough of the details to
implement xml_?archive but I still never learned to like it. The only thing
I've found it useful for is checking that load/save functions match. The
xml_archive classes check that the end tag is found in the right place and
in fact matches the start tag so any difference in the save / load functions
throws an exception. So if I have an obscure problem I test using
xml_archive.
Other than the above, the only utility I can see for the xml_?archive is as
some sort of bridge to the "outside world". That's why I set aside the
original string representation - as a sequence of numbers - in favor of the
current one - a text string. The mismatch between what std::string does and
xml text data does is the source of the problem.
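To make the mismatch concrete, here is a minimal sketch, assuming a byte-oriented std::string and the XML 1.0 "Char" production (tab, LF, CR, and everything from 0x20 upward in the single-byte range). The helper names is_xml10_byte, first_unrepresentable and example_embedded_null are hypothetical and are not part of Boost.Serialization; bytes of 0x80 and above are simply assumed to belong to valid multi-byte text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not a Boost.Serialization function.
// True when the byte may appear in XML 1.0 character data.
// Assumption: bytes >= 0x80 are part of valid multi-byte text.
inline bool is_xml10_byte(unsigned char c) {
    return c == 0x09 || c == 0x0A || c == 0x0D || c >= 0x20;
}

// Index of the first byte XML 1.0 cannot carry, or std::string::npos
// when the whole string is representable.
inline std::size_t first_unrepresentable(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_xml10_byte(static_cast<unsigned char>(s[i]))) return i;
    }
    return std::string::npos;
}

// std::string stores an embedded null without complaint; XML text cannot.
inline void example_embedded_null() {
    const std::string s("bar\0bas", 7);  // length given explicitly
    assert(s.size() == 7);
    assert(first_unrepresentable(s) == 3);
}
```

A check along these lines could back either a "reject" policy or a "switch to another encoding" policy; which one the library should adopt is exactly the open question.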
I would hope that some smart person can find the sentence, in the paragraph,
on the page, in the chapter of the relevant document which can deal with
this in some sort of conforming way.
Good Luck
Robert Ramey
Eelis van der Weegen wrote:
> Jonathan Wakely wrote:
>> It's not a valid entity, using it means your XML is
>> not well-formed. It doesn't matter whether you say &#0; or &#x0;
>> (the decimal and hexadecimal forms are exactly equivalent - but 0 is
>> still not a valid numerical entity.)
>
> Yes, in XML 1.1, the null character is a special case by itself;
> ordinary nonprintable characters can be embedded as numerical
> character references, but the null character cannot (see the "Legal
> Character" well-formedness constraint for production 66).
>
>> As long as you can read the same data back and restore the same
>> sequence of bytes it doesn't really matter.
>
> I strongly agree with Robert that further processing of generated XML
> archives by external tools is one of the main strengths of XML
> archives and should be the main concern when evaluating our options
> when it comes to dealing with this problem. That said, I see the
> following options:
>
> 1. Use &#0; anyway.
>
> I've googled around a bit and found that &#0;'s being generated by
> one tool in a toolchain and rejected by the next is a reasonably
> common problem, so I don't really like this option.
>
> 2. Encode it using some escape sequence: <foo>bar\0bas</foo>
>
> This would introduce an extra grammar layer that software used for
> further processing must parse.
>
> 3. Encode it using a dedicated element:
> <foo>bar<serialization:null/>bas</foo>
>
> This seems like a reasonable way to encode null characters, but
> wouldn't work in attribute values.
>
> 4. Encode strings containing null characters using binary encodings
> such as those defined by XML Schema's data types:
>
> http://www.w3.org/TR/xmlschema-2/#base64Binary
> http://www.w3.org/TR/xmlschema-2/#hexBinary
>
> This would require some additional flag that indicates whether a
> string is encoded textually or binary (unless of course all strings
> are encoded this way, but then we'd lose the human-readability of
> strings in XML archives).
>
> 5. Disallow serialization of std::(w)strings that contain null
> characters to XML archives.
>
> This is my personal favorite. XML's normal character data is simply
> inherently textual and not suited to storing binary data containing
> null characters. We shouldn't try to hack around this. Doing so would
> only make things complicated in further external processing. If users
> insist on storing binary fragments in their XML archives they can
> always resort to vector<char> (by the way, the binary encodings I
> mentioned above might be very nice for storing things like
> vector<char> efficiently).
>
> Eelis
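As an illustration of option 4 above, the following is a rough sketch of a hexBinary-style round trip for a string with an embedded null. It is not how Boost.Serialization encodes strings; the names to_hex_binary, from_hex_binary, hex_value and hex_round_trip_example are hypothetical, and the extra flag that would mark a string as binary-encoded is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helpers, not part of Boost.Serialization.
// Encode every byte as two uppercase hex digits.
inline std::string to_hex_binary(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        out.push_back(digits[c >> 4]);
        out.push_back(digits[c & 0x0F]);
    }
    return out;
}

// Value of one hex digit; throws on anything else.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Decode pairs of hex digits back into the original bytes.
inline std::string from_hex_binary(const std::string& hex) {
    if (hex.size() % 2 != 0) throw std::invalid_argument("odd length");
    std::string out;
    out.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        const int byte = hex_value(hex[i]) * 16 + hex_value(hex[i + 1]);
        out.push_back(static_cast<char>(byte));
    }
    return out;
}

// The embedded null survives, at the cost of human-readability.
inline void hex_round_trip_example() {
    const std::string original("bar\0bas", 7);
    const std::string encoded = to_hex_binary(original);
    assert(encoded == "62617200626173");
    assert(from_hex_binary(encoded) == original);
}
```

The encoded form contains only characters that are legal in both element content and attribute values, which is the appeal of this option; the readability loss mentioned above is visible in the output.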