From: Eelis van der Weegen (gmane_at_[hidden])
Date: 2005-07-20 12:31:11


Jonathan Wakely wrote:
> It's not a valid entity, using it means your XML is
> not well-formed. It doesn't matter whether you say &#0; or &#x0; (the
> decimal and hexadecimal forms are exactly equivalent - but 0 is still
> not a valid numerical entity.)

Yes, in XML 1.1 the null character is a special case: other nonprintable
characters can be embedded as numeric character references, but the null
character cannot (see the "Legal Character" well-formedness constraint on
production 66, CharRef).
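
To make the constraint concrete, here is a minimal sketch of the test a
character reference has to pass. The function name is made up for illustration;
the ranges are those of the Char production in XML 1.1.

```cpp
// Hypothetical helper, not part of any library.
// XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Code point 0 falls outside every range, so &#0; and &#x0; are rejected,
// while other control characters such as &#x1; pass.
constexpr bool is_xml11_char(char32_t c)
{
    return (c >= 0x1 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}
```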

> As long as you can read the same data back and restore the same sequence
> of bytes it doesn't really matter.

I strongly agree with Robert that further processing of generated XML archives
by external tools is one of the main strengths of XML archives, and that it
should be the main concern when evaluating our options for dealing with this
problem. That said, I see the following options:

1. Use &#0; anyway.

I've googled around a bit and found that &#0;'s being generated by one tool in a
toolchain and rejected by the next is a reasonably common problem, so I don't
really like this option.

2. Encode it using some escape sequence: <foo>bar\0bas</foo>

This would introduce an extra grammar layer that software used for further
processing must parse.
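
A rough sketch of what that extra layer could look like, assuming a scheme in
which a null becomes the two characters \0 and a literal backslash is doubled
(both function names are hypothetical, and XML escaping of < and & would still
have to happen separately):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical escaping scheme: '\0' is written as backslash + '0', and a
// literal backslash is doubled so that decoding stays unambiguous.
std::string escape_nulls(const std::string& in)
{
    std::string out;
    for (char c : in) {
        if (c == '\0')      out += "\\0";
        else if (c == '\\') out += "\\\\";
        else                out += c;
    }
    return out;
}

// Inverse of escape_nulls; rejects any escape sequence it did not produce.
std::string unescape_nulls(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; }
        if (++i == in.size()) throw std::invalid_argument("dangling backslash");
        if (in[i] == '0')       out += '\0';
        else if (in[i] == '\\') out += '\\';
        else throw std::invalid_argument("unknown escape sequence");
    }
    return out;
}
```

Every external tool that wants the original string would need its own
equivalent of unescape_nulls, which is exactly the cost described above.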

3. Encode it using a dedicated element: <foo>bar<serialization:null/>bas</foo>

This seems like a reasonable way to encode null characters, but wouldn't work in
attribute values.
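
As a minimal sketch of the writing side, assuming the serialization prefix is
bound to some namespace elsewhere in the archive (the function name is
hypothetical, and this is not how the library currently writes strings):

```cpp
#include <string>

// Hypothetical writer for element content: nulls become an empty marker
// element, and the characters XML reserves in content are escaped as usual.
std::string encode_element_content(const std::string& in)
{
    std::string out;
    for (char c : in) {
        switch (c) {
            case '\0': out += "<serialization:null/>"; break;
            case '<':  out += "&lt;";  break;
            case '>':  out += "&gt;";  break;
            case '&':  out += "&amp;"; break;
            default:   out += c;
        }
    }
    return out;
}
```

The result is mixed content (text interleaved with elements), which is legal
inside an element but has no counterpart inside an attribute value.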

4. Encode strings containing null characters using binary encodings such as
those defined by XML Schema's data types:

   http://www.w3.org/TR/xmlschema-2/#base64Binary
   http://www.w3.org/TR/xmlschema-2/#hexBinary

This would require some additional flag that indicates whether a string is
encoded as text or as binary (unless of course all strings are encoded this way,
but then we'd lose the human-readability of strings in XML archives).
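
A minimal sketch of the hexBinary variant, with hypothetical function names.
needs_binary_encoding stands in for the decision behind the flag; how the flag
itself would be recorded in the archive is left open here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical policy: only strings containing a null lose readability.
bool needs_binary_encoding(const std::string& s)
{
    return s.find('\0') != std::string::npos;
}

// Two upper-case hex digits per byte, as in xs:hexBinary's canonical form.
std::string to_hex_binary(const std::string& bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Value of a single hex digit; accepts both cases when reading.
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    throw std::invalid_argument("not a hex digit");
}

std::string from_hex_binary(const std::string& hex)
{
    if (hex.size() % 2 != 0) throw std::invalid_argument("odd number of hex digits");
    std::string out;
    for (std::size_t i = 0; i < hex.size(); i += 2)
        out += static_cast<char>(hex_value(hex[i]) * 16 + hex_value(hex[i + 1]));
    return out;
}
```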

5. Disallow serialization of std::(w)strings that contain null characters to XML
archives.

This is my personal favorite. XML's normal character data is simply inherently
textual and not suited to storing binary data containing null characters. We
shouldn't try to hack around this. Doing so would only make things complicated
in further external processing. If users insist on storing binary fragments in
their XML archives they can always resort to vector<char> (by the way, the
binary encodings I mentioned above might be very nice for storing things like
vector<char> efficiently).
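
In code, this option amounts to a guard along the following lines. The function
name and the exception type are assumptions made for this sketch, not anything
the library provides; a std::wstring overload would look the same.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical check an XML output archive could run before writing a string.
void require_no_nulls(const std::string& s)
{
    if (s.find('\0') != std::string::npos)
        throw std::invalid_argument(
            "string contains a null character and cannot be stored "
            "as XML character data");
}
```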

Eelis

