From: Jonathan Wakely (cow_at_[hidden])
Date: 2005-07-18 13:50:54
On Mon, Jul 18, 2005 at 09:48:11AM -0700, Robert Ramey wrote:
> Jonathan Wakely wrote:
> > On Mon, Jul 18, 2005 at 08:11:31AM -0700, Robert Ramey wrote:
> >
> >> Hmm, I've twiddled with the set of allowable characters from time to
> >> time on sort of an ad hoc basis. For some reason it never occurred
> >> to me to actually try and find the definitive source for this. So I
> >> suppose there are a couple
> >
> > Assuming you're referring to XML, it's here:
> > http://www.w3.org/TR/REC-xml
> >
> >> of pending fine points here:
> >>
> >> a) the exact rules for what characters are legal in which part of
> >> tag names. This might not be all that obvious given that the html
> >> can be coded in wide characters then to utf-8. Also the narrow
> >> character version is coded with the current locale so that's another
> >> story.
> >
> > A character is a character, how it is encoded is irrelevant.
>
> Thanks for the link.
>
> That's not obvious to me - especially when one is using a locale specific
> character set. Maybe XML requires that all characters be ucs-16 (or
> 32) or some such thing but as a practical matter lots of people are still
> using locale-specific types for strings. So it's not obvious what the
> implications are of including a '\0' as part of a text string in an xml
> archive. This is one of those things that seemed simple when I started but
> ran into a lot of small "gotchas" as time went on.
I agree that's a harder problem than just "can character X be used in
an element name" :-)
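The '\0' character is not valid anywhere in XML, in any encoding: the
Char production in the spec excludes it, and a character reference such
as &#0; is not allowed either. I don't know the reasoning but it means
you have to use some kind of alternative representation for data that
could contain NULs.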
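As a rough sketch of what "valid in XML" means at the character level, the
check below follows the Char production of XML 1.0 (section 2.2 of the spec
linked above). It assumes the text has already been decoded to Unicode code
points and a C++11 compiler; the names is_xml_char and is_xml_safe are made
up for illustration and are not part of Boost.

```cpp
#include <string>

// True if the code point matches the XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// NUL (0x0) falls outside every range, so it is rejected.
bool is_xml_char(char32_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// Hypothetical helper: true if every code point in s may appear in
// XML 1.0 character data (escaping of '<' and '&' is a separate issue).
bool is_xml_safe(const std::u32string& s)
{
    for (char32_t c : s) {
        if (!is_xml_char(c)) {
            return false;
        }
    }
    return true;
}
```

A string that fails is_xml_safe needs one of the alternative
representations before it can go into the archive.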
If you're talking about text strings with embedded NULs then you might
need to define an entity that can stand in for the NUL, so you can expand
it back to NUL when you recreate the string from the XML archive, or put
all strings that might contain NULs in an element like <hex> and
hex-encode the bytes. There might be other solutions too, but I've not
used them.
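The hex-encoding option could look something like the sketch below. It is
only an illustration, not how the serialization library does it: the
function names and the <hex> element name are assumptions, and it needs a
C++11 compiler.

```cpp
#include <stdexcept>
#include <string>

// Encode every byte as two lowercase hex digits, so a NUL byte becomes
// the harmless text "00".
std::string to_hex(const std::string& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char b : bytes) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}

// Value of a single hex digit; throws on any other character.
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Inverse of to_hex: rebuild the original bytes, embedded NULs included.
std::string from_hex(const std::string& hex)
{
    if (hex.size() % 2 != 0) {
        throw std::invalid_argument("odd number of hex digits");
    }
    std::string out;
    out.reserve(hex.size() / 2);
    for (std::string::size_type i = 0; i < hex.size(); i += 2) {
        int byte = hex_value(hex[i]) * 16 + hex_value(hex[i + 1]);
        out.push_back(static_cast<char>(byte));
    }
    return out;
}

// Hypothetical wrapper producing the element that goes into the archive.
std::string to_hex_element(const std::string& bytes)
{
    return "<hex>" + to_hex(bytes) + "</hex>";
}
```

For example, the three bytes 'a', NUL, 'b' would be written as
<hex>610062</hex>, and from_hex gives the original three bytes back. The
cost is that the text doubles in size and is no longer readable in the XML.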
jon