Boost :

From: Robert Ramey (ramey_at_[hidden])
Date: 2005-09-29 23:33:38

Next message: Tushar: "Re: [boost] Reason for Java's Success (was: GUI)"
Previous message: Robert Ramey: "Re: [boost] [serialization] A question about the implementation ofXML_name in basic_xml_oarchive.ipp"
In reply to: Simon Buchan: "Re: [boost] [serialization]A question about the implementation of XML_name in basic_xml_oarchive.ipp"

This is on the right track - but not quite there. This function checks
characters in xml_tag. Surely the tag can't contain a tag.

Note that there is another problem which has already come up. A std::string
can contain a character such as '\0' while an XML file cannot. When I was
making this xml_archive implementation I did consider storing all the
string variables as hex character data. I rejected this for a couple of
reasons.
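
To make the mismatch concrete, here is a minimal sketch of both halves of that paragraph: a check for bytes that XML 1.0 text cannot hold, and the hex encoding that was considered and rejected. All names (byte_allowed_in_xml, storable_as_xml_text, to_hex, from_hex) are hypothetical and are not part of the serialization library. The byte check assumes an ASCII-compatible encoding and looks only at control characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: may this byte appear as character data in an XML 1.0
// file? Only the control-character rule is checked, and the numeric values
// assume an ASCII-compatible encoding.
bool byte_allowed_in_xml(unsigned char c) {
    return c == 0x09 || c == 0x0A || c == 0x0D || c >= 0x20;
}

// True if every byte of s passes the check above.
bool storable_as_xml_text(const std::string& s) {
    for (unsigned char c : s) {
        if (!byte_allowed_in_xml(c)) return false;
    }
    return true;
}

const std::string hex_digits = "0123456789abcdef";

// Encode every byte as two hex digits, so that any std::string, including
// one with embedded '\0', becomes plain digits and letters.
std::string to_hex(const std::string& s) {
    std::string out;
    out.reserve(s.size() * 2);
    for (unsigned char c : s) {
        out.push_back(hex_digits[c >> 4]);
        out.push_back(hex_digits[c & 0x0F]);
    }
    return out;
}

// Inverse of to_hex. Expects an even number of lowercase hex digits.
std::string from_hex(const std::string& hex) {
    assert(hex.size() % 2 == 0);
    std::string out;
    out.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        const std::size_t hi = hex_digits.find(hex[i]);
        const std::size_t lo = hex_digits.find(hex[i + 1]);
        assert(hi != std::string::npos && lo != std::string::npos);
        out.push_back(static_cast<char>(hi * 16 + lo));
    }
    return out;
}
```

A string built as std::string("a\0b", 3) fails storable_as_xml_text but survives a to_hex / from_hex round trip. The cost is that the archive is no longer readable as text and doubles in size.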

There are a bunch of other issues here, many of which I forget. They have
much to do with the following and how they interact and map (or don't map)
to each other.

utf8
utf16
codecvt_facets
wchar
char
current(global) locale
xml characters which are legal/illegal
std::string legal characters - which is all of them, but that doesn't match
the above

etc.

I made some compromises to push things through to the end. If one wants the
best XML representation my advice is to use xml_w?archive (that is,
xml_woarchive or xml_wiarchive). This saves and loads UTF-8. On gcc
platforms the UTF-8 is mapped to 32(!) bit characters, on other machines to
16 bit characters. When saving / loading to std::string variables these are
mapped to the multi-byte character set of the current locale, which may be
different from the locale under which the archive was created. So if you
want to be really safe, use std::wstring throughout your code and be sure
that you don't have any prohibited characters in these strings.
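
One way to check a std::wstring for prohibited characters before saving it is sketched below. The range test follows the Char production of XML 1.0 quoted later in this message. The function names is_xml_char and first_prohibited are hypothetical. The sketch assumes each wchar_t holds one complete code point, which is true where wchar_t is 32 bits wide. With 16 bit wchar_t, characters above U+FFFF arrive as surrogate pairs and this simple version reports them as prohibited.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// XML 1.0 production [2] Char, applied to a single code point.
constexpr bool is_xml_char(std::uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: index of the first element that is not a legal XML
// character, or std::wstring::npos if the whole string is acceptable.
// Assumption: one wchar_t equals one code point (no surrogate-pair handling).
std::size_t first_prohibited(const std::wstring& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_xml_char(static_cast<std::uint32_t>(s[i]))) return i;
    }
    return std::wstring::npos;
}
```

Calling first_prohibited on each string before it goes into the archive turns a silently corrupt file into an error that can be reported at save time.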

Robert Ramey

P.S. EBCDIC

On an EBCDIC machine I would expect the following to be necessary.

One's xml_archive is in UTF-8 or UTF-16. The stream would be opened with a
codecvt facet that maps UTF-8 or ASCII, or whatever, to EBCDIC. The routine
in question should be altered to function with the local character set, be
it ASCII or EBCDIC.
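
A minimal sketch of a tag check that does not hard-code ASCII values is shown below. It is not the library's routine: the names is_tag_char and is_valid_tag are hypothetical, and the set of punctuation accepted ('_', '-', '.', ':') is an assumption made for illustration. Character literals are translated to the execution character set by the compiler, so the same source works under ASCII or EBCDIC. Note that std::isalnum depends on the current C locale and may accept more than the basic letters and digits outside the "C" locale.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for the per-character test discussed in this thread.
// No numeric character codes appear, so nothing here assumes ASCII.
bool is_tag_char(char c) {
    const unsigned char uc = static_cast<unsigned char>(c);
    return std::isalnum(uc) != 0 ||
           c == '_' || c == '-' || c == '.' || c == ':';
}

// True if the tag is non-empty and every character passes is_tag_char.
// The stricter XML rule for the first character of a name is not checked.
bool is_valid_tag(const std::string& tag) {
    if (tag.empty()) return false;
    for (char c : tag) {
        if (!is_tag_char(c)) return false;
    }
    return true;
}
```

With this version is_valid_tag("item-1") holds while is_valid_tag("<tag>") does not, which matches the point made at the top of this message that a tag cannot contain a tag.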

Simon Buchan wrote:
> Martin Bonner wrote:
>> On 9/28/05, David Abrahams <dave_at_[hidden]> wrote:
>>
>>>>> Hmm... Also, is the apparent dependency on ASCII encoding truly
>>>>> portable?
>>
>>
>> Caleb Epstein wrote:
>>
>>>> Doubtful. Wouldn't testing for std::isalnum || '-' || '_' be a
>>>> better idea? Perhaps not quite as performant (once the lookup table
>>>> was made static), but certainly more portable and simpler to read.
>>
>>
>> Simon Buchan wrote:
>>
>>> In most implementations, the is*()'s are implemented using exactly
>>> the same method.
>>
>>
>> Yes, but the table will be different on an EBCDIC implementation
>> than they are on an ASCII implementation. The point is that the
>> specified table hard codes ASCII, so when somebody runs it on an IBM
>> mainframe it will give the wrong answer.
>>
> This may be irrelevant anyway:
> http://www.w3.org/TR/REC-xml/#charsets
> 2.2 Characters
>
> [Definition: A parsed entity contains text, a sequence of characters,
> which may represent markup or character data.] [Definition: A character
> is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC
> 10646]. Legal characters are tab, carriage return, line feed, and the
> legal characters of Unicode and ISO/IEC 10646. The versions of these
> standards cited in A.1 Normative References were current at the time
> this document was prepared. New characters may be added to these
> standards by amendments or new editions. Consequently, XML processors
> MUST accept any character in the range specified for Char.]
> Character Range
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
>              [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>     /* any Unicode character, excluding the surrogate blocks, FFFE,
>        and FFFF. */
>
> The mechanism for encoding character code points into bit patterns MAY
> vary from entity to entity. All XML processors MUST accept the UTF-8
> and UTF-16 encodings of Unicode 3.1 [Unicode3]; the mechanisms for
> signaling which of the two is in use, or for bringing other encodings
> into play, are discussed later, in 4.3.3 Character Encoding in Entities.
>
> The interesting bit is the last sentence. (I didn't want to take it out
> of context)

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk