From: Gennadiy Rozental (gennadiy.rozental_at_[hidden])
Date: 2005-11-01 13:26:19


>> IMO, Unicode support is way beyond string template parameter. Unicode means
>> different character sets to support, different encoding format, different
>> encoding schemes sets and different tradeoffs in optimization and all above.
>
> Sort of. For XML processing, the primary feature of Unicode is the extended
> character set. For XML 1.0, once an XML processor has decided whether or not
> a given character is whitespace, one of the special characters (such as <, >,
> and &), a name start character, a name character or "other", the
> peculiarities of Unicode are mostly irrelevant. Obviously, there has to be
> code to handle the detection of the input encoding, and conversion to a
> stream of Unicode codepoints, in order to facilitate such classification.
> However, beyond that, the details don't matter.
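
For reference, the classification described in the quoted paragraph could be sketched roughly as below. This is only an illustration: the names CharClass, Range, in_ranges and classify are made up here, and the name tables follow the NameStartChar/NameChar productions of XML 1.0 Fifth Edition, which postdates this thread (earlier editions defined names through different tables).

```cpp
#include <cstddef>

// The five classes named in the quoted text. All identifiers in this
// sketch are hypothetical, not taken from any proposed library.
enum class CharClass { Whitespace, Special, NameStart, NameChar, Other };

struct Range { char32_t lo, hi; };

// True if c falls inside any of the inclusive ranges.
template <std::size_t N>
constexpr bool in_ranges(char32_t c, const Range (&ranges)[N]) {
    for (std::size_t i = 0; i < N; ++i)
        if (c >= ranges[i].lo && c <= ranges[i].hi) return true;
    return false;
}

// Assumption: NameStartChar ranges as in XML 1.0 Fifth Edition.
constexpr Range name_start[] = {
    {U':', U':'}, {U'A', U'Z'}, {U'_', U'_'}, {U'a', U'z'},
    {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
    {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F},
    {0x2C00, 0x2FEF}, {0x3001, 0xD7FF}, {0xF900, 0xFDCF},
    {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}};

// Assumption: the extra characters NameChar allows beyond NameStartChar.
constexpr Range name_extra[] = {
    {U'-', U'-'}, {U'.', U'.'}, {U'0', U'9'},
    {0xB7, 0xB7}, {0x300, 0x36F}, {0x203F, 0x2040}};

// Classify one already-decoded Unicode code point.
constexpr CharClass classify(char32_t c) {
    if (c == 0x20 || c == 0x09 || c == 0x0A || c == 0x0D)
        return CharClass::Whitespace;
    if (c == U'<' || c == U'>' || c == U'&')
        return CharClass::Special;
    if (in_ranges(c, name_start)) return CharClass::NameStart;
    if (in_ranges(c, name_extra)) return CharClass::NameChar;
    return CharClass::Other;
}
```

Note that classify() takes a code point, so it says nothing about how the input was decoded or how the result is stored.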

I think it's more than just that.

Scenario 1: I prefer to parse documents that use only the first plane, with
UCS2 as the encoding format and UTF8 and UTF16 as the encoding schemes. IOW,
I will always use wchar_t and wstring.
Scenario 2: I prefer to parse documents that use only ASCII chars, with 8bit
as the encoding format and 7bit as the encoding scheme. IOW, I prefer to use
char and std::string, and I do not want to know about any transcoding, wide
chars, etc.
Scenario 3: I prefer to parse documents that use the whole Unicode set, with
UTF16 as the encoding format and UTF8 and UTF16 as the encoding schemes, and
I want the parser to be lazy. IOW, if it is a big (huge) XML document that
uses UTF8, I do not want the parser to convert any CDATA into the native
encoding form until requested, but only to do the local char-by-char
conversion required for markup detection. (Essentially, I want to limit
memory usage and unnecessary work.)
Scenario 4: I prefer to parse documents that use the whole Unicode set, with
UCS4 as the encoding format, and to support a wide variety (10 or more) of
different encoding schemes. I do not care that much about performance and
memory usage, but I prefer a single parser that does it all.
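
To make the laziness in Scenario 3 concrete, here is a minimal sketch, assuming well-formed UTF8 input. It relies on the fact that bytes below 0x80 never occur inside a multi-byte UTF8 sequence, so '<' can be found by a plain byte search without decoding anything. Token, next_token, Decoded and decode_at are hypothetical names, and the markup scan is deliberately naive (it ignores comments, CDATA sections and '>' inside attribute values).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical token: a view into the caller's buffer, nothing is copied
// or transcoded when the token is produced.
struct Token {
    bool is_markup;          // true for "<...>", false for character data
    std::string_view bytes;  // still undecoded UTF8
};

// Returns the token that starts at pos (requires pos < in.size()).
// Naive on purpose: markup is taken to end at the next '>' byte.
constexpr Token next_token(std::string_view in, std::size_t pos) {
    if (in[pos] == '<') {
        std::size_t end = in.find('>', pos);
        // Unterminated markup: take the rest of the input.
        std::size_t len = end == std::string_view::npos ? in.size() - pos
                                                        : end - pos + 1;
        return {true, in.substr(pos, len)};
    }
    std::size_t end = in.find('<', pos);
    return {false, in.substr(pos, end == std::string_view::npos
                                      ? std::string_view::npos
                                      : end - pos)};
}

struct Decoded {
    char32_t cp;      // decoded code point
    std::size_t len;  // number of bytes consumed
};

// Decodes one code point on request. Assumption: s holds well-formed
// UTF8 and i is the index of a lead byte; no validation is done.
constexpr Decoded decode_at(std::string_view s, std::size_t i) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    // Keep only the payload bits of the lead byte.
    char32_t cp = static_cast<char32_t>(len == 1 ? b : (b & (0xFF >> (len + 1))));
    for (std::size_t k = 1; k < len; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    return {cp, len};
}
```

A caller that never looks at the character data never pays for decode_at(), which is the memory and work saving the scenario asks for.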

I could list a lot of different usage schemes with different tradeoffs.
Eventually this is bound to affect the XML parser interface with regard to
Unicode support. (Instead of Unicode I would prefer to use the terms charsets
and encoding scheme sets: Unicode is just one particular charset/encoding
scheme set combination.)
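
One way such tradeoffs could surface in the interface is a policy parameter rather than a bare string type. The sketch below is only an assumption about what that might look like; none of these names (ascii_policy, bmp_policy, lazy_utf16_policy, full_ucs4_policy, basic_xml_parser) come from an existing library.

```cpp
#include <string>

// Hypothetical policies, one per scenario listed earlier in this message.
struct ascii_policy {                       // Scenario 2
    using char_type = char;
    using string_type = std::string;
    static constexpr bool transcodes = false;  // input is used as-is
    static constexpr bool lazy_text = false;
};

struct bmp_policy {                         // Scenario 1
    using char_type = wchar_t;              // note: wchar_t width is platform-specific
    using string_type = std::wstring;
    static constexpr bool transcodes = true;
    static constexpr bool lazy_text = false;
};

struct lazy_utf16_policy {                  // Scenario 3
    using char_type = char16_t;
    using string_type = std::u16string;
    static constexpr bool transcodes = true;
    static constexpr bool lazy_text = true;    // convert text only on request
};

struct full_ucs4_policy {                   // Scenario 4
    using char_type = char32_t;
    using string_type = std::u32string;
    static constexpr bool transcodes = true;
    static constexpr bool lazy_text = false;
};

// Skeleton only: the parser takes its storage type and its behaviour
// switches from the policy instead of from a single string parameter.
template <class Policy>
class basic_xml_parser {
public:
    using char_type = typename Policy::char_type;
    using string_type = typename Policy::string_type;
    static constexpr bool lazy_text = Policy::lazy_text;
};
```

The point of the sketch is that the string type is only one of several axes that vary between the scenarios.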

Gennadiy

