From: Mattias Flodin (flodin_at_[hidden])
Date: 2002-04-07 02:27:53


On Sat, Apr 06, 2002 at 09:17:47PM -0800, Sean Parent wrote:
> On this topic, the text handling in the entire C++ standard is seriously
> broken for international text. Most compilers treat wchar_t as 16 bits, which
> isn't sufficient for full Unicode text handling (just the UCS-2 subset).
> Even Unicode is a bit insufficient these days given China's threat of
> denying sale to applications that don't support GB-18030 which hasn't been
> fully incorporated into the Unicode standard, and which also requires 32 bits
> for full support.
>
> What I would like to see happen (this is just a rough idea - I haven't taken
> the time to write this up as any kind of proposal) is a new string class
> template added that can handle multi-word encoded text. Untagged strings
> would be assumed to be encoded in UTF-8 (for string, a multi-word encoding
> that's a superset of 7 bit ASCII) or UTF-16 (for wstring, a multi-word
> encoding that supersets UCS-2). Processing could then be encoding-aware,
> dealing with linguistic characters instead of bytes.

Right, these are all important issues that should be explored, since C++
already took a step down this road when wchar_t, char_traits and locales were
added. For lexical_cast, however, conversion to and from integers does not
require awareness of multi-byte or multi-word encodings; it should be able to
treat the data as a string of fixed-size elements. The goal is to represent
numbers like "123" in the encoding of the string, not to translate them
into e.g. Japanese words like "hyakunijusan." To my knowledge, no encoding
uses anything fancy for standard digits like '1', '2', '3'.
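
To make the "fixed-size elements" point concrete, here is a minimal sketch of
formatting an integer one element per digit. The name digits_of is hypothetical
and is not part of Boost or lexical_cast; the only library facility it relies on
is std::ctype<CharT>::widen, which maps a narrow character such as '1' to the
target character type.

```cpp
#include <locale>
#include <string>

// Hypothetical helper (not part of Boost): formats a non-negative integer as
// a basic_string<CharT>, producing exactly one element per decimal digit.
template <typename CharT>
std::basic_string<CharT> digits_of(unsigned long value,
                                   const std::locale& loc = std::locale::classic())
{
    // widen() converts a narrow character from the basic set to CharT.
    const std::ctype<CharT>& ct = std::use_facet<std::ctype<CharT> >(loc);

    std::basic_string<CharT> out;
    do {
        // Prepend the least significant digit, then drop it from the value.
        out.insert(out.begin(), ct.widen(static_cast<char>('0' + value % 10)));
        value /= 10;
    } while (value != 0);
    return out;
}
```

With this sketch, digits_of<char>(123) and digits_of<wchar_t>(123) both yield
the three elements '1', '2', '3' in their respective character types, without
any knowledge of multi-byte sequences.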

As far as basic_string goes, it can be used for a lot of international
applications as-is, since you can often do without advanced text processing
that has to process each character individually. I might, for instance, use it
in an instant-messaging program, where the user can enter messages in his
native language. I would store the message in a string and then send it to the
destination without worrying about any multi-byte encodings in the string.
But if I were to use lexical_cast to present some number (e.g. ping
time) to the user, it would come out as unintelligible characters because
lexical_cast treats the string as 7-bit ASCII.
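
For comparison, a conversion that is parameterized on the character type could
look roughly like the following. This is only a sketch under my own assumptions:
to_text is a made-up name, not lexical_cast's interface, and it simply routes
the value through a std::basic_ostringstream whose character type matches the
target string.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper, not Boost's lexical_cast: formats a value through a
// stream whose character type matches the string being produced.
template <typename CharT, typename Source>
std::basic_string<CharT> to_text(const Source& value)
{
    std::basic_ostringstream<CharT> out;
    out.imbue(std::locale::classic());  // plain digits, no grouping separators
    out << value;
    return out.str();
}
```

A ping time could then be shown with to_text<wchar_t>(ping_ms) in a program
that keeps its messages in std::wstring, where ping_ms stands for whatever
integer the program measured.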

Given the state of things, it would probably be impossible to make
basic_string support multi-word encodings better, since iirc there are strict
requirements on the time complexity of the subscript operator and other
operations. Unless a special class were written for the purpose (ropes?), one
way to support them better could be special iterator adaptors for
different encodings, e.g. utf8_iterator(mystring.begin()) would create an
iterator that steps properly over multi-byte elements.
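
A rough sketch of what such an adaptor might look like follows. Everything in
it is an assumption made for illustration: the class name
utf8_code_point_iterator, the choice of unsigned long for code points, and
above all the precondition that the underlying bytes are well-formed UTF-8
(nothing is validated, so malformed input could step past the end).

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor: walks a UTF-8 byte sequence one code point at a time.
// Assumes well-formed UTF-8; no validation is attempted.
template <typename ByteIter>
class utf8_code_point_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef unsigned long           value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef const value_type*       pointer;
    typedef value_type              reference;

    explicit utf8_code_point_iterator(ByteIter pos) : pos_(pos) {}

    // Decode the code point that starts at the current position.
    value_type operator*() const {
        ByteIter it = pos_;
        const unsigned char lead = static_cast<unsigned char>(*it);
        const int len = sequence_length(lead);
        // Keep only the payload bits of the lead byte.
        value_type cp = (len == 1) ? lead : (lead & (0x7F >> len));
        for (int i = 1; i < len; ++i) {
            ++it;
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3F);
        }
        return cp;
    }

    // Step over every byte of the current code point.
    utf8_code_point_iterator& operator++() {
        std::advance(pos_, sequence_length(static_cast<unsigned char>(*pos_)));
        return *this;
    }
    utf8_code_point_iterator operator++(int) {
        utf8_code_point_iterator tmp(*this);
        ++*this;
        return tmp;
    }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return !(a == b);
    }

private:
    // Number of bytes in the sequence introduced by this lead byte.
    static int sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;  // assumed to be a 4-byte lead (0xF0..0xF7)
    }

    ByteIter pos_;
};

// Helper so that utf8_iterator(mystring.begin()) deduces the iterator type.
template <typename ByteIter>
utf8_code_point_iterator<ByteIter> utf8_iterator(ByteIter pos) {
    return utf8_code_point_iterator<ByteIter>(pos);
}
```

Counting characters rather than bytes then becomes
std::distance(utf8_iterator(s.begin()), utf8_iterator(s.end())), at linear
cost, which is exactly the kind of operation basic_string's constant-time
subscript cannot offer for a variable-width encoding.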

/Mattias

-- 
Mattias Flodin <flodin_at_[hidden]>  -  http://www.cs.umu.se/~flodin/
Room NADV 102
Department of Computing Science
Umeå University
S-901 87 Umeå, Sweden
--
"Grove giveth and Gates taketh away. " -- Bob Metcalfe (inventor of Ethernet)
on the trend of hardware speedups not being able to keep up with software
demands.

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk