Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-19 20:23:44
On 01/19/2011 02:33 AM, Matus Chochlik wrote:
> ... elision by patrick ...
> - It is extensible, so once we have done the painful
> transition we will not have to do it again. Currently
> utf-8 uses 1-4 (or 1-6) byte sequences to encode code
The 5- and 6-byte sequences are from early versions of the UTF-8
specification and have known negative security implications. You should
never use them in your encoding, nor should you ever accept them as valid
UTF-8. The entire Unicode code space (U+0000 through U+10FFFF, 1,114,112
code points) is encodable in 4-byte standard-compliant UTF-8. Please see
RFC 3629, "UTF-8, a transformation format of ISO 10646", F. Yergeau,
November 2003. This is also STD 63. Also see Table 3-7, "Well-Formed
UTF-8 Byte Sequences", in version 5.2 of the Unicode Standard. I can't
emphasize this enough. There have been real, serious problems that cost
people money from following the older naive
> to 1-N bytes (unlike UCS-X and i'm not sure about
If you extended it, it would no longer be UTF-8, which is an encoding of UCS.
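
To make the well-formedness rules concrete, here is a minimal sketch of a
validator that follows the byte ranges in Table 3-7 of the Unicode Standard
(the same ranges RFC 3629 allows). The function name is_well_formed_utf8 is
made up for illustration; it is not a Boost or standard-library API, and a
production decoder would likely also want to report where the input failed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only.
// Returns true if s consists solely of well-formed UTF-8 sequences
// (1 to 4 bytes each) as listed in Unicode Table 3-7 / RFC 3629.
bool is_well_formed_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) {            // single-byte (ASCII) sequence
            ++i;
            continue;
        }

        std::size_t trail = 0;       // continuation bytes that must follow
        unsigned char lo = 0x80;     // allowed range of the FIRST continuation
        unsigned char hi = 0xBF;     // byte; later ones are always 0x80..0xBF

        if (b0 >= 0xC2 && b0 <= 0xDF)      { trail = 1; }
        else if (b0 == 0xE0)               { trail = 2; lo = 0xA0; } // no overlongs
        else if (b0 >= 0xE1 && b0 <= 0xEC) { trail = 2; }
        else if (b0 == 0xED)               { trail = 2; hi = 0x9F; } // no surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { trail = 2; }
        else if (b0 == 0xF0)               { trail = 3; lo = 0x90; } // no overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) { trail = 3; }
        else if (b0 == 0xF4)               { trail = 3; hi = 0x8F; } // <= U+10FFFF
        else {
            // 0x80..0xC1 (stray continuation or overlong lead) and 0xF5..0xFF,
            // which includes the old 5- and 6-byte lead bytes 0xF8..0xFD.
            return false;
        }

        if (n - i <= trail) {
            return false;            // sequence is truncated
        }

        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) {
            return false;
        }
        for (std::size_t k = 2; k <= trail; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) {
                return false;
            }
        }
        i += trail + 1;
    }
    return true;
}
```

With this sketch, "\xF4\x8F\xBF\xBF" (U+10FFFF) is accepted, while a 5-byte
sequence starting with 0xF8, a 6-byte sequence starting with 0xFC, and the
overlong form "\xC0\xAF" (a disguised '/') are all rejected.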
> even if we dig out the stargate or join the United
> Federation of Planets and captain Kirk, every time
> he returns home, brings a truckload of new writing
> scripts to support, UTF-8 will be able to handle it.
Well, most of the code space of UCS is still unused. There's plenty of
room. The 1,114,112 code points from U+0000 through U+10FFFF are a lot.
> just my 0.02 strips of gold pressed latinum :)
> Best regards,