Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-18 16:18:58
On 01/18/2011 08:23 AM, Chad Nelson wrote:
> On Tue, 18 Jan 2011 16:04:29 +0000
> Alexander Lamaison <awl03_at_[hidden]> wrote:
>> On Tue, 18 Jan 2011 10:54:57 -0500, Chad Nelson wrote:
>>> Why delegate them to another library? Those classes already have
>>> efficient, flexible, and correct iterator-based template code for the
>>> conversions between the UTF-* types. I'd rather just farm out the
>>> stuff that those types are weak at, like converting to and from
>>> system-specific locales.
>> If they can do that, that's great! The conversion code was so short
>> that I assumed it wasn't a full, complete conversion algorithm.
> They're complete, and accurate. The algorithms aren't overly complex,
> they just translate between different forms of the exact same data,
> after all.
If you can assume that the encoding is already correct, that's true.
Most of the code to convert from UTF-8 to UTF-32 or UTF-16, for example,
is there to check that you don't have overlong encodings, which cause
security issues, or other violations of the well-formedness table in the
Unicode spec. Otherwise, especially if you carry things around in UTF-8
by preference and do your checking in that encoding, you open yourself
up to problems (http://capec.mitre.org/data/definitions/80.html). If you
never accept UTF-8 encoded input from users, of course, you don't have
to worry about this, but I would write the conversion defensively.

I should say that I haven't read your code yet, and you might very well
do this correctly. The code conversion facet used by a lot of Boost
code doesn't. It was written to an older version of the UTF-8 spec
and allows 5- and 6-byte encodings, so it does have these security
concerns. I offered a while back to replace it, but I assume that with
the locale stuff coming up for review it would be better to go with
that. I did write a replacement for utf8_codecvt_facet.cpp and
utf8_codecvt_facet.hpp that could be dropped in for use by
serialization, and it passes the tests in that part of Boost.
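
To make "write the conversion defensively" concrete, here is a rough
sketch of a UTF-8 to UTF-32 conversion that rejects ill-formed input.
It is not the code from Chad's classes or from utf8_codecvt_facet; the
names decode_one and utf8_to_utf32 are made up for illustration, and
the checks are my reading of the well-formedness rules (no overlong
forms, no surrogates, nothing above U+10FFFF, no 5- or 6-byte leads).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one code point starting at s[i].
// On success, stores it in cp, advances i past the sequence, returns true.
// Returns false (leaving i unchanged) on any ill-formed sequence.
bool decode_one(const std::string& s, std::size_t& i, char32_t& cp) {
    if (i >= s.size()) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);

    std::size_t len;   // total bytes in this sequence
    char32_t value;    // code point being assembled
    char32_t min;      // smallest value that legitimately needs `len` bytes

    if (b0 < 0x80)                { len = 1; value = b0;        min = 0x0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; value = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; value = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; value = b0 & 0x07; min = 0x10000; }
    else return false;  // stray continuation byte, or an old 5/6-byte lead

    if (s.size() - i < len) return false;  // truncated sequence

    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
        value = (value << 6) | (b & 0x3F);
    }

    if (value < min) return false;                         // overlong encoding
    if (value >= 0xD800 && value <= 0xDFFF) return false;  // UTF-16 surrogate
    if (value > 0x10FFFF) return false;                    // beyond Unicode range

    cp = value;
    i += len;
    return true;
}

// Hypothetical converter: returns false on the first ill-formed sequence.
// `out` holds the code points decoded before the failure, if any.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        char32_t cp;
        if (!decode_one(in, i, cp)) return false;
        out.push_back(cp);
    }
    return true;
}
```

With this, the classic overlong "/" ("\xC0\xAF") is refused rather than
decoded to U+002F, which is the kind of input the CAPEC entry above is
about. Whether to fail outright, as here, or substitute U+FFFD is a
policy choice; I picked failing because it is the simplest to show.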