Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-18 16:18:58


On 01/18/2011 08:23 AM, Chad Nelson wrote:
> On Tue, 18 Jan 2011 16:04:29 +0000
> Alexander Lamaison<awl03_at_[hidden]> wrote:
>
>> On Tue, 18 Jan 2011 10:54:57 -0500, Chad Nelson wrote:
>>
>>> Why delegate them to another library? Those classes already have
>>> efficient, flexible, and correct iterator-based template code for the
>>> conversions between the UTF-* types. I'd rather just farm out the
>>> stuff that those types are weak at, like converting to and from
>>> system-specific locales.
>> If they can do that, that's great! The conversion code was so short
>> that I assumed it wasn't a full, complete conversion algorithm.
> They're complete, and accurate. The algorithms aren't overly complex,
> they just translate between different forms of the exact same data,
> after all.
That's true if you can assume that the encoding is already correct.
Most of the code needed to convert from UTF-8 to UTF-32 or UTF-16, for
example, exists to check for overlong encodings, which cause security
issues, and for other violations of the well-formedness table in the
Unicode spec. If you skip those checks, especially if you carry things
around in UTF-8 by preference and do your checking in that encoding, you
open yourself up to problems (see
http://capec.mitre.org/data/definitions/80.html). If you never accept
UTF-8 encoded data from users, of course, you don't have to worry about
this, but I would write the conversion defensively.
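
To make the point about defensive conversion concrete, here is a minimal
sketch of a UTF-8 decoder that rejects ill-formed input. It is not taken
from Chad's classes or from any Boost library; the names decode_utf8 and
is_well_formed_utf8 are hypothetical. It checks decoded values rather than
byte ranges, which should accept the same set of sequences as the
well-formedness table in the Unicode standard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only, not a Boost API).
// Decodes one code point from s starting at pos. On success it stores the
// code point in cp, advances pos past the sequence, and returns true.
// On any ill-formed sequence it returns false and leaves pos unchanged.
bool decode_utf8(const std::string& s, std::size_t& pos, char32_t& cp)
{
    const std::size_t n = s.size();
    if (pos >= n) return false;

    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    char32_t value = 0;
    char32_t lowest = 0;  // smallest code point allowed for this length

    if (b0 < 0x80)      { len = 1; value = b0;        lowest = 0x0; }
    else if (b0 < 0xC0) { return false; }  // stray continuation byte
    else if (b0 < 0xE0) { len = 2; value = b0 & 0x1F; lowest = 0x80; }
    else if (b0 < 0xF0) { len = 3; value = b0 & 0x0F; lowest = 0x800; }
    else if (b0 < 0xF8) { len = 4; value = b0 & 0x07; lowest = 0x10000; }
    else                { return false; }  // obsolete 5- and 6-byte forms

    if (len > n - pos) return false;       // truncated sequence

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
        value = (value << 6) | (b & 0x3F);
    }

    if (value < lowest) return false;                      // overlong encoding
    if (value >= 0xD800 && value <= 0xDFFF) return false;  // UTF-16 surrogate
    if (value > 0x10FFFF) return false;                    // beyond Unicode range

    cp = value;
    pos += len;
    return true;
}

// Returns true only if every sequence in s decodes cleanly.
bool is_well_formed_utf8(const std::string& s)
{
    std::size_t pos = 0;
    char32_t cp = 0;
    while (pos < s.size()) {
        if (!decode_utf8(s, pos, cp)) return false;
    }
    return true;
}
```

With this sketch, is_well_formed_utf8("\xC0\xAF") returns false. Those two
bytes are an overlong encoding of '/', the kind of input that slips past a
path check done on the raw bytes and is then turned into a real slash by a
lenient decoder.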

I should say that I haven't read your code yet, and you might very well
do this correctly. The code conversion facet used by a lot of Boost
code doesn't. It was written to an older version of the UTF-8 spec
and allows 5- and 6-byte encodings, so it does have these security
concerns. I offered a while back to replace it, but I assume that with
the locale stuff coming up for review it would be better to go with
that. I did write a replacement for utf8_codecvt_facet.cpp and
utf8_codecvt_facet.hpp that could be dropped in for use by
serialization, and it passes the tests in that part of Boost.

Patrick

