Subject: Re: [boost] boost utf-8 code conversion facet has security problems
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2010-10-15 18:23:50


  On 10/15/2010 02:18 AM, Artyom wrote:
> Actually I want to mention that UTF-8 codecvt facet implementation
> has several other problems:
>
> 1. When sizeof(wchar_t)==2 it supports only UCS-2 and not full UTF-16
In general that would have to be true of any
implementation: wchar_t IS a wide character, and not an
encoding scheme. According to the spec wchar_t is
supposed to represent the internal character set of the
program as a wide character. (From a recent C++ draft
standard N3126):

3.9.1 Fundamental types
...
5 Type wchar_t is a distinct type whose values can
represent distinct codes for all members of the largest
extended character set specified among the supported
locales (22.3.1).

In 16 bits you could only support UCS2 in a single wide
character (and current C++ specs clearly state that).
That means it would be a mistake in that environment to
support any locale that needed characters outside the
Unicode plane 0, codes U+0000-U+FFFF, also known as the
Basic Multilingual Plane (BMP). In effect, if a
compiler writer chose to implement a 16-bit wchar_t,
they would be choosing not to support any locale that
needed codes from any plane outside the BMP. That
would be silly since Unicode says the following about
plane 1, codes U+10000-U+1FFFF, the Supplementary
Multilingual Plane (SMP). (From the Unicode Standard
version 5.2):

The majority of scripts currently identified for
encoding will eventually be allocated in the SMP. As a
result, some areas of the SMP will experience common,
frequent usage.

So right now you'd do OK with a 16-bit wchar_t holding
UCS2 codes in most parts of the world, but going
forward, no. (Too bad a 17-bit wchar_t doesn't make
sense; it would hold exactly the BMP and the SMP.) Just
because it's silly doesn't mean no one would do it, of
course. Full UTF-16 requires two 16-bit codes for each
code outside the BMP, so such a code won't fit in a
single 16-bit wchar_t. Support of the recent C++
drafts requires a char32_t basic type anyway, so I
can't imagine anyone using a 16-bit wchar_t going
forward. Nevertheless, my code notes the presence of a
16-bit wchar_t and returns an encoding error in do_in(),
as required by the C++ spec, if a UTF-8 sequence would
overflow it.
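
To make that overflow check concrete, here is a minimal sketch (not my
actual facet code, and not the Boost implementation). The names
decode_result, max_code_point and decode_one are made up for
illustration; the idea is just to decode one UTF-8 sequence and report
an error when the resulting code won't fit in a single 16-bit element:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the ok/partial/error outcomes of
// std::codecvt_base::result.
enum class decode_result { ok, partial, error };

// Largest code a single Elem can hold: U+FFFF for a 16-bit Elem (UCS2),
// U+10FFFF for anything wider (UCS4).
template <typename Elem>
constexpr std::uint32_t max_code_point() {
    return sizeof(Elem) == 2 ? 0xFFFFu : 0x10FFFFu;
}

// Decode one UTF-8 sequence from [first, last) into a single Elem.
// On success, consumed is the number of bytes used; otherwise it is 0.
template <typename Elem>
decode_result decode_one(const unsigned char* first, const unsigned char* last,
                         Elem& out, std::size_t& consumed) {
    assert(first <= last);
    consumed = 0;
    if (first == last) return decode_result::partial;

    const unsigned char lead = *first;
    std::uint32_t cp = 0;        // code point being assembled
    std::size_t len = 0;         // total bytes in this sequence
    std::uint32_t smallest = 0;  // smallest value allowed for this length
    if (lead < 0x80)                { cp = lead;        len = 1; smallest = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; smallest = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; smallest = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; smallest = 0x10000; }
    else return decode_result::error;  // stray continuation or invalid lead byte

    // Not enough input yet for the whole sequence.
    if (static_cast<std::size_t>(last - first) < len) return decode_result::partial;

    for (std::size_t i = 1; i < len; ++i) {
        if ((first[i] & 0xC0) != 0x80) return decode_result::error;
        cp = (cp << 6) | (first[i] & 0x3Fu);
    }

    if (cp < smallest) return decode_result::error;                   // overlong form
    if (cp >= 0xD800 && cp <= 0xDFFF) return decode_result::error;    // surrogate
    if (cp > max_code_point<Elem>()) return decode_result::error;     // would overflow Elem

    out = static_cast<Elem>(cp);
    consumed = len;
    return decode_result::ok;
}
```

With a 16-bit element type such as char16_t, a four-byte sequence is
rejected by the last range check; with char32_t the same bytes decode
normally.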

I'd like to see support for the same three facets required
by recent drafts, named (as in the spec) codecvt_utf8
(one of UCS2 or UCS4 to UTF-8), codecvt_utf16 (one of
UCS2 or UCS4 to UTF-16), and codecvt_utf8_utf16 (UTF-16
to UTF-8), which explicitly state the two encodings:

22.4.1.4 Class template codecvt
...
3 ... codecvt<char, char, mbstate_t> implements a
degenerate conversion; it does not convert at all. The
specialization codecvt<char16_t, char, mbstate_t>
converts between the UTF-16 and UTF-8 encodings
schemes, and the specialization codecvt <char32_t,
char, mbstate_t> converts between the UTF-32 and UTF-8
encodings schemes. codecvt<wchar_t,char,mbstate_t>
converts between the native character sets for narrow
and wide characters.

I find the last ambiguous; it sounds like it would just
be a conversion between single chars and single
wchar_ts, maybe between a locale-specified ISO encoding
like ISO-8859-? and a UCS wchar_t, but that's not what
they mean, I think. If you look in the locale.stdcvt
section they say:

22.5 Standard code conversion facets
...
4 For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte
sequences and UCS2 or UCS4 (depending on the size of
Elem) within the program.

clearly saying that it's a conversion to UCS2 for
wchar_t of 16-bit or UCS4 for wchar_t of 32-bit.
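
For the other direction, a rough sketch of how the element size can
pick UCS2 or UCS4 when writing UTF-8. This is only an illustration
under my reading of the draft text; append_utf8 is a hypothetical
helper name, not something from the standard or from Boost:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one UCS2/UCS4 element to out.
// Returns false (and leaves out unchanged) if the element is not a
// valid code for the element width.
template <typename Elem>
bool append_utf8(Elem unit, std::string& out) {
    static_assert(sizeof(Elem) == 2 || sizeof(Elem) == 4,
                  "Elem must be a 16-bit or 32-bit character type");

    // A 16-bit Elem is treated as UCS2 (BMP only), a 32-bit Elem as UCS4.
    const std::uint32_t limit = (sizeof(Elem) == 2) ? 0xFFFFu : 0x10FFFFu;
    const std::uint32_t cp = static_cast<std::uint32_t>(unit);

    if (cp > limit) return false;                        // outside the range for this width
    if (cp >= 0xD800u && cp <= 0xDFFFu) return false;    // surrogates are not characters
    assert(cp <= 0x10FFFFu);

    if (cp < 0x80u) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800u) {
        out.push_back(static_cast<char>(0xC0u | (cp >> 6)));
        out.push_back(static_cast<char>(0x80u | (cp & 0x3Fu)));
    } else if (cp < 0x10000u) {
        out.push_back(static_cast<char>(0xE0u | (cp >> 12)));
        out.push_back(static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu)));
        out.push_back(static_cast<char>(0x80u | (cp & 0x3Fu)));
    } else {
        out.push_back(static_cast<char>(0xF0u | (cp >> 18)));
        out.push_back(static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu)));
        out.push_back(static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu)));
        out.push_back(static_cast<char>(0x80u | (cp & 0x3Fu)));
    }
    return true;
}
```

Note that a 16-bit element never reaches the four-byte branch, which
is the UCS2 restriction in practice; a surrogate value is treated as
an error rather than as half of a UTF-16 pair.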

Future libstdc++ releases will provide these anyway,
but they won't be available everywhere for quite a while.

Patrick

