Subject: Re: [boost] boost utf-8 code conversion facet has security problems
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2010-10-18 02:07:23


  On 10/16/2010 06:10 AM, Sebastian Redl wrote:
> On 16.10.2010, at 00:23, Patrick Horgan wrote:
>
>> Support of the recent C++ drafts requires a char32_t basic type anyway, so I can't imagine anyone using a 16-bit wchar_t going forward,
> There's absolutely no way Windows programming will ever change wchar_t away from 16 bits, and people will continue to use it.
Then that implies that it can only hold UCS2. That's a
choice. In C99, the type wchar_t is officially
intended to be used only for 32-bit ISO 10646 values,
independent of the currently used locale. C99
subclause 6.10.8 specifies that the value of the macro
__STDC_ISO_10646__
shall be "an integer constant of the form yyyymmL (for
example, 199712L), intended to indicate that values of
type wchar_t are the coded representations of the
characters defined by ISO/IEC 10646, along with all
amendments and technical corrigenda as of the specified
year and month." Of course Microsoft isn't able to
define that, since you can't hold 21 bits in a 16-bit
data type.
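
A program can ask the implementation which case it is
in. What follows is only a minimal sketch, assuming a
C++11 compiler; the helper names wchar_bits,
wchar_holds_all_of_unicode and claims_iso_10646 are
made up for illustration and are not part of any
standard or of Boost.

```cpp
#include <climits>  // CHAR_BIT
#include <cwchar>   // WCHAR_MAX

// Width of wchar_t in bits on this implementation (hypothetical helper).
constexpr unsigned wchar_bits() {
    return static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);
}

// True if every code point up to U+10FFFF fits in a single wchar_t.
constexpr bool wchar_holds_all_of_unicode() {
    return static_cast<unsigned long long>(WCHAR_MAX) >= 0x10FFFFull;
}

// True if the implementation defines __STDC_ISO_10646__, i.e. promises
// that wchar_t values are ISO/IEC 10646 code points.
constexpr bool claims_iso_10646() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```

With a 16-bit wchar_t the second function returns
false, which is the situation described above.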

Both C and C++ in their current draft standards require
a wchar_t to be a data type whose range of values can
represent distinct codes for all members of the largest
extended character set specified among the supported
locales. It takes 21 bits to hold all of UCS4. That
would make Visual C++ non-compliant with the current
draft standards. Perhaps that will be removed or
fudged with another macro before final vote.

Of course, this is why the new standards have explicit
char16_t and char32_t: wchar_t can't be relied on to
have any particular width. Nicely, the new types are
defined to be unsigned :) I mislike signatures that use
char, since it's the only integer type that is
explicitly allowed to be either signed or unsigned. Of
course, when dealing with conversions you always need
unsigned, because the effects of sign extension are
startling at the least and erroneous at the worst. I
wish they had originally specified
std::codecvt<wchar_t, unsigned char, std::mbstate_t>.
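
To make the sign-extension point concrete, here is a
small sketch. It takes signed char explicitly so the
behaviour is the same everywhere; plain char behaves
this way only where it happens to be signed. The
function names widen_naive and widen_safe are
hypothetical.

```cpp
#include <cstdint>

// Widen one UTF-8 code unit the naive way. A byte in the range 0x80..0xFF
// is negative as a signed char, so the conversion sign-extends it.
inline std::uint32_t widen_naive(signed char c) {
    return static_cast<std::uint32_t>(c);
}

// Go through unsigned char first, so the byte value 0..255 is preserved.
inline std::uint32_t widen_safe(signed char c) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(c));
}
```

For the byte 0xC3 (stored in a signed char as -61),
widen_naive yields 0xFFFFFFC3 while widen_safe yields
0xC3, which is the value a decoder actually wants.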

The current Unicode standard, 5.2, notes that there are
places where wchar_t is only 8 bits wide, and suggests
that programmers going forward use only char16_t (for
UCS2) and char32_t (for UCS2 or UCS4). Unfortunately,
the signature std::codecvt<wchar_t, char,
std::mbstate_t> still has to be dealt with. As required
by the specs, I check whether wchar_t is 16 bits wide
and, if it is, return error whenever do_in() or
do_length() is asked to decode any utf-8 that would
yield a code greater than U+FFFF. That's the right
thing to do.
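
The range check can be sketched outside of codecvt.
This is not the facet's actual code, just a
free-standing illustration under my own assumptions:
decode_status, decode_result and decode_one are
hypothetical names, and max_cp stands for the largest
code point the target type can hold (0xFFFF for a
16-bit wchar_t, 0x10FFFF otherwise).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class decode_status { ok, partial, error };

struct decode_result {
    decode_status status;
    std::uint32_t code_point;  // meaningful only when status == ok
    std::size_t   length;      // bytes consumed when status == ok
};

// Decode one UTF-8 sequence from [first, last). Anything that would yield
// a code point above max_cp is reported as an error, not truncated.
inline decode_result decode_one(const unsigned char* first,
                                const unsigned char* last,
                                std::uint32_t max_cp) {
    assert(first != nullptr && last != nullptr);
    const decode_result bad  = {decode_status::error, 0, 0};
    const decode_result more = {decode_status::partial, 0, 0};
    if (first == last) return more;

    const unsigned char b0 = *first;
    std::size_t len;
    std::uint32_t cp, min_cp;
    if (b0 < 0x80)      { len = 1; cp = b0;        min_cp = 0; }
    else if (b0 < 0xC2) { return bad; }  // stray continuation or overlong lead
    else if (b0 < 0xE0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if (b0 < 0xF0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if (b0 < 0xF5) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else                { return bad; }  // 0xF5..0xFF never occur in UTF-8

    for (std::size_t i = 1; i < len; ++i) {
        if (first + i == last) return more;        // sequence is truncated
        const unsigned char b = first[i];
        if ((b & 0xC0) != 0x80) return bad;        // not a continuation byte
        cp = (cp << 6) | (b & 0x3Fu);
    }
    if (cp < min_cp) return bad;                   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return bad;  // surrogate code points
    if (cp > 0x10FFFF || cp > max_cp) return bad;  // too wide for the target
    return {decode_status::ok, cp, len};
}
```

Called with max_cp = 0xFFFF, the four bytes F0 9D 84 9E
(U+1D11E, a musical symbol) come back as error, while
E2 82 AC (U+20AC) decodes normally.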

UCS2 DOES support nearly all of the world's scripts
right now, but there are things in the supplementary
planes, like musical symbols, that people like me like,
as are many of the ancient scripts, and going forward,
Unicode plans to put codes for most scripts awaiting
encoding in the supplementary planes. It's all a mess
and quite frustrating. I really wish the new C++
Standard had deprecated std::codecvt<wchar_t, char,
std::mbstate_t> and encouraged everyone to use
std::codecvt<char32_t, char, std::mbstate_t> in its
place. Their job must be hard. I say, just throw it
away! It would be cleaner and much more elegant, but of
course they can't, since they told people to use it
before, left wiggle room on the size of wchar_t by
saying only that it had to be at least the size of a
char, and they want existing code to compile.

Patrick

