Subject: Re: [boost] boost utf-8 code conversion facet has security problems
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2010-10-18 18:11:51


  On 10/18/2010 12:36 AM, Sebastian Redl wrote:
> ...elision by patrick...
>> shall be "an integer constant of the form yyyymmL
>> (for example, 199712L), intended to indicate that
>> values of type wchar_t are the coded representations
>> of the characters defined by ISO/IEC 10646, along
>> with all amendments and technical corrigenda as of
>> the specified year and month." Of course Microsoft
>> isn't able to define that, since you can't hold 20
>> bits in a 16 bit data type.
> Then that implies that it can only hold UCS2. That's
> a choice. In C99, the type wchar_t is officially
> intended to be used only for 32-bit ISO 10646 values,
> independent of the currently used locale. C99
> subclause 6.10.8 specifies that the value of the
> macro __STDC_ISO_10646__
>
> Microsoft defines wchar_t to be a UTF-16 2-byte unit,
> screw the standards.
Does that mean that a Microsoft Visual C++ supplied
codecvt_utf8_facet<wchar_t,char,mbstate_t> would convert
to UTF-16, or UCS2? Wouldn't be that hard to make it
aware of Microsofty and do the wrong thing, but
1) Shouldn't we follow the spec?
2) Wouldn't we annoy those on Windows who read specs
and expected UCS2?
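
To make the UTF-16 versus UCS2 distinction concrete, here is a minimal sketch of what each can hold. The helper names encode_utf16 and fits_in_ucs2 are hypothetical, made up for illustration; they are not part of Boost or the standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Returns an empty vector for values that are not valid scalar values.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return {};  // out of range, or a lone surrogate
    }
    if (cp < 0x10000) {
        return {static_cast<std::uint16_t>(cp)};  // one unit, same as UCS2
    }
    cp -= 0x10000;
    assert(cp <= 0xFFFFF);  // exactly the 20 bits that need two units
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),   // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))  // low surrogate
    };
}

// Hypothetical helper: UCS2 has no surrogate mechanism, so only
// non-surrogate code points below U+10000 fit in a single 16-bit unit.
bool fits_in_ucs2(std::uint32_t cp) {
    return cp < 0x10000 && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

For U+1F600, encode_utf16 yields the two units 0xD83D 0xDE00, while fits_in_ucs2 reports false. A facet that targets UCS2 would have to reject that input, whereas one that targets UTF-16 would emit the pair.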

I suspect it would be better to just provide the
specializations the current draft standard calls for:

template<class Elem, unsigned long Maxcode = 0x10ffff,
codecvt_mode Mode = (codecvt_mode)0>
class codecvt_utf8
: public codecvt<Elem, char, mbstate_t> {
// convert between UCS2 or UCS4 and utf-8
};

template<class Elem, unsigned long Maxcode = 0x10ffff,
codecvt_mode Mode = (codecvt_mode)0>
class codecvt_utf16
: public codecvt<Elem, char, mbstate_t> {
// convert between UCS2 or UCS4 and utf-16
};

template<class Elem, unsigned long Maxcode = 0x10ffff,
codecvt_mode Mode = (codecvt_mode)0>
class codecvt_utf8_utf16
: public codecvt<Elem, char, mbstate_t> {
// convert between utf-16 and utf-8 - this one works
// for microsoft
};

Elem can be wchar_t, char16_t, or char32_t, and Mode
lets you specify that a BOM header be generated or
consumed, as well as little-endian output instead of
the default big-endian.
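
As a rough usage sketch, assuming an implementation that ships these facets in <codecvt> as they were standardized in C++11 (the header and std::wstring_convert were later deprecated in C++17), the Mode flags can be combined like this. The function names utf8_to_utf16 and ucs4_to_utf16le_bytes are hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> UTF-16 code units, using codecvt_utf8_utf16.
std::u16string utf8_to_utf16(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(bytes);  // throws std::range_error on bad input
}

// UCS4 -> UTF-16 byte stream, requesting little-endian output
// and a generated BOM header through the Mode parameter.
std::string ucs4_to_utf16le_bytes(const std::u32string& text) {
    constexpr auto mode = static_cast<std::codecvt_mode>(
        std::generate_header | std::little_endian);
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, mode>, char32_t> conv;
    return conv.to_bytes(text);
}
```

The mode values are bit flags, so they are OR-ed together and cast back to codecvt_mode before being passed as the third template argument.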

On Microsoft, the third might be specialized:

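template<unsigned long Maxcode, codecvt_mode Mode>
class codecvt_utf8_utf16<wchar_t, Maxcode, Mode>
: public codecvt<wchar_t, char, mbstate_t> {
// stuff here; a partial specialization cannot declare
// its own defaults, they come from the primary template
};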

Patrick

