Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Jeremy Maitin-Shepard (jeremy_at_[hidden])
Date: 2011-04-26 21:28:18


On 04/26/2011 02:17 AM, Sebastian Redl wrote:
> On 25.04.2011 22:31, Jeremy Maitin-Shepard wrote:
>> We can assume that the compiler knows the correct character set of the
>> source code file, as trying to fool it would seem to be inherently
>> error prone. This seems to rule out the possibility of char * literals
>> containing UTF-8 encoded text on MSVC, until C++1x Unicode literals
>> are supported.
>>
>> The biggest nuisance is that we need to know the compile-time
>> character set/encoding (so that we know how to interpret
>> "narrow" string literals), and there does not appear to be any
>> standard way in which this is recorded (maybe I'm mistaken though).
> The source character set is pretty much irrelevant. It's the execution
> character set that is problematic. A compiler will translate string
> literals in the source from the source character set to the execution
> character set for storage in the binary.
> GCC has options to control both the source (-finput-charset) and the
> execution character set (-fexec-charset). They both default to UTF-8.
> However, MSVC is more complicated. It will try to auto-detect the source
> character set, but while it can detect UTF-16, it will treat everything
> else as the system narrow encoding (usually a Windows-xxxx codepage)
> unless the file starts with a UTF-8-encoded BOM. The worse problem is
> that, except for a very new, poorly documented, and probably
> experimental pragma, there is *no way* to change MSVC's execution
> character set away from the system narrow encoding.
>
> So let's assume that further down, it's the execution set that's known.

Yes, it is the execution character set that I meant. (I assumed that, as
is the case for MSVC, the execution character set is the same as the
character set given by the current locale at compile time.)
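
To make the point concrete, here is a minimal sketch of how a program could
inspect which bytes the compiler stored for a narrow literal. The helper
names (is_utf8_e_acute, execution_charset_looks_like_utf8) are hypothetical
and exist only for illustration. The probe uses a universal-character-name,
so the source file stays pure ASCII and only the execution character set
decides what ends up in the binary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if the n bytes at s are exactly the UTF-8
// encoding of U+00E9 (LATIN SMALL LETTER E WITH ACUTE), i.e. 0xC3 0xA9.
inline bool is_utf8_e_acute(const char* s, std::size_t n) {
    assert(s != nullptr);
    return n == 2
        && static_cast<unsigned char>(s[0]) == 0xC3
        && static_cast<unsigned char>(s[1]) == 0xA9;
}

// The compiler translates this literal into the execution character set.
// With a UTF-8 execution charset it is two bytes; with a single-byte
// codepage such as Windows-1252 it would be the single byte 0xE9.
static const char probe[] = "\u00e9";

// Hypothetical runtime check: did the compiler store the probe as UTF-8?
inline bool execution_charset_looks_like_utf8() {
    return is_utf8_e_acute(probe, sizeof(probe) - 1);
}
```

This only detects the encoding after the fact; it does not give the library
a standard way to learn the execution charset at compile time.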
>
>> By knowing the compile-time character set, all ambiguity is removed.
>> The translation database can be assumed to be keyed based on UTF-8, so
>> to translate a message, it needs to be converted to UTF-8. There
>> should presumably be versions of the translation functions that take
>> narrow strings, wide strings, and additional versions for the C++1x
>> unicode literals once they are supported by compilers (I expect that
>> to be very soon, at least for some compilers). If a wide string is
>> specified, it will be assumed to be in UTF-16 or UTF-32 depending on
>> sizeof(wchar_t), and converted to UTF-8. UTF-32 is generally
>> undesirable, I imagine, but in practice should nonetheless work and
>> using wide strings might be the best approach for code that needs to
>> compile on both Windows and Linux. For the narrow version, if the
>> compile-time narrow encoding is UTF-8, the conversion is a no-op.
>> Otherwise, the conversion will have to be done. (The C++1x u8 literal
>> version would naturally require no conversion also.)
>
> The issue with making the narrow version automatically transcode the
> input from the narrow encoding to UTF-8 is that it is a compatibility
> issue with C++11 u8 literals. For some reason, there is no way in the
> type system to distinguish between normal narrow and u8 literals. In
> other words, if you ever make the translate() functions assume a narrow
> literal to be in the locale character set, you can't use u8 literals
> there anymore.

This is a problem with every interface that takes char * arguments; it
isn't specific to this particular case. One solution is to use a
different function name, since it isn't possible to overload on type.
Another solution, specific to this case, is for the user to declare via a
preprocessor define that the execution charset is UTF-8, in which case no
conversion is done at all, and the user just has to make sure not to pass
non-UTF-8 narrow strings. [Aside: It seems that only MSVC users
(or users that want to write code that is portable to MSVC) will bother
to use the u8 prefix at all, since on GCC it by default has no effect,
given the default execution charset of UTF-8.]
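
A rough sketch of both suggestions follows. Every name in it (key_from_utf8,
key_from_narrow, key_from_literal, MY_EXEC_CHARSET_IS_UTF8, latin1_to_utf8)
is hypothetical and is not part of Boost.Locale. The Latin-1 transcoder is
only a stand-in for a real conversion from the execution charset, chosen
because each byte maps directly to one code point.

```cpp
#include <cassert>
#include <string>

// Stand-in transcoder (assumption): treats the input as ISO-8859-1 and
// re-encodes it as UTF-8. A real library would convert from whatever the
// execution character set actually is.
inline std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (char ch : in) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += static_cast<char>(c);                // ASCII passes through
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));  // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return out;
}

// Suggestion 1: distinct names, because a u8 literal and an ordinary
// narrow literal have the same type and cannot be told apart by overloading.
inline std::string key_from_utf8(const char* id) {
    assert(id != nullptr);
    return id;                      // already UTF-8: no conversion
}

inline std::string key_from_narrow(const char* id) {
    assert(id != nullptr);
    return latin1_to_utf8(id);      // execution charset -> UTF-8
}

// Suggestion 2: one name, with the user promising via a define that the
// execution charset is UTF-8. Passing non-UTF-8 text is then the user's bug.
inline std::string key_from_literal(const char* id) {
#ifdef MY_EXEC_CHARSET_IS_UTF8
    return key_from_utf8(id);
#else
    return key_from_narrow(id);
#endif
}
```

With the define set, key_from_literal accepts u8 literals unchanged; without
it, u8 literals would be transcoded a second time, which is exactly the
compatibility problem described above.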

