Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2011-04-26 05:17:01


On 25.04.2011 22:31, Jeremy Maitin-Shepard wrote:
> We can assume that the compiler knows the correct character set of the
> source code file, as trying to fool it would seem to be inherently
> error prone. This seems to rule out the possibility of char *
> literals containing UTF-8 encoded text on MSVC, until C++1x Unicode
> literals are supported.
>
> The biggest nuisance is that we need to know the compile-time
> character set/encoding (so that we know how to interpret
> "narrow" string literals), and there does not appear to be any
> standard way in which this is recorded (maybe I'm mistaken though).
The source character set is pretty much irrelevant. It's the execution
character set that is problematic. A compiler will translate string
literals in the source from the source character set to the execution
character set for storage in the binary.
GCC has options to control both the source (-finput-charset) and the
execution character set (-fexec-charset). They both default to UTF-8.
However, MSVC is more complicated. It will try to auto-detect the source
character set, but while it can detect UTF-16, it will treat everything
else as the system narrow encoding (usually a Windows-xxxx codepage)
unless the file starts with a UTF-8-encoded BOM. The worse problem is
that, except for a very new, poorly documented, and probably
experimental pragma, there is *no way* to change MSVC's execution
character set away from the system narrow encoding.

So for the rest of this message, let's assume it's the execution
character set that's known.

> By knowing the compile-time character set, all ambiguity is removed.
> The translation database can be assumed to be keyed based on UTF-8, so
> to translate a message, it needs to be converted to UTF-8. There
> should presumably be versions of the translation functions that take
> narrow strings, wide strings, and additional versions for the C++1x
> Unicode literals once they are supported by compilers (I expect that
> to be very soon, at least for some compilers). If a wide string is
> specified, it will be assumed to be in UTF-16 or UTF-32 depending on
> sizeof(wchar_t), and converted to UTF-8. UTF-32 is generally
> undesirable, I imagine, but in practice should nonetheless work and
> using wide strings might be the best approach for code that needs to
> compile on both Windows and Linux. For the narrow version, if the
> compile-time narrow encoding is UTF-8, the conversion is a no-op.
> Otherwise, the conversion will have to be done. (The C++1x u8 literal
> version would naturally require no conversion also.)

The issue with making the narrow version automatically transcode its
input from the narrow encoding to UTF-8 is compatibility with C++11 u8
literals. For some reason, there is no way in the type system to
distinguish between ordinary narrow literals and u8 literals: both have
type const char[]. In other words, if you ever make the translate()
functions assume a narrow literal is in the locale character set, you
can't use u8 literals there anymore.

Sebastian

