Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Jeremy Maitin-Shepard (jeremy_at_[hidden])
Date: 2011-04-27 18:32:18

On 04/27/2011 03:11 PM, Mathias Gaunard wrote:
> On 27/04/2011 21:42, Jeremy Maitin-Shepard wrote:
>> Why not simply provide a compile-time or run-time option to allow the
>> user to specify the following:
>> - encoding of narrow keys to be given as char * arguments, or specify
>> that none is to be supported (in which case narrow keys cannot be used
>> at all), the default being UTF-8;
>> - whether wchar_t * arguments are supported (the encoding will be
>> assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t)) [by
>> default, not supported]
>> - whether char16_t arguments are supported [by default, not supported]
>> - whether char32_t arguments are supported [by default, not supported]
>> The library would simply convert the UTF-8 encoded keys in the message
>> catalogs to each of the supported key argument encodings. In most cases,
>> there would only be a single supported encoding. Because the narrow
>> version could be disabled, with Japanese text and UTF-16 wchar_t, this
>> would actually _save_ space since UTF-16 is more efficient than UTF-8
>> for encoding Japanese text.
> Why is it so complicated?
> User gives string and says what encoding it is in, the library converts
> to the catalog encoding and looks it up, then returns the localized
> string, converting again if needed.
> Unlike what Artyom said earlier, converting a string does not
> necessarily require dynamic memory allocation, and localization is not
> particularly performance critical anyway.

It may often not be performance critical, but in some cases it can be.
Consider the case of a web server, where the work done by the
web server machines themselves may essentially just consist of pasting
together strings from various sources. (There is possibly a separate
database server, etc.) This is also precisely the use case for which
Artyom designed the library, I think. In this setting it is fairly
clear why converting the messages once when loaded is better than doing
it when needed.
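To make that trade-off concrete, here is a minimal sketch (invented names, not Boost.Locale's actual API) of a catalog that converts its UTF-8 entries to wide strings once at load time, so the hot-path lookup does no conversion at all:

```cpp
#include <cassert>
#include <codecvt>  // std::codecvt_utf8; deprecated in C++17 but fine for a sketch
#include <locale>
#include <map>
#include <string>

// Convert UTF-8 to the platform's wide encoding. Note: with a 16-bit
// wchar_t, codecvt_utf8 produces UCS-2 (no surrogates); real code
// would use a full UTF-8 -> UTF-16 conversion there.
static std::wstring widen(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.from_bytes(utf8);
}

struct catalog {
    std::map<std::wstring, std::wstring> messages;

    // Called once, when the catalog file is loaded: every key and
    // message is converted here, up front.
    void load(const std::map<std::string, std::string>& utf8_entries) {
        for (std::map<std::string, std::string>::const_iterator it =
                 utf8_entries.begin(); it != utf8_entries.end(); ++it)
            messages[widen(it->first)] = widen(it->second);
    }

    // Hot path: a plain map lookup, no conversion per call.
    const std::wstring& translate(const std::wstring& key) const {
        std::map<std::wstring, std::wstring>::const_iterator it =
            messages.find(key);
        return it != messages.end() ? it->second : key;  // fall back to key
    }
};
```

In the convert-on-lookup design the `widen` cost is paid on every `translate` call instead of once per catalog entry, which is exactly what matters for the string-pasting web server case.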

> If that runtime conversion is a concern, it's also possible to do that
> at compile time, at least with C++0x (syntax is ugly in C++03).

Maybe it can be done, but I don't think it is a practical option.

> Actually, I fail to understand what the problem is.
> Is it just the MSVC BOM problem? I think it should be handled by the
> build system.
>> I agree that it is very unfortunate that wchar_t can mean either UTF-16
>> or UTF-32 depending on the platform
> How is that unfortunate? You can tell which one depending on the size of
> wchar_t.

It is unfortunate simply because it is not uniform, even though that
can be worked around, and also because UTF-32 is generally not wanted.
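For what it's worth, the non-uniformity is at least detectable at compile time; a sketch (C++0x syntax, invented names):

```cpp
#include <cassert>

// The only portable signal for which UTF form wchar_t carries is its
// size, which can be checked at compile time.
enum wide_encoding { enc_utf16, enc_utf32 };

constexpr wide_encoding wchar_encoding() {
    return sizeof(wchar_t) == 2 ? enc_utf16 : enc_utf32;
}

// Reject any odd platform where wchar_t is neither 2 nor 4 bytes.
static_assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4,
              "unexpected wchar_t size");
```

On Windows this yields `enc_utf16`, on typical Linux systems `enc_utf32`; the point stands that code must handle both.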

>> but in practice the same source
>> code containing L"" string literals can be used on both Windows and
>> Linux to reliably specify Unicode string literals (provided that care is
>> taken to ensure the compiler knows the source code encoding). The fact
>> that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient
>> does in some ways render Linux a second-class citizen if a solution
>> based on wide string literals is used for portability, but using UTF-8
>> on MSVC is basically just impossible, rather than merely less efficient,
>> so there doesn't seem to be another option. (Assuming you are unwilling
>> to rely on the Windows "ANSI" narrow encodings.)
> You can always use a macro USTRING("foo") that expands to u8"foo" or
> u"foo" on systems with unicode string literals and L"foo" elsewhere.

You can, but it adds complexity, etc...
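For reference, the suggested macro might look like this (a sketch; `USTRING` and `ustring` are invented names, and the `__cplusplus` check is only an approximation, since some compilers misreport it):

```cpp
#include <cassert>
#include <string>

// Sketch of the USTRING idea from the quote: one macro that expands to
// a Unicode string literal where the compiler supports them (C++0x)
// and to a wide literal elsewhere.
#if __cplusplus >= 201103L
    // C++0x and later: UTF-16 string literal (char16_t).
    #define USTRING(s) u##s
    typedef std::u16string ustring;
#else
    // Older compilers: fall back to wide literals.
    #define USTRING(s) L##s
    typedef std::wstring ustring;
#endif
```

Every string literal in the program now has to go through the macro, and `ustring` changes type across platforms, which is part of the complexity being objected to above.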
