
Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Jeremy Maitin-Shepard (jeremy_at_[hidden])
Date: 2011-04-27 18:32:18


On 04/27/2011 03:11 PM, Mathias Gaunard wrote:
> On 27/04/2011 21:42, Jeremy Maitin-Shepard wrote:
>
>> Why not simply provide a compile-time or run-time option to allow the
>> user to specify the following:
>>
>> - encoding of narrow keys to be given as char * arguments, or specify
>> that none is to be supported (in which case narrow keys cannot be used
>> at all), the default being UTF-8;
>>
>> - whether wchar_t * arguments are supported (the encoding will be
>> assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t)) [by
>> default, not supported]
>>
>> - whether char16_t arguments are supported [by default, not supported]
>>
>> - whether char32_t arguments are supported [by default, not supported]
>>
>> The library would simply convert the UTF-8 encoded keys in the message
>> catalogs to each of the supported key argument encodings. In most cases,
>> there would only be a single supported encoding. Because the narrow
>> version could be disabled, with Japanese text and UTF-16 wchar_t, this
>> would actually _save_ space since UTF-16 is more efficient than UTF-8
>> for encoding Japanese text.
>
> Why is it so complicated?
>
> User gives string and says what encoding it is in, the library converts
> to the catalog encoding and looks it up, then returns the localized
> string, converting again if needed.
>
> Unlike what Artyom said earlier, converting a string does not
> necessarily require dynamic memory allocation, and localization is not
> particularly performance critical anyway.

It may often not be performance critical, but in some cases it is.
Consider a web server where the work done by the web server machines
themselves consists essentially of pasting together strings from
various sources (the database, etc. typically live on separate
machines). This is also, I think, precisely the use case for which
Artyom designed the library. In that setting it is fairly clear why
converting the messages once, when the catalog is loaded, is better
than converting on every lookup.
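
A rough sketch of what I mean (the names here are purely
illustrative, not Boost.Locale's interface): the catalog entries are
converted once, at load time, into the single encoding the user
enabled, so the lookup itself does no conversion and no allocation.

    #include <map>
    #include <string>

    // Illustrative only: keys and messages are stored in whatever
    // single encoding the user enabled (here, narrow UTF-8); a wide
    // build would store std::wstring instead.
    std::map<std::string, std::string> catalog;

    void load_catalog(/* parsed .mo data, keys in UTF-8 */)
    {
        // For each (key, translation) pair in the file:
        //   catalog[to_enabled_encoding(key)] =
        //       to_enabled_encoding(translation);
    }

    const std::string & translate(const std::string & key)
    {
        std::map<std::string, std::string>::const_iterator it =
            catalog.find(key);
        // No conversion and no allocation on the hot path.
        return it != catalog.end() ? it->second : key;
    }

Converting on lookup instead means converting the key to the catalog
encoding on every call, and possibly converting the result back on
the way out.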

>
> If that runtime conversion is a concern, it's also possible to do that
> at compile time, at least with C++0x (syntax is ugly in C++03).

Maybe it can be done, but I don't think it is a viable option in
practice.

>
> Actually, I fail to understand what the problem is.
> Is it just the MSVC BOM problem? I think it should be handled by the
> build system.
>
>
>> I agree that it is very unfortunate that wchar_t can mean either UTF-16
>> or UTF-32 depending on the platform
>
> How is that unfortunate? You can tell which one depending on the size of
> wchar_t.

It is unfortunate simply because it is not uniform, even though that
can be worked around, and because UTF-32 is generally not what anyone
wants.
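
The workaround is mechanical, something along these lines (a sketch,
assuming the platform uses a Unicode encoding for wchar_t at all):

    // Pick the wchar_t encoding from its size at compile time.
    template <int Size> struct wchar_encoding;  // undefined otherwise

    template <> struct wchar_encoding<2>
    { static const char * name() { return "UTF-16"; } };

    template <> struct wchar_encoding<4>
    { static const char * name() { return "UTF-32"; } };

    typedef wchar_encoding<sizeof(wchar_t)> platform_wchar_encoding;

but every piece of code that crosses a wide-string API boundary has
to dispatch on this, one way or another.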

>
>
>> but in practice the same source
>> code containing L"" string literals can be used on both Windows and
>> Linux to reliably specify Unicode string literals (provided that care is
>> taken to ensure the compiler knows the source code encoding). The fact
>> that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient
>> does in some ways render Linux a second-class citizen if a solution
>> based on wide string literals is used for portability, but using UTF-8
>> on MSVC is basically just impossible, rather than merely less efficient,
>> so there doesn't seem to be another option. (Assuming you are unwilling
>> to rely on the Windows "ANSI" narrow encodings.)
>
> You can always use a macro USTRING("foo") that expands to u8"foo" or
> u"foo" on systems with unicode string literals and L"foo" elsewhere.

You can, but it adds complexity: the macro yields a different
character type on each platform, so everything it is passed to has to
be overloaded or templated on the character type.
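
For reference, a sketch of such a macro (HAVE_UNICODE_LITERALS is a
hypothetical configuration macro here; a real version would be set by
the build system or by testing compiler versions):

    #if defined(HAVE_UNICODE_LITERALS)
    #  define USTRING(s) u8 ## s  // UTF-8 literal on C++0x compilers
    #else
    #  define USTRING(s) L ## s   // wide literal elsewhere
    #endif

    // Usage: translate(USTRING("file not found"));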

