Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Jeremy Maitin-Shepard (jeremy_at_[hidden])
Date: 2011-04-26 22:08:22


On 04/25/2011 11:56 PM, Artyom wrote:
>> From: Jeremy Maitin-Shepard<jeremy_at_[hidden]>
>>
>> The most significant complaint seems to be the fact that
>> the translation interface is limited to ASCII (or maybe UTF-8
>> is also supported, it isn't entirely clear).
>>
>> [snip]
>>
>> I imagine relative to the work required for the whole library,
>> these changes would be quite trivial, and might very well
>> transform the library from completely unacceptable to
>> acceptable for a number of objectors on the list,
>> while having essentially no impact on those that
>> are happy to use the library as is.
>>
>
> I can say a few words on what can be done and what will never be done.
>
> I will never support wide, char16_t or char32_t strings as keys.

It seems that it is mostly possible to get the desired results using
only char * strings as keys, but there is one limitation: it is not
possible to represent strings containing characters that don't fit in a
single non-Unicode character set, e.g. it seems it would not be possible
to have a char * string literal with both Japanese and Hebrew text. As
this is unlikely to be needed, it might be a reasonable limitation, though.

However, I don't see why you are so opposed to providing additional
overloads. With MSVC currently, only wide strings can represent the
full range of Unicode. You could provide the definitions in an
alternate static/dynamic library from the char * overloads, so that
there would not even be any substantial space overhead.
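
As a rough sketch of what such an additional overload might look like: the wide-key overload below converts the key to UTF-8 at call time and forwards to the narrow lookup, so the keys are stored only once. This is not Boost.Locale's interface; `catalog`, `translate`, `append_utf8`, and `wide_to_utf8` are hypothetical names, and the sketch assumes the catalog keys are UTF-8. It trades a per-call conversion for memory.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical catalog: UTF-8 keys mapped to UTF-8 translations.
using catalog = std::map<std::string, std::string>;

// Append one Unicode code point to `out` as UTF-8.
// No validation of lone surrogates is attempted in this sketch.
inline void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a wide string to UTF-8. Handles 16-bit wchar_t (UTF-16, as on
// MSVC) by combining surrogate pairs, and 32-bit wchar_t (UTF-32) directly.
inline std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = static_cast<char32_t>(in[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < in.size()) {
            char32_t lo = static_cast<char32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // consumed the low surrogate
            }
        }
        append_utf8(out, cp);
    }
    return out;
}

// Narrow overload: the only one that touches the stored keys.
// Returns the key itself when no translation is found.
inline std::string translate(const catalog& c, const std::string& key) {
    auto it = c.find(key);
    return it == c.end() ? key : it->second;
}

// Wide overload: converts the key and forwards; stores nothing extra.
inline std::string translate(const catalog& c, const std::wstring& key) {
    return translate(c, wide_to_utf8(key));
}
```

With this arrangement a caller could write `translate(c, L"...")` with a wide literal, and the lookup would still go through the single table of narrow keys.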

>
> Current interface provides facet that has
>
>
> template<typename CharType>
> class messages_facet {
> ...
> virtual CharType const *get(int domain_id, char const *msg) const = 0;
> ...
>
> And 2 or 4 instantiations of it are installed: messages_facet<char>,
> messages_facet<wchar_t>, messages_facet<char16_t> and messages_facet<char32_t>.
>
> Supporting
>
> virtual CharType const *get(int domain_id, char const *msg) const = 0;
> virtual CharType const *get(int domain_id, wchar_t const *msg) const = 0;
> virtual CharType const *get(int domain_id, char16_t const *msg) const = 0;
> virtual CharType const *get(int domain_id, char32_t const *msg) const = 0;
>
> is just a waste of memory, as for fastest comparison each source string
> would have to be converted to 4 variants, or else converted in runtime... Wasteful.
>
> Thus I would only consider supporting "char const *" literals.
>
> One possibility is to provide, on a per-domain basis, a key in the po file,
> "X-Boost-Locale-Source-Encoding", so the user would be able to specify in
> the special record (which exists in all message catalogs) something
> like:
>
> "X-Boost-Locale-Source-Encoding: windows-936"
> or
> "X-Boost-Locale-Source-Encoding: UTF-8"
>
> Then, when the catalog is loaded, its keys would be converted
> to the encoding given by X-Boost-Locale-Source-Encoding.

The source encoding isn't a property of the message catalog, but rather
a property of the program itself, so it seems it should be specified in
the program and not in the message catalog. Something like the
preprocessor define I mentioned would be a way to do this.
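
A minimal sketch of the program-side approach, under stated assumptions: the macro `MYAPP_SOURCE_ENCODING` and the `catalog_options` struct are hypothetical names invented for illustration and do not exist in Boost.Locale. The idea is only that the build, which knows how the sources are encoded, supplies the encoding, and the program hands it to whatever loads the catalog.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro, normally supplied by the build system, for example
//   -DMYAPP_SOURCE_ENCODING="\"windows-936\""
// Falls back to UTF-8 when the build does not define it.
#ifndef MYAPP_SOURCE_ENCODING
#define MYAPP_SOURCE_ENCODING "UTF-8"
#endif

// Hypothetical options handed to a catalog loader.
struct catalog_options {
    std::string domain;        // message domain, e.g. "myprogram"
    std::string key_encoding;  // encoding of the char const * keys in the sources
};

// Build the options for one domain; the key encoding comes from the
// program (via the macro), not from a header inside the catalog file.
inline catalog_options make_catalog_options(const std::string& domain) {
    assert(!domain.empty());
    return catalog_options{domain, MYAPP_SOURCE_ENCODING};
}
```

The catalog file itself then needs no extra header, and the same catalog could be shared by programs whose sources use different encodings.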

>
>
> So if you are an MSVC user and you really want to have localized keys
> you have the following options:
>
> Option A:
> ---------
>
> source.cpp: // without bom windows-936 encoded
>
> #pragma setlocale("Japanese_Japan.936")
>
> translate("平和"); // L"平和" works well
>
> wcout<< translate("「平和」"); // convert in runtime from cp936 to UTF-16
> cout<< translate("「平和」"); // convert in runtime from cp936 to UTF-8
> [snip]

When you say "convert in runtime", it seems you actually mean the keys
will be converted from UTF-8 to cp936 when the messages are loaded, but
the values will remain UTF-8. Untranslated strings would have to be
converted, I suppose.
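
To make that reading concrete, here is a small sketch of the two conversion points: keys are converted once, when the catalog is loaded, while values stay UTF-8, and only an untranslated key is converted at lookup time. This is an assumption about how it could work, not the library's implementation. Latin-1 stands in for the source encoding purely so the example is self-contained (a real cp936 conversion would need iconv, ICU, or the Windows API); `key_converted_catalog`, `latin1_to_utf8`, and `utf8_to_latin1` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Latin-1 -> UTF-8: bytes >= 0x80 become a two-byte sequence.
inline std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char ch : in) {
        if (ch < 0x80) {
            out += static_cast<char>(ch);
        } else {
            out += static_cast<char>(0xC0 | (ch >> 6));
            out += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return out;
}

// UTF-8 -> Latin-1. Only handles U+0000..U+00FF and assumes
// well-formed input; anything else trips the assert.
inline std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char ch = static_cast<unsigned char>(in[i]);
        if (ch < 0x80) {
            out += static_cast<char>(ch);
        } else {
            assert((ch == 0xC2 || ch == 0xC3) && i + 1 < in.size());
            unsigned char next = static_cast<unsigned char>(in[++i]);
            out += static_cast<char>(((ch & 0x03) << 6) | (next & 0x3F));
        }
    }
    return out;
}

class key_converted_catalog {
public:
    // `utf8_entries` holds (msgid, msgstr) pairs as stored in the file,
    // both UTF-8. Keys are converted to the source encoding once, here;
    // entries with an empty msgstr count as untranslated and are skipped.
    explicit key_converted_catalog(
        const std::vector<std::pair<std::string, std::string>>& utf8_entries) {
        for (const auto& e : utf8_entries) {
            if (!e.second.empty()) {
                table_[utf8_to_latin1(e.first)] = e.second;
            }
        }
    }

    // Look up a key given in the source encoding and return UTF-8 text.
    // Translated: the stored value is returned with no conversion.
    // Untranslated: the key itself is converted at lookup time.
    std::string get_utf8(const std::string& source_key) const {
        auto it = table_.find(source_key);
        if (it != table_.end()) {
            return it->second;
        }
        return latin1_to_utf8(source_key);
    }

private:
    std::map<std::string, std::string> table_;
};
```

The per-lookup cost in this arrangement falls only on untranslated strings, which is the case I was supposing above.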

> Option B:
> ---------
>
> source.cpp: // with BOM UTF-8 encoded, still windows-936 locale
>
> #pragma setlocale("Japanese_Japan.936")
>
> translate("平和"); // with MSVC this would actually be cp936
> // L"平和" works well
>
> wcout<< translate("「平和」"); // convert in runtime from cp936 to UTF-16
> cout<< translate("「平和」"); // convert in runtime from cp936 to UTF-8
> [snip]

Okay, same as Option A, except that it is possible to specify wide
literals using the full range of Unicode characters, rather than being
limited to the local charset.

>
> Option C (in future C++11):
> ---------
>
> source.cpp: // with BOM UTF-8 encoded
>
> translate(u8"平和"); // Would be utf-8
> // L"平和" works well
> wcout<< translate(u8"「平和」"); // convert in runtime from UTF-8 to UTF-16
> cout<< translate(u8"「平和」"); // no conversion, just copied to the stream as is

Clearly this is a good solution, if only it were supported.

>
> Option D (works now):
> ---------
>
> source.cpp: // without BOM, UTF-8 encoded
>
> translate("平和"); // MSVC would use it as UTF-8
> // L"平和" does not work!!
> [snip]

I think it is obvious this isn't a feasible solution, as this breaks
wide string literals, which are likely to be needed by anyone using MSVC.

>
> wcout<< translate("「平和」"); // convert in runtime from UTF-8 to UTF-16
> cout<< translate("「平和」"); // no conversion, just copied to the stream as is
>
>
> myprogram.po:
> msgid ""
> msgstr ""
> "Content-Type: charset=UTF-8\n"
> # it would assume UTF-8 sources
>
> msgid "平和"
> msgstr "שלום"
>
> # not translated
> msgid "「平和」"
> msgstr ""
>
> This can be done and I can implement it.
> But do not expect anything beyond this.
>
> Also note that converting a message from cp936 to for example
> windows-1255 (Hebrew narrow windows encoding) would swap out all
> non-ASCII characters...

I'm not exactly sure why a conversion like this might happen, and it is
also not clear that it is a serious problem. (Likely the Hebrew speaker
would not be able to read Japanese anyway.)

>
> But this is the problem of the developer who has chosen to use
> non-ASCII keys.

