Subject: Re: [boost] [locale] Formal review of Boost.Locale library EXTENDED
From: Ryou Ezoe (boostcpp_at_[hidden])
Date: 2011-04-19 16:54:25


On Tue, Apr 19, 2011 at 11:31 PM, Soares Chen Ruo Fei
<crf_at_[hidden]> wrote:
> Ryou Ezoe wrote:
>> What I want is for translate() to accept wchar_t const * and
>> std::wstring as parameters, just as it accepts char const * and
>> std::string, and then return the corresponding translated text.
>> The encoding of wchar_t is unspecified in the Standard, but
>> in the current MS-Windows environment it should be treated as UTF-16.
>>
>> Converting it to UTF-8 is an implementation detail.
>> I don't care which UTF it uses internally,
>> as long as it supports real UCS (all code points defined in UCS).
>>
>> But treating it as UCS rather than as a binary string is better.
>>
>> Assuming we have a C++0x compiler and the encoding of wchar_t is UTF-16,
>> translate(u8"text"), translate(u"text"), translate(U"text") and
>> translate(L"text")
>> all return the same translated text according to the locale.
>> This is good.
>
> I suppose that you are probably fine with the requirement that the
> supplied text must be in one of the Unicode encodings, because
> otherwise translating from text in shift-JIS or arbitrary encodings is
> likely to be a mess from a technical perspective.
>
> I think that what we really need is to enforce the character set used
> in Boost.Locale, not the language. It just happens that Artyom chose
> the ASCII character set, which doesn't support most other languages. I
> don't see any technical reasons to enforce the language used for
> translating, but there are many technical reasons to enforce a
> particular encoding. We can just change the encoding used from ASCII
> to UCS, and that wouldn't technically make much difference. The only
> problem with using Unicode as the translation key is the normalization
> issues. Since normalization is too heavyweight, the translation system
> should probably operate at code point level, though translations of
> identical original text with different code points will then fail.

I don't expect perfect normalization.
I think it's not possible.
I just want libraries to be UCS aware.
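
To make "operate at code point level" concrete, here is a minimal sketch,
assuming a C++0x compiler with char16_t/char32_t literals. The names
decode_utf16 and code_point_key_example are hypothetical and not part of
Boost.Locale. It shows that the same text in UTF-16 and UTF-32 yields the
same code point sequence, while canonically equivalent text written with
different code points does not.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Boost.Locale API): decode UTF-16 into a
// sequence of code points. Unpaired surrogates are passed through
// unchanged; a real library would probably reject them.
std::u32string decode_utf16(const std::u16string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()) {
            char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine a high/low surrogate pair into one code point.
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        out += u;
    }
    return out;
}

void code_point_key_example()
{
    // Same text in two encodings: identical code point sequence.
    assert(decode_utf16(u"\u65E5\u672C\u8A9E") == U"\u65E5\u672C\u8A9E");

    // A code point outside the BMP is a surrogate pair in UTF-16.
    assert(decode_utf16(u"\U00020BB7") == U"\U00020BB7");

    // Canonically equivalent text (precomposed vs. 'e' + combining acute)
    // has different code points, so a code point level lookup would miss.
    assert(decode_utf16(u"\u00E9") != decode_utf16(u"e\u0301"));
}
```

A lookup keyed on the decoded sequence would treat all Unicode encodings
alike without needing a normalizer, which is all I am asking for.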

>
> I have one suggestion to overcome GNU Gettext's limitation. Perhaps we
> can automatically convert the text into Unicode escaped sequences
> before passing to GNU Gettext, so "日本語" in UTF-8 will become
> "\\u65E5\\u672C\\u8A9E" in ASCII.

Why do you need to escape it?
Why do you want to stick with ASCII?
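
For reference, the escaping step being proposed would look roughly like
the sketch below. escape_non_ascii is a hypothetical name, not a function
of Boost.Locale or GNU Gettext, and the \uXXXX / \UXXXXXXXX output format
is my assumption based on the example in the quoted text. The decoder is
deliberately minimal: it rejects malformed lead and continuation bytes
but does not check for overlong forms.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 and produce an ASCII-only string in
// which every non-ASCII code point is written as \uXXXX (or \UXXXXXXXX
// above U+FFFF). A literal backslash in the input is copied as-is, so
// the mapping is not reversible without an extra escaping rule.
std::string escape_non_ascii(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[11]; // "\U" + 8 hex digits + terminator
            if (cp <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
            else
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Every message would pay for this decode-and-format pass before lookup,
and the keys in the catalog would no longer be readable by translators.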

UCS and its encodings UTF-8, UTF-16 and UTF-32 will be specified in the
upcoming C++0x standard.
On the other hand, the standard still does not mention ASCII.
The basic source character set does not cover all ASCII characters,
so relying on ASCII is not portable.
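
As a small illustration, the sketch below checks a character against the
96 characters of the basic source character set as I read it in the
C++03 and C++0x wording. in_basic_source_charset is a hypothetical
helper written for this message. Three printable ASCII characters,
'$', '@' and '`', are not in that set.

```cpp
#include <cstring>

// Hypothetical helper: true if c is one of the 96 characters of the
// basic source character set (letters, digits, 29 punctuation
// characters, space, and four control characters), as listed in the
// C++03 / C++0x wording. Later revisions of the standard may differ.
bool in_basic_source_charset(char c)
{
    static const char allowed[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
        " \t\v\f\n";
    // strchr would match the terminating NUL, so exclude it explicitly.
    return c != '\0' && std::strchr(allowed, c) != nullptr;
}
```

So a message key such as "user@host" or "$5" already steps outside what
the standard guarantees for source text.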

-- 
Ryou Ezoe
