Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Jeremy Maitin-Shepard (jeremy_at_[hidden])
Date: 2011-04-27 15:42:56
On 04/27/2011 12:07 AM, Artyom wrote:
> Here is how the catalog works: it looks the key up in a hash
> table, and as the last stage it compares the strings bytewise.
> It is fast and efficient.
> In order to support L"", "", u"", and U"" I would need to
> create four variants of the same string to keep lookups
> fast (a waste of memory), or I would need to convert the
> string from UTF-16/32 to UTF-8, which means run-time
> memory allocation and conversion.
> So no, I'm not going to do this, especially since
> it is not portable enough.
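For concreteness, the lookup being described might be sketched as follows. This is hypothetical illustration code, not Boost.Locale's actual implementation; the point is that the hash table's final equality check is a bytewise comparison, so it naturally accepts only one character type:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of a message catalog keyed by narrow (UTF-8) strings.
// find() hashes the key and resolves any collision with a bytewise string
// comparison. A wchar_t/char16_t/char32_t key would need either its own
// stored variant or a run-time conversion to UTF-8 before this comparison.
class catalog {
public:
    void add(std::string key, std::string translation) {
        table_.emplace(std::move(key), std::move(translation));
    }
    // gettext convention: return the key itself when no translation exists.
    const char* gettext(const char* key) const {
        auto it = table_.find(key);
        return it != table_.end() ? it->second.c_str() : key;
    }
private:
    std::unordered_map<std::string, std::string> table_;
};
```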
Why not simply provide a compile-time or run-time option to allow the
user to specify the following:
- encoding of narrow keys to be given as char * arguments, or specify
that none is to be supported (in which case narrow keys cannot be used
at all), the default being UTF-8;
- whether wchar_t * arguments are supported (the encoding will be
assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t)) [by
default, not supported]
- whether char16_t arguments are supported [by default, not supported]
- whether char32_t arguments are supported [by default, not supported]
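Such options could be exposed as compile-time switches, roughly like this (all macro names here are invented for illustration; this is not real Boost.Locale API):

```cpp
// Hypothetical compile-time configuration sketch for the options above.

// Encoding assumed for narrow char* keys, defaulting to UTF-8 per the
// proposal; a build could define it away entirely to reject narrow keys.
#ifndef BOOST_LOCALE_NARROW_KEY_ENCODING
#  define BOOST_LOCALE_NARROW_KEY_ENCODING "UTF-8"
#endif

// Opt-in support for the other literal types, all off by default:
// #define BOOST_LOCALE_SUPPORT_WIDE_KEYS    // L""  (UTF-16 or UTF-32, by sizeof(wchar_t))
// #define BOOST_LOCALE_SUPPORT_CHAR16_KEYS  // u""
// #define BOOST_LOCALE_SUPPORT_CHAR32_KEYS  // U""
```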
The library would simply convert the UTF-8 encoded keys in the message
catalogs to each of the supported key argument encodings. In most
cases, there would be only a single supported encoding. Because the
narrow version could be disabled, this scheme could actually _save_
space for, say, Japanese text with UTF-16 wchar_t, since UTF-16
encodes Japanese text more compactly than UTF-8 does.
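The load-time conversion could look roughly like this (a sketch, assuming the catalog ships with UTF-8 keys and the user enabled only UTF-32; the decoder is deliberately minimal and assumes valid input):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Minimal UTF-8 -> UTF-32 decoder, for illustration only; assumes valid UTF-8.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        int extra;
        if (b < 0x80)      { cp = b;        extra = 0; }  // 1-byte sequence
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  // 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  // 3-byte sequence
        else               { cp = b & 0x07; extra = 3; }  // 4-byte sequence
        ++i;
        for (int j = 0; j < extra; ++j, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Convert every UTF-8 key once, when the catalog is loaded, so lookups
// with U"" literals need no per-call conversion or allocation.
std::unordered_map<std::u32string, std::string>
index_catalog(const std::unordered_map<std::string, std::string>& utf8_catalog) {
    std::unordered_map<std::u32string, std::string> idx;
    for (const auto& kv : utf8_catalog)
        idx.emplace(utf8_to_utf32(kv.first), kv.second);
    return idx;
}
```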
More to the point, you as a library author can offer this functionality
(since it shouldn't be too much of an implementation burden) even if you
as a user of your library wouldn't want to use it (because you are happy
to provide English string literals).
I agree that it is very unfortunate that wchar_t can mean either UTF-16
or UTF-32 depending on the platform, but in practice the same source
code containing L"" string literals can be used on both Windows and
Linux to reliably specify Unicode string literals (provided that care is
taken to ensure the compiler knows the source code encoding). The fact
that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient
does in some ways render Linux a second-class citizen if a solution
based on wide string literals is used for portability, but using UTF-8
on MSVC is basically just impossible, rather than merely less efficient,
so there doesn't seem to be another option. (Assuming you are unwilling
to rely on the Windows "ANSI" narrow encodings.)
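A library can resolve the wchar_t ambiguity itself at compile time, since sizeof(wchar_t) is 2 on Windows/MSVC and 4 on typical Linux toolchains. A minimal sketch (hypothetical names):

```cpp
#include <cassert>

// Pick the wide-key encoding from sizeof(wchar_t) at compile time:
// 2-byte wchar_t implies UTF-16 (Windows), 4-byte implies UTF-32 (Linux).
enum class wide_encoding { utf16, utf32 };

constexpr wide_encoding detect_wchar_encoding() {
    return sizeof(wchar_t) == 2 ? wide_encoding::utf16
                                : wide_encoding::utf32;
}

// Illustration of the space tradeoff: any BMP code point fits in one
// code unit either way, but a supplementary code point (e.g. U+1F600)
// needs a surrogate pair when wchar_t is 16 bits.
constexpr int units_needed(char32_t cp) {
    return (detect_wchar_encoding() == wide_encoding::utf16 && cp > 0xFFFF)
               ? 2 : 1;
}
```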
>>> One possibility is to provide, on a per-domain basis, a key in the po
>>> file, "X-Boost-Locale-Source-Encoding", so the user would be able to
>>> specify in a special record (which exists in all message catalogs)
>>> something like
>>> "X-Boost-Locale-Source-Encoding: windows-936"
>>> "X-Boost-Locale-Source-Encoding: UTF-8"
>>> Then, when the catalog is loaded, its keys would be converted
>>> to the X-Boost-Locale-Source-Encoding.
>> This isn't a property of the message catalog, but rather a property of
>> the program itself, and therefore should be specified in the program,
>> and not in the message catalog, it would seem. Something like the
>> preprocessor define I mentioned would be a way to do this.
> There are two problems with a define: I want
> translate("foo") to work automatically, and I don't want it to be a
> define. So I either need to provide the encoding in the catalog itself
> or when I provide the domain name (the reason it is done
> per domain is that one part of a project may use UTF-8,
> another cp936, and another plain US-ASCII).
> So I can specify it either when I load a catalog or in the
> catalog itself.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk