From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-04-06 08:30:59


Hi Ferdinand,

>> I am not exactly sure if UTF-8 or UCS-4 is better as
>> universal solution, but some solution is surely needed.
>
> I am afraid there is no universal solution for all users. The easiest
> solution is based on the native basic_string<>, which is specialized for
> char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit)
> usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would
> require another basic_string<> specialization.
>
> UCS-2 held all characters in Unicode 1.1. There was a need for more unique
> numbers and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no
> 4-byte character specialization for basic_string<> in STL yet.

Or, to be exact, there's no agreement on whether wchar_t should be 32-bit or
16-bit. Linux (or gcc specifically) uses 32 bits, and Windows 16, which means
wstring is portably suitable only for UCS-2. Besides, UTF-16 (the extension
of UCS-2) has a mechanism to represent characters outside of the 16-bit space
with two elements (a surrogate pair), which, I suspect, won't work if wchar_t
is 16 bit.
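
To make the two-element mechanism concrete, here is a minimal sketch of how a
code point above the 16-bit range is split into two 16-bit units. The function
name encode_utf16 is a hypothetical helper for illustration only; it is not
part of Boost or the standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (illustration only): encode one Unicode code point
// as one or two 16-bit units, the way UTF-16 does.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);              // must be inside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters

    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit (the UCS-2 case).
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Needs two units: 20 payload bits split into 10 + 10.
        cp -= 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

With a 16-bit wchar_t, a wstring holding such a pair reports a length() of 2
for what is one character, so code that assumes one element per character
sees the two halves separately.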

> Generally speaking UTF encodings (e.g. UTF-8 or UTF-16) are not suitable
> for the fast in-memory usage because some operations, for example at(int)
> and length() have not O(1) but O(n) to give a result.

I believe UTF-8 is popular partly because such operations are believed to be
rare.
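
For reference, this is roughly what the O(n) cost looks like: to count
characters in a UTF-8 string, every byte has to be examined. The sketch below
assumes well-formed UTF-8 input, and utf8_length is a hypothetical name, not
an existing library function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only): count code points in a string
// that is assumed to contain well-formed UTF-8.
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

By contrast, std::string::size() on the same data returns the number of
bytes in constant time, which is all that is needed when the string is only
stored, compared, or copied.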

> They (UTF,
> T=Transformation) are better for storing texts as they save place by using
> variable number of bytes for a character. However UCS (e.g. UCS-2 or
> UCS-4) are fast for memory operations because they use fixed character
> size.

What about characters that are represented with two 16-bit values (that's
what I mentioned above)?

(BTW, for reference to those interested,
     http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
talks about different encodings).

> That is why I would not like to use basic_string<utf8char> in
> memory, rather basic_string<ucs4char> instead, but I would not generalize
> it for all possible applications.

In the case of program options, I suspect that everything will work for UTF-8
strings. IOW, the library does not care that any given 'char' might be part
of some Unicode character. That's why it's attractive to me.
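
A small sketch of why byte-wise processing is safe here: every byte of a
multi-byte UTF-8 sequence has its high bit set, so an ASCII delimiter such as
'=' or '-' can never appear inside one. The function split_option below is a
hypothetical helper written for this illustration; it is not the
program_options interface.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (illustration only, not the program_options API):
// split "--name=value" into (name, value) by looking at plain bytes.
std::pair<std::string, std::string> split_option(const std::string& arg)
{
    // Skip a leading "--" if present.
    std::string::size_type start = (arg.compare(0, 2, "--") == 0) ? 2 : 0;

    // '=' is 0x3D; it cannot occur inside a multi-byte UTF-8 sequence,
    // so a byte-wise search finds only the real delimiter.
    std::string::size_type eq = arg.find('=', start);
    if (eq == std::string::npos)
        return std::make_pair(arg.substr(start), std::string());

    return std::make_pair(arg.substr(start, eq - start), arg.substr(eq + 1));
}
```

Both the name and the value come back as the original UTF-8 bytes, untouched,
so the caller can decode them later or pass them on as they are.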

> You can expect initialization from (const char * argv []) on all platforms
> or (const wchar_t * argv []) on Windows in UCS-2. With the basic_string<>
> you already have support for the parameters from the current locale (char)
> and for parameters in UCS-2.

What do you mean by 'parameters from the current locale'? I am not sure that
ctype::widen is required to care about the user-selected character encoding.
Nor do I think it's required of the default codecvt facet.

If my locale on Linux is ru_RU.KOI8-R, I don't think the standard requires
the codecvt<wchar_t, char> facet in the default instance of 'locale' to do a
meaningful conversion from KOI8-R into Unicode.

> If we take an option to read parameters into basic_string<wchar_t> or
> basic_string<ucs4char_t>, where the character size or encoding is not the
> same as the native encoding on the command line, there is an affinity to
> streams. Some shells allow usage of UTF-8 encoded parameters or,
> generally, usage of characters out of the current locale. It means, that a
> program can choose the way, how to encode all characters from Unicode to
> char. UTF-7/8, etc. I would like to have a solution similar to streams:
> imbue(). Having this, you could convert internally every argv[x] using
> imbue(y) applied to a stringstream, where the facet y provides the caller.

That's right. There should be some mechanism to convert from the
current 8-bit encoding and the source code encoding into Unicode. At least on
Linux there's a mechanism to do the first thing, but I'm not aware of one
for Windows.

> On the other hand, such a conversion can be performed also by a user. The
> parameters sent to main() are char* or wchar_t* and thus program_options
> can give them back just as they are in basic_string<char> and
> basic_string<wchar_t>. The client can use his facet to imbue a
> basic_stringstream<> initialized with the parameter from program_options.
> Or a conversion library could be used to perform the conversion, something
> like lexical_cast<> does for type conversions. It is a matter of
> convenience only - separate conversions library (no support for encoding in
> program_options) or imbue(), performing the conversion inside the
> program_options. However, the conversion should not be implemented for
> program_options only, that is why I suggested an existing interface -
> facets.

I agree with basing the mechanism on facets. OTOH, it really can be made
orthogonal to program options. So initially program_options can support
only ASCII (strict 7-bit) and Unicode, for which conversion is trivial.
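
To show what "trivial" means here, a minimal sketch: each 7-bit character
maps to the wide character with the same numeric value, and anything else is
rejected. The name ascii_to_wide is hypothetical, and the sketch assumes that
wchar_t represents the ASCII range with the same values as Unicode, which is
an assumption about the platform and not something the standard guarantees.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): widen strict 7-bit ASCII.
// Assumes wchar_t uses the Unicode values for the ASCII range.
std::wstring ascii_to_wide(const std::string& s)
{
    std::wstring out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c > 0x7F)
            throw std::invalid_argument("non-ASCII byte in input");
        out.push_back(static_cast<wchar_t>(c));  // same numeric value
    }
    return out;
}
```

Rejecting bytes above 0x7F keeps the question of the real 8-bit encoding out
of the library; that part can then be handled separately by a facet.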

>> > Or did I miss something? Is something like this part of
>> > boost already?
>>
>> Nope :-( Even UTF-8 encoder is not in boost yet.
>
> You can find some converting facets for UTF-8 ready for imbue() to a
> stream in the files section on yahoo. Unfortunately not finished or not
> reviewed...

Yea, I can find some facets, including one written by myself ;-( And yea,
unfortunately they are not in Boost yet.

- Volodya

