From: Ferdinand Prantl (ferdinand.prantl_at_[hidden])
Date: 2004-04-06 07:51:34


Hello,

> From: Vladimir Prus [mailto:ghost_at_[hidden]]
>
> > glib did a very good implementation of UTF-8 handling and Glibmm is
> > a well done C++ wrapper but it lacks the "standardness". Something
> > like boost::ustring COULD bring a widely accepted UTF-8 aware unicode
> > string to C++ programmers. A somewhat relieving thought.
>
> I am not exactly sure if UTF-8 or UCS-4 is better as
> universal solution, but some solution is surely needed.

I am afraid there is no universal solution for all users. The easiest
solution is based on the native basic_string<>, which is specialized for
char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit on
Windows), usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding
would require another basic_string<> specialization.

UCS-2 held all characters in Unicode 1.1. More unique numbers were then
needed, and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no
4-byte character specialization for basic_string<> in STL yet.

Generally speaking, UTF encodings (e.g. UTF-8 or UTF-16) are not suitable
for fast in-memory usage because some operations, for example at(int) and
length(), take O(n) rather than O(1) to give a result. They (UTF,
T=Transformation) are better for storing texts, as they save space by using
a variable number of bytes per character. UCS encodings (e.g. UCS-2 or
UCS-4), however, are fast for memory operations because they use a fixed
character size. That is why I would not like to use basic_string<utf8char>
in memory, but rather basic_string<ucs4char>. I would not generalize this
to all possible applications, though.
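
To illustrate the O(n) versus O(1) difference, here is a minimal sketch.
The function names utf8_length and ucs4_length are hypothetical, the UTF-8
input is assumed to be well-formed, and std::u32string is assumed to be
available (it arrived with C++11, later than this discussion).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts characters (code points) in a UTF-8 string.
// Every byte must be inspected, so the cost is O(n) in the byte length.
// Assumes well-formed UTF-8: continuation bytes look like 10xxxxxx.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Count only bytes that start a character.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// With a fixed 32-bit character size, the character count is simply the
// stored size of the string, which is O(1).
std::size_t ucs4_length(const std::u32string& s)
{
    return s.size();
}
```

Indexing behaves the same way: s[i] on the fixed-width string is a direct
lookup, while finding the i-th character in UTF-8 requires a scan like the
one above.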

You can expect initialization from (const char * argv []) on all platforms
or from (const wchar_t * argv []) on Windows, in UCS-2. With basic_string<>
you already have support for parameters in the current locale (char) and
for parameters in UCS-2 (wchar_t).
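
As a rough sketch of that point, one template can hand back the parameters
unchanged in either character type. The name collect_args is hypothetical
and is not part of program_options.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: copies the command-line parameters (skipping the
// program name in argv[0]) into basic_string<CharT> without any encoding
// conversion. Works for CharT = char and CharT = wchar_t alike.
template <typename CharT>
std::vector<std::basic_string<CharT> >
collect_args(int argc, const CharT* const argv[])
{
    std::vector<std::basic_string<CharT> > args;
    for (int i = 1; i < argc; ++i)
        args.push_back(std::basic_string<CharT>(argv[i]));
    return args;
}
```

A narrow main() could call collect_args<char>(argc, argv), and a wide entry
point on Windows could call collect_args<wchar_t>(argc, argv).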

If we choose to read parameters into basic_string<wchar_t> or
basic_string<ucs4char_t>, where the character size or encoding is not the
same as the native encoding on the command line, there is an affinity to
streams. Some shells allow UTF-8 encoded parameters or, more generally,
characters outside the current locale. This means that a program can choose
how to encode all Unicode characters into char: UTF-7, UTF-8, etc. I would
like to have a solution similar to streams: imbue(). With it, you could
internally convert every argv[x] using imbue(y) applied to a stringstream,
where the caller provides the facet y. The caller could also choose the
target character capabilities by providing a basic_string<> specialization.

On the other hand, such a conversion can also be performed by the user. The
parameters sent to main() are char* or wchar_t*, and thus program_options
can give them back just as they are, in basic_string<char> and
basic_string<wchar_t>. The client can use his own facet to imbue a
basic_stringstream<> initialized with the parameter from program_options.
Or a conversion library could be used to perform the conversion, much as
lexical_cast<> does for type conversions. It is only a matter of
convenience: either a separate conversion library (no support for encoding
in program_options) or imbue(), performing the conversion inside
program_options. However, the conversion should not be implemented for
program_options only, which is why I suggested an existing interface:
facets.
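
A facet-based conversion on the client side could look roughly like the
sketch below. It is only an illustration: the function name widen is
hypothetical, and it calls the std::codecvt facet of a caller-supplied
locale directly rather than going through a stringstream, because in the
standard library it is the file stream buffers that consult codecvt. It
also assumes that the encoding never produces more than one wchar_t per
input byte.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a narrow parameter to a wide string using
// the codecvt facet found in the locale supplied by the caller.
std::wstring widen(const std::string& in,
                   const std::locale& loc = std::locale())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    // Assumption: at most one wchar_t is produced per input byte.
    std::wstring out(in.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r == std::codecvt_base::noconv) {
        // The facet reports that no conversion is needed: copy as is.
        return std::wstring(in.begin(), in.end());
    }
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("parameter cannot be converted");

    out.resize(to_next - &out[0]);
    return out;
}
```

The caller picks the encoding by passing a locale that carries a suitable
facet, for example a UTF-8 one; with the default argument the global locale
is used.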

> > Or did I miss something? Is something like this part of
> > boost already?
>
> Nope :-( Even UTF-8 encoder is not in boost yet.

You can find some converting facets for UTF-8, ready for imbue() to a
stream, in the files section on yahoo. Unfortunately they are either not
finished or not reviewed...

Ferda

>
> - Volodya
