From: Ferdinand Prantl (ferdinand.prantl_at_[hidden])
Date: 2004-04-06 11:57:04
> From: Vladimir Prus [mailto:ghost_at_[hidden]]
> > UCS-2 held all characters in Unicode 1.1. There was a need for more
> > unique numbers and UCS-4 was introduced in Unicode 2.0.
> > there is no 4-byte character specialization for
> basic_string<> in STL yet.
> Or, to be exact, there's no agreement if wchar_t should be
> 32-bit or 16-bit.
> Linux (or gcc specifically) uses 32 bits, and Windows 16,
> which means wstring only suitable for UCS-2.
Wow, I did not know that about gcc, what a smart compiler... ;-)
MSVC uses 16 bits mostly because the Win32 API is ANSI/UCS-2, so that usage is natural there.
> Besides, UCS-2
> has a mechanism to represent characters outside of 16-bit
> space with two elements, which, I suspect, won't work if
> wchar_t is 16 bit.
Actually, it was not UCS-2 but another variable-sized character encoding that was introduced in Unicode 2.0 - UTF-16. It serves 16-bit characters much as UTF-8 serves 8-bit ones. By using numbers from the unused (and forbidden) range it can encode a character in more than one element and thus break the 16-bit limit. It still works with wchar_t if only UCS-2 characters are stored in it, because those special values are forbidden in UCS-2.
However, UTF-16 brings the inconvenience of variable-sized characters, and those who used UCS-2 and "enjoyed" the fixed character size would have to switch to UCS-4 to carry on... Not all applications need it; indeed, UCS-2 is often sufficient.
> > Generally speaking UTF encodings (e.g. UTF-8 or UTF-16) are not
> > suitable for the fast in-memory usage because some operations, for
> > example at(int) and length() have not O(1) but O(n) to give
> a result.
> I believe UTF-8 is popular partly because such operations are
> believed to be rare.
Really? string.length()? You must be joking... :-)
UTF-8 is popular because most texts contain far more ASCII characters than accented ones, so storing them in this encoding saves space. Together with UTF-7 it was also among the first Unicode transformation encodings, and thus the most widely supported.
It can also be convenient for an application to perform no conversion at all and work with text in its native encoding - a string-based interface working with UTF-8 can help here. Not for all applications, however. For example, XML parsers tend to work in UTF-16 internally, because UCS-4 eats too much memory, and so does UTF-8 for East Asian texts. I also feel that applications which parse texts and extract string parameters for use in data structures or as user input usually process them as fixed-size strings, because the string operations are easier.
> > They (UTF,
> > T=Transformation) are better for storing texts as they save
> place by
> > using variable number of bytes for a character. However UCS (e.g.
> > UCS-2 or
> > UCS-4) are fast for memory operations because they use
> fixed character
> > size.
> What about representing values with two 16-bit values (that's
> what I've mentioned above)?
You mean UTF-16. Yes, that is also a possibility, but since wchar_t is not fixed in size across platforms, you would have to provide basic_string<char16_t>...
I see it much the same as your UTF-8 option: both encodings use variable-sized characters and are therefore incompatible with the shared implementation of basic_string<> for char and wchar_t.
Moreover, if basic_string<char> from the API produced a UTF-8 string, it would only be consistent if basic_string<wchar_t> produced UTF-16 and not UCS-2. To say nothing of supporting 4-byte wchar_t and UTF-32 :-)
> (BTW, for reference to those interested,
> talks about different encodings).
> > That is why I would not like to use basic_string<utf8char>
> in memory,
> > rather basic_string<ucs4char> instead, but I would not
> generalize it
> > for all possible applications.
> In case of program options, I suspect that everything will
> work for UTF-8 strings. IOW, the library does not care that
> any given 'char' might be part of some Unicode character.
> That's why it's attractive for me.
It depends on how basic_string<> is used inside and outside of program_options. The implementation of basic_string<> is designed for fixed-size characters. It does no searching, iterating or length counting with regard to the character size, which in UTF-8 can vary from 1 to 4 bytes (up to 6 in the original definition).
For example, string.find() would produce unexpected results if the haystack held UTF-8 and the needle came from an ANSI encoding, where all 255 byte values are allowed. Not only the library must be aware of this, but also the users: basic_string<char>::size() then returns the number of bytes in a string, not the number of characters.
That is why I do not like using basic_string<> for variable-sized characters; basic_string<char> would not behave consistently across encodings. As long as it is used only for different ANSI encodings, the one possible misunderstanding is that two different characters from two local alphabets can share the same code above ASCII. For the other methods the situation is similar.
> > You can expect initialization from (const char * argv ) on all
> > platforms or (const wchar_t * argv ) on Windows in UCS-2.
> With the
> > basic_string<> you already have support for the parameters from the
> > current locale (char) and for parameters in UCS-2.
> What do you mean by 'parameters from the current locale'?
If you set LC_CTYPE to something like "cs_CZ.iso8859-2" on UN*Xes, or choose "Czech" in the Windows Control Panel, the command line shell (well, not only it) will start to accept and deliver in (char * argv) characters from the local alphabet (here Czech, for example). You would have to convert such a string into UTF-8 if you wanted an UTF-8 interface. That means you would have to do more than basic_string<char>(argv[x])...
> I am not sure that ctype::widen is required to care about
> user-selected character encoding.
> Nor do I think it's required from the default codecvt facet.
AFAIK it must care; the method widen() needs the locale to provide the extension from char to the templated char_type. The STL in MSVC correctly uses mbtowc() to perform the conversion from the local alphabet to UCS-2. I hope that no one simply casts to wchar_t, which is reliable only for 7-bit characters. Anyway, it is always possible to write one's own converting facet and force its use for widen() with imbue().
> If my locale on Linux is ru_RU.KOI8-R I don't think standard
> requires codecvt<wchar_t, char> facet in default instance of
> 'locale' to do meaningfull conversion from KOI8-R into unicode.
Hmm, I am not sure about the ISO C++ definition. If the standard implementation were something like "return (wchar_t) x", then it would be useless for anything except 7-bit ASCII. MSVC works with the locale, and hopefully that is what the standard requires.
Yet the shell still delivers the variable (char * argv) to you in ru_RU.KOI8-R.
> > If we take an option to read parameters into
> basic_string<wchar_t> or
> > basic_string<ucs4char_t>, where the character size or
> encoding is not
> > the same as the native encoding on the command line, there is an
> > affinity to streams. Some shells allow usage of UTF-8 encoded
> > parameters or, generally, usage of characters out of the current
> > locale. It means, that a program can choose the way, how to
> encode all
> > characters from Unicode to char. UTF-7/8, etc. I would like
> to have a solution similar to streams:
> > imbue(). Having this, you could convert internally every
> argv[x] using
> > imbue(y) applied to a stringstream, where the facet y
> provides the caller.
> That's right. There should be some mechanism to convert from
> current 8-bit encoding and source code encoding into unicode.
> At least at linux there's a mechanism to do the first thing,
> but I'm not aware about Windows.
There are functions in stdlib.h which produce a wide character or a string of wide characters:
int mbtowc(wchar_t *pwc, const char *s, size_t n);
size_t mbstowcs(wchar_t *pwcs, const char *s, size_t n);
These functions work with the locales supported by the operating system, so you can always get a wchar_t from a char. They are also what the default facet in the STL should use.
> > On the other hand, such a conversion can be performed also
> by a user.
> > The parameters sent to main() are char* or wchar_t* and thus
> > program_options can give them back just as they are in
> > basic_string<char> and basic_string<wchar_t>. The client
> can use his
> > facet to imbue a basic_stringstream<> initialized with the
> parameter from program_options.
> > Or a conversion library could be used to perform the conversion,
> > something like lexical_cast<> does for type converions. It
> is a matter
> > of convenience only - separate converions library (no support for
> > encoding in
> > program_options) or imbue(), performing the conversion inside the
> > program_options. However, the conversion should not be
> implemented for
> > program_options only, that is why I suggested an existing
> interface -
> > facets.
> I agree with basing the mechanism on facets. OTOH, it's
> really can be made orthogonal to program options. So
> initially program_options can support only ascii (strict
> 7-bit) and unicode, for which conversion is trivial.
Hmm, if it were orthogonal to program options, like facets are to streams, then there would be no need to declare support for particular encodings. char * and basic_string<char> are simply in the local encoding, and wchar_t * and basic_string<wchar_t> are wide according to the platform's understanding of "wideness" (Unicode).
It would also support the input which program_options can expect from main(): char * in the current locale encoding. If someone wants something else, he can use a facet and streams to convert it (which could even be hidden in program_options with imbue(), but not necessarily).
I think the point is not to support UTF-8 or UCS-2, but to support encodings whether explicitly declared or not. I would not implicitly return UTF-8 or UTF-16 basic_string<>s, not only because the STL handles their chars implicitly according to the current locale, but also because basic_string<char>(argv[x]) does not behave this way. Keeping the encoding independent of program_options, with some external conversion that could possibly be imbued into program_options, makes the library thinner and more focused on its problem: parsing the command line. Just like streams, which read characters from input.
Those who say "we need full Unicode, UTF-8 or UCS-4" have the same option as when working with streams - imbue or convert. Otherwise they get current-locale char from istream or wchar_t from wistream.
> > You can find some converting facets for UTF-8 ready for
> > imbue() to a
> > stream in the files section on yahoo. Unfortunately not finished or
> > not reviewed...
> Yea, I can find some facets, including one written by myself
> ;-( And yea, unfortunately they are not in Boost yet.
Great! You can probably prepare it for review, if your time allows... :-)
> - Volodya
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk