From: Erik Wien (wien_at_[hidden])
Date: 2004-10-19 19:50:35


Robert Ramey wrote:
> Basically my reservations about the utility of a unicode library stem from
> the following:
>
> a) the standard library has std::basic_string<T> where T is any type:
> char, wchar_t or whatever.

Yes. The problem with Unicode is that a character cannot really be
represented as an atomic value. In extreme cases a single glyph is made up
of three (or even more) 32-bit code units in UTF-32, so defining a good T
is nigh on impossible.
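
To make that concrete, here is a minimal sketch, assuming a C++11 or later
compiler (for char32_t and std::u32string). The helper names are hypothetical,
and is_combining_mark only knows about one block of combining marks; it is an
illustration, not a real Unicode segmentation algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Rough, hypothetical check: covers only the Combining Diacritical Marks
// block (U+0300..U+036F), not every combining character in Unicode.
inline bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Counts code points that are not combining marks, as a crude stand-in
// for "characters as the reader sees them".
inline std::size_t count_base_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_mark(c)) ++n;
    }
    return n;
}

// 'q' + COMBINING DOT ABOVE + COMBINING DOT BELOW: one visible character,
// three UTF-32 code units.
inline std::u32string q_with_dots() {
    return U"q\u0307\u0323";
}

// Usage: the string holds three elements but only one base character.
inline void demo_combining() {
    const std::u32string s = q_with_dots();
    assert(s.size() == 3);
    assert(count_base_characters(s) == 1);
}
```

Even with T = char32_t, size() reports 3 for what a reader sees as a single
character, which is why no choice of T makes one element equal one character.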

> b) all algorithms that use std::string are (or should be) applicable to
> std::basic_string<T> regardless of the actual type of T (more or less)
> c) character encodings can be classified into two types - single element
> types like unicode (UCS-2, UCS-4) and ascii, and multi element types like
> JIS, and others.

As I said, Unicode is not fixed width at the level of what a user perceives
as a character, not in any encoding scheme. Therefore it is very difficult
to teach the basic_string class to handle Unicode strings correctly.
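
A small sketch of the mismatch, assuming the std::string holds well-formed
UTF-8. The function name utf8_code_points is hypothetical, and the bytes are
written as escapes so the example does not depend on the source file encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx). Assumes the input is well-formed UTF-8.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Usage: "naive" with U+00EF (i with diaeresis), encoded as the two
// bytes C3 AF.
inline void demo_utf8_length() {
    const std::string s = "na\xC3\xAF" "ve";
    assert(s.size() == 6);                // basic_string counts code units
    assert(utf8_code_points(s) == 5);     // five code points
}
```

size(), operator[] and substr() all work in code units, so they can report the
wrong length or cut a multi-byte sequence in half.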

> d) there exist ansi functions which translate strings from one type to an
> other based on information in the current locale. This information is
> dependent on the particular encoding.
> e) There is nothing particularly special about unicode in this scheme. It's
> just one more encoding scheme among many. Therefore making a special
> unicode library would be unnecessarily specific. Any efforts so spent would
> be better invested in generic encoding/decoding algorithms and/or setting up
> locale facets for specific encodings UTF-8, UTF-16, etc.

The reason for focusing on Unicode is that it has become the de facto
standard for character representation. It is supported by most OSes and many
programming languages, and this is not likely to change.

As for other encoding schemes, I actually had support for other encodings
(like UCS, Shift JIS etc.) in the back of my mind when I wrote the
implementation I described earlier. That is why the string class is called
encoded_string, and not unicode_string. If the interface of the
encoding_traits class is made general enough, it should be a piece of cake
to add support for additional encoding schemes at a later date.
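
As a rough illustration of that idea, here is one possible shape for such a
design. Only the names encoded_string and encoding_traits come from the
description above; the tag types, the sequence_length member and the counting
functions are assumptions made for this sketch, not the actual interface. It
assumes C++11 or later.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical tag types, one per encoding scheme.
struct utf8_tag {};
struct utf32_tag {};

// Hypothetical traits: each encoding says what a code unit is and how many
// units the sequence starting at a given lead unit occupies.
template <typename Encoding> struct encoding_traits;

template <> struct encoding_traits<utf32_tag> {
    using code_unit = char32_t;
    static std::size_t sequence_length(code_unit) { return 1; }
};

template <> struct encoding_traits<utf8_tag> {
    using code_unit = char;
    static std::size_t sequence_length(code_unit lead) {
        const unsigned char b = static_cast<unsigned char>(lead);
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 1;  // invalid lead byte: step over it (sketch only)
    }
};

// The string class is written once against the traits interface.
template <typename Encoding>
class encoded_string {
public:
    using traits = encoding_traits<Encoding>;
    using storage = std::basic_string<typename traits::code_unit>;

    explicit encoded_string(storage units) : units_(std::move(units)) {}

    std::size_t code_unit_count() const { return units_.size(); }

    // Walks the string one encoded sequence at a time.
    std::size_t code_point_count() const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < units_.size();
             i += traits::sequence_length(units_[i])) {
            ++n;
        }
        return n;
    }

private:
    storage units_;
};

// Usage: EURO SIGN (bytes E2 82 AC) followed by '1'.
inline void demo_encoded_string() {
    const encoded_string<utf8_tag> s(std::string("\xE2\x82\xAC" "1"));
    assert(s.code_unit_count() == 4);
    assert(s.code_point_count() == 2);
}
```

Under these assumptions, supporting another scheme such as Shift JIS would
mean adding one more encoding_traits specialization and leaving
encoded_string untouched.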

