From: Edward Diener (eddielee_at_[hidden])
Date: 2004-10-19 20:38:10


Robert Ramey wrote:
> "Edward Diener" <eddielee_at_[hidden]> wrote in message
> news:cl3umm$pkp$1_at_sea.gmane.org...
>> Robert Ramey wrote:
>>> "Vladimir Prus" <ghost_at_[hidden]> wrote in message
>>> news:cl2d2p$7a3$1_at_sea.gmane.org...
>>>> This was discussed extensively before. For example, Miro has
>>>> pointed out that even plain "find" is not suitable for unicode
>>>> strings because some characters can be represented with several
>>>> wchar_t values.
>>>>
>>>> Then, there's an issue of proper collation. Given that Unicode can
>>>> contain accents and various other "marks", it is not obvious that
>>>> string::operator< will always do the right thing.
>>>>
>>>
>>> My reference (Stroustrup, The C++ Programming language) shows the
>>> locale class containing a function
>>>
>>> template<class Ch, class Tr, class A> // compare strings using this locale
>>> bool operator()(const basic_string<Ch, Tr, A>&,
>>>                 const basic_string<Ch, Tr, A>&) const;
>>>
>>> So I always presumed that there was a "unicode" locale that
>>> implemented this as well as all other required information. Now that I
>>> think about it I realize that it was only a presumption that I never
>>> really checked. Now I wonder what facilities most libraries
>>> provide for unicode facets. I know there are ansi functions for
>>> translating between multi-byte and wide character strings. I've
>>> used these functions and they did what I expected them to do. I
>>> presumed they worked in accordance with the currently
>>> selected locale and its related facets. If the
>>> basic_string<wchar_t>::operator<(...) isn't doing "the right thing"
>>> wouldn't it be just a bug in the implementation of the standard
>>> library rather than a candidate for a boost library?
>>
>> The use of 'wchar_t' is purely implementation defined as what it
>> means, other than the very little said about it in the C++ standard
>> in relation to 'char'. It need have nothing to do with any of the
>> Unicode encodings, or it may represent a particular Unicode encoding.
>> This is purely up to the implementation. So doing the "right thing"
>> is purely up to the implementer
>
> OK I can buy that
>
>> although, of course, the implementer will tell you what the wchar_t
>> represents for that implementation.
>
> OK - are there standard library implementations which use other than
> unicode (or variants thereof) for wchar_t encodings?

I do not know whether there are. My point is that wchar_t is not a Unicode
character by definition. That is why I believe that new character types,
either as future built-in types or as C++ classes, are needed to support
Unicode encodings. As soon as one says that the meaning of wchar_t should
change depending on the locale/facet in order to support a Unicode
encoding, one is doing the wrong thing, because wchar_t is only C++'s idea
of an implementation-defined wide character.
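
To make the "implementation defined" point concrete, here is a minimal
sketch that reports what wchar_t looks like on a given implementation. The
function names are hypothetical, chosen only for this illustration. The
width is commonly 16 bits on Windows and 32 bits on Linux, but the standard
fixes neither, and the width alone says nothing about the encoding.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Number of bits in one wchar_t on this implementation.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if a single wchar_t is wide enough to hold any Unicode code point
// (up to U+10FFFF). This checks capacity only; it does not tell us
// whether the implementation actually stores Unicode in wchar_t.
constexpr bool wchar_can_hold_any_code_point() {
    return static_cast<unsigned long long>(
               std::numeric_limits<wchar_t>::max()) >= 0x10FFFFull;
}
```

Printing both values from main() on two different platforms is usually
enough to see that code assuming one particular wchar_t is not portable.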

>
> Basically my reservations about the utility of a unicode library stem
> from the following:
>
> a) the standard library has std::basic_string<T> where T is any type
> char, wchar_t or whatever.

Agreed

> b) all algorithms that use std::string are (or should be) applicable
> to std::basic_string<T> regardless of the actual type of T (more or
> less)

The standard algorithms work using iterators, and treat std::basic_string<T>
as a container. That is fine but it doesn't produce results that treat a
string as a meaningful collection of characters which represent a character
encoding. For that the std::basic_string<T> member functions are needed and
should be used.
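
As a rough illustration of the first half of that point, the sketch below
shows a generic algorithm counting the elements of a std::string, not its
characters. It assumes the string holds well-formed UTF-8, which is an
assumption of this example only; both function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Counts elements the way a generic algorithm sees them: one per char.
inline std::size_t count_elements(const std::string& s) {
    return static_cast<std::size_t>(std::distance(s.begin(), s.end()));
}

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumption: s is well-formed UTF-8. Nothing in std::string records that.
inline std::size_t count_utf8_code_points(const std::string& s) {
    return static_cast<std::size_t>(std::count_if(
        s.begin(), s.end(), [](char c) {
            return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
        }));
}
```

For the five bytes "caf\xC3\xA9" the first function yields 5 and the second
yields 4, because the final character occupies two elements. The container
view is correct as far as it goes; it simply knows nothing about the
encoding.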

> c) character encodings can be classified into two types - single
> element types like unicode (UCS-2, UCS-4) and ascii, and multi
> element types like JIS, and others.

OK, at least in the present.

> d) there exist ansi functions which translate strings from one type
> to another based on information in the current locale. This
> information is dependent on the particular encoding.

OK, I see where you are going and you may be right. You want to continue
using 'char' and 'wchar_t' but use only the locale to define their encoding
and functionality. That sounds possible except for one issue I can think
of: 'char' and 'wchar_t' do not cover all the popular element sizes for
fixed size encodings. Right now we have 8, 16, and 32 bits. Perhaps we
need a new C++ basic character type, maybe 'lwchar_t', and the rule
sizeof(char) <= sizeof(wchar_t) <= sizeof(lwchar_t). This would give us a
better fighting chance to represent all the most popular encodings, at
least as far as fixed size encodings are concerned.
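
A sketch of what a third element width could look like in code. 'lwchar_t'
is only the name floated above, not a real type; here it is aliased to
char32_t, which requires a C++11 compiler and postdates this message. The
other names are hypothetical as well.

```cpp
#include <climits>
#include <string>

// Hypothetical alias: char32_t stands in for the proposed 'lwchar_t'.
using lwchar_t = char32_t;

// One string type per element width. The encoding of each is still a
// separate question that the element type alone does not answer.
using narrow_string = std::basic_string<char>;
using wide_string   = std::basic_string<wchar_t>;
using lwide_string  = std::basic_string<lwchar_t>;

// The proposed size rule, evaluated for this implementation.
constexpr bool size_rule_holds() {
    return sizeof(char) <= sizeof(wchar_t) &&
           sizeof(wchar_t) <= sizeof(lwchar_t);
}

// With 32-bit elements, one element per code point is possible, so
// size() matches the number of code points stored.
inline lwide_string one_per_element() {
    return lwide_string(U"caf\u00E9");
}
```

Whether the size rule holds is a property of the implementation, which is
why it is written as a check and not as a guarantee.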

Along with your suggestion, thought would then have to be given to what a
std::basic_string<T> really means beyond the use of narrow characters. Right
now it is implementation defined, but in the future how and where do we
specify character encodings via locales?

> e) There is nothing particularly special about unicode in this
> scheme. It's just one more encoding scheme among many. Therefore
> making a special unicode library would be unnecessarily specific.
> Any efforts so spent would be better invested in generic
> encoding/decoding algorithms and/or setting up locale facets for
> specific encodings UTF-8, UTF-16, etc.

You have made a good point but see above. In your scheme, I would still want
to have the C++ standard library have all current functionality regarding
characters and strings be templated on built-in character types everywhere.
There would probably need to be a review of current and future functionality
to determine in which situations locales, with their character encoding
information, need to be passed along with a string.
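
One place where the standard library already lets a locale travel with the
strings is sorting. std::locale::operator(), the function quoted from
Stroustrup earlier in the thread, compares two strings through the locale's
std::collate facet, so a locale object can be handed to std::sort as the
comparator. The helper name sorted_with is hypothetical, and the sketch
uses only the classic "C" locale because the set of named locales
available varies by platform.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Sorts strings using the collation rules of the locale passed alongside
// them. The locale object itself is the comparator.
inline std::vector<std::string> sorted_with(std::vector<std::string> words,
                                            const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

Calling sorted_with(words, std::locale::classic()) gives plain
lexicographic order. Passing a different locale changes the ordering
without touching the strings, which is the kind of situation where the
locale has to be passed along explicitly.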


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk