From: Robert Ramey (ramey_at_[hidden])
Date: 2004-10-19 19:08:23


"Edward Diener" <eddielee_at_[hidden]> wrote in message
news:cl3umm$pkp$1_at_sea.gmane.org...
> Robert Ramey wrote:
> > "Vladimir Prus" <ghost_at_[hidden]> wrote in message
> > news:cl2d2p$7a3$1_at_sea.gmane.org...
> >> This was discussed extensively before. For example, Miro has pointed
> >> out that even plain "find" is not suitable for unicode strings
> >> because some characters can be represented with several wchar_t
> >> values.
> >>
> >> Then, there's an issue of proper collation. Given that Unicode can
> >> contain accents and various other "marks", it is not obvious that
> >> string::operator< will always do the right thing.
> >>
> >
> > My reference (Stroustrup, The C++ Programming language) shows the
> > locale class containing a function
> >
> > template<class Ch, class Tr, class A> // compare strings using this locale
> > bool operator()(const basic_string<Ch, Tr, A>&,
> >                 const basic_string<Ch, Tr, A>&) const;
> >
> > So I always presumed that there was a "unicode" locale that
> > implemented this as well as all other required information. Now that I
> > think about it I realize that it was only a presumption that I never
> > really checked. Now I wonder what facilities most libraries
> > provide for unicode facets. I know there are ANSI functions for
> > translating between multi-byte and wide character strings. I've used
> > these functions and they did what I expected them to do. I presumed
> > they worked in accordance with the currently
> > selected locale and its related facets. If the
> > basic_string<wchar_t>::operator<(...) isn't doing "the right thing"
> > wouldn't it be just a bug in the implementation of the standard
> > library rather than a candidate for a boost library?
>
> The use of 'wchar_t' is purely implementation defined as to what it means,
> other than the very little said about it in the C++ standard in relation
> to 'char'. It need have nothing to do with any of the Unicode encodings,
> or it may represent a particular Unicode encoding. This is purely up to
> the implementation. So doing the "right thing" is purely up to the
> implementer

OK, I can buy that.

> although, of course, the implementer will tell you what the wchar_t
> represents for that implementation.

OK - are there standard library implementations which use something other
than Unicode (or variants thereof) for wchar_t encodings?
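
One rough way to see what a particular implementation does with wchar_t is to ask the compiler. The following is only a sketch; the function names are invented for illustration. It relies on the predefined macro __STDC_ISO_10646__, which some implementations define to state that wchar_t values are ISO 10646 code points. An implementation that leaves the macro undefined may still use Unicode, so a "false" answer proves nothing by itself.

```cpp
#include <cassert>   // only needed for the checks in the usage note
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <cwchar>    // WCHAR_MAX

// Number of bits in one wchar_t element on this implementation.
inline std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when the implementation explicitly declares that wchar_t values
// are ISO 10646 (Unicode) code points. False means "not declared",
// not "definitely something else".
inline bool wchar_declared_iso10646() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// True when a single wchar_t is wide enough for the highest Unicode
// code point (U+10FFFF). When this is false, one character may need
// several wchar_t elements even if the encoding is Unicode.
inline bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(WCHAR_MAX) >= 0x10FFFFul;
}
```

A caller might check, for example, assert(wchar_bits() >= 8). Where wchar_t is 16 bits wide, wchar_holds_any_code_point() returns false, which is the situation in which plain element-wise algorithms can split a character.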

Basically my reservations about the utility of a unicode library stem from
the following:

a) the standard library has std::basic_string<T>, where T is any character
type: char, wchar_t or whatever.
b) all algorithms that use std::string are (or should be) applicable to
std::basic_string<T> regardless of the actual type of T (more or less).
c) character encodings can be classified into two types: single-element
types like Unicode (UCS-2, UCS-4) and ASCII, and multi-element types like
JIS and others.
d) there exist ANSI C functions which translate strings from one type to
another based on information in the current locale. This information is
dependent on the particular encoding.
e) There is nothing particularly special about Unicode in this scheme. It's
just one more encoding scheme among many. Therefore making a special
Unicode library would be unnecessarily specific. Any effort so spent would
be better invested in generic encoding/decoding algorithms and/or setting up
locale facets for specific encodings: UTF-8, UTF-16, etc.
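
To make points (d) and (e) a little more concrete, here is a minimal sketch under stated assumptions. The helper names widen and locale_less are hypothetical. The first uses std::mbstowcs, which converts according to the current C locale. The second uses std::locale::operator(), which compares two strings through the locale's collate facet. Whether either does "the right thing" for a given encoding depends entirely on the locale the implementation supplies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>    // std::mbstowcs
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Point (d): convert a multibyte string to a wide string using the
// encoding of the current C locale (LC_CTYPE). Hypothetical helper.
inline std::wstring widen(const std::string& narrow) {
    // Each wide character consumes at least one input byte, so the
    // result never needs more elements than the input has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);
    const std::size_t n =
        std::mbstowcs(buf.data(), narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for current locale");
    assert(n <= narrow.size());
    return std::wstring(buf.data(), n);
}

// Point (e): compare two strings of any character type through the
// collate facet of a locale, instead of through operator<.
// Defaults to the classic "C" locale. Hypothetical helper.
template <class Ch>
bool locale_less(const std::basic_string<Ch>& a,
                 const std::basic_string<Ch>& b,
                 const std::locale& loc = std::locale::classic()) {
    return loc(a, b);  // std::locale::operator() uses std::collate<Ch>
}
```

In the default "C" locale, widen("abc") should yield L"abc", and locale_less orders strings by plain element value. Any language-aware ordering would have to come from a different locale object passed as the third argument.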

Robert Ramey

