Subject: Re: [boost] Unicode and codecvt facets
From: Artyom (artyomtnk_at_[hidden])
Date: 2010-07-05 12:27:46


> As some may know, I am working on a Unicode library that I plan to submit to
> Boost fairly soon.

Take a look at the Boost.Locale proposal.

> The codecs in that library are based around iterators and ranges, but since
> there was some demand for support for codecvt facets I am working on adapting
> those into that form as well.
>
> Unfortunately, it seems it is only possible to subclass
> std::codecvt<char, char, mbstate_t> and std::codecvt<wchar_t, char, mbstate_t>.

Yes, these are actually the only specializations provided. Moreover,
std::codecvt<char, char, mbstate_t> is supposed to be a "noconv" facet,
i.e. one that performs no conversion at all.

> I personally don't know and understand that much about iostreams/locales, but
> I have looked quickly at libstdc++'s implementation and it doesn't seem like
> it is possible for std::locale to contain any other instance of codecvt.

You can derive from these two classes and re-implement them (as I did in
Boost.Locale).
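
A minimal sketch of that approach, assuming a hypothetical Latin-1 facet
(latin1_codecvt and decode_latin1 are illustrative names, not the Boost.Locale
implementation). Because the derived class inherits the base facet's id,
installing it in a std::locale replaces the standard wchar_t/char conversion:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical facet: converts between Latin-1 bytes and wchar_t code points.
class latin1_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit latin1_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    // External (char) -> internal (wchar_t): every byte maps to U+0000..U+00FF.
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end)
            *to_next++ = static_cast<wchar_t>(static_cast<unsigned char>(*from_next++));
        return from_next == from_end ? ok : partial;
    }

    // Internal (wchar_t) -> external (char): anything above U+00FF is an error.
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end) {
            if (static_cast<unsigned long>(*from_next) > 0xFFUL)
                return error;
            *to_next++ = static_cast<char>(static_cast<unsigned char>(*from_next++));
        }
        return from_next == from_end ? ok : partial;
    }

    // Latin-1 is stateless, so there is never a shift sequence to emit.
    result do_unshift(std::mbstate_t&, char* to, char*, char*& to_next) const override {
        to_next = to;
        return noconv;
    }

    int do_encoding() const noexcept override { return 1; }        // fixed width: 1 byte
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }
    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override {
        return static_cast<int>(
            std::min<std::size_t>(max, static_cast<std::size_t>(from_end - from)));
    }
};

// Hypothetical helper: decode Latin-1 bytes through a locale carrying the facet.
std::wstring decode_latin1(const std::string& bytes) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    std::locale loc(std::locale::classic(), new latin1_codecvt);
    const facet_type& cvt = std::use_facet<facet_type>(loc);  // finds our facet

    std::wstring out(bytes.size(), L'\0');
    if (bytes.empty())
        return out;
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;
    std::codecvt_base::result r =
        cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    assert(r == std::codecvt_base::ok);
    (void)r;
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

The same locale object could also be imbued into a std::wifstream before the
file is opened, so that the stream buffer uses the facet when reading.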

Also, I strongly recommend taking a look at locales and iostreams in the
standard library if you are working with Unicode in C++.

> What I wonder is if there is really a point to facets, then.
> std::codecvt<wchar_t, char, mbstate_t> means that the in-memory charset would
> be UTF-16 or UTF-32 (depending on the size of wchar_t) while the file would be
> UTF-8.

Not exactly. The narrow encoding may be any byte-oriented encoding, for example
Latin1 or Shift-JIS (and UTF-8 as well).

> The problem is that wchar_t is platform-dependent and not really reliable, so
> it's not really something I'd recommend to use as the in-memory representation
> to deal with Unicode.

Welcome to the broken Unicode world of C++. Yes, wchar_t is platform-dependent.
If you want to use it, you should support both encodings, UTF-16 and UTF-32
(technically wchar_t may even be 8 bits wide, but in practice there are no such
implementations).
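
As a rough sketch of what supporting both encodings means in practice, the
helper below (append_code_point is a hypothetical name used only for
illustration) stores a code point directly when wchar_t can hold it and falls
back to a UTF-16 surrogate pair otherwise:

```cpp
#include <cassert>
#include <cwchar>   // WCHAR_MAX
#include <string>

// Hypothetical helper: append one Unicode code point to a wide string,
// choosing UTF-32 or UTF-16 depending on the range of wchar_t.
void append_code_point(std::wstring& out, unsigned long cp) {
    // Precondition: a valid Unicode scalar value (no surrogates).
    assert(cp <= 0x10FFFFUL && (cp < 0xD800UL || cp > 0xDFFFUL));

    const bool wide_is_utf32 = WCHAR_MAX >= 0x10FFFF;
    if (wide_is_utf32 || cp < 0x10000UL) {
        // One code unit is enough.
        out += static_cast<wchar_t>(cp);
    } else {
        // 16-bit wchar_t: encode as a high/low surrogate pair.
        cp -= 0x10000UL;
        out += static_cast<wchar_t>(0xD800UL + (cp >> 10));
        out += static_cast<wchar_t>(0xDC00UL + (cp & 0x3FFUL));
    }
}
```

With a 32-bit wchar_t, U+1F600 occupies one element of the string; with a
16-bit wchar_t it occupies two.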

C++0x provides char16_t and char32_t to fix this defect in the standard.
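
A small illustration, assuming a compiler that supports the u"" and U""
literals: the number of code units no longer depends on the platform
(utf16_units, utf32_units and check_fixed_width_literals are illustrative
names only).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane.
std::size_t utf16_units() { return std::u16string(u"\U0001F600").size(); }
std::size_t utf32_units() { return std::u32string(U"\U0001F600").size(); }

void check_fixed_width_literals() {
    assert(utf16_units() == 2);  // always a surrogate pair in UTF-16
    assert(utf32_units() == 1);  // always a single code unit in UTF-32
}
```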

> Why do people even use utf8_codecvt_facet anyway? What's wrong with dealing
> with UTF-8 rather than maybe UTF-16 or UTF-32?

Ask Windows developers: they use wide strings because that is the only way to
work correctly with their OS.

Artyom

      


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk