From: Ronald Garcia (garcia_at_[hidden])
Date: 2001-11-01 13:12:25


Sorry, I posted this previously under a very poor subject line (I recently
switched to digest mode). I'm reposting it so that interested parties
notice it.

Vladimir Prus wrote:
> Ronald Garcia wrote:
> > I have taken a look at the above message and the code that it refers
> > to. I can't quite grasp what the code is doing,
> > but according to descriptions it appears to provide two codecvt
> > facets: one converting from a utf-8 external (file) representation to
> > ucs2 internally (memory), and back, while the other converts from ucs2
> > externally to utf8 internally. I may be wrong and so the author may
> > wish to correct me here.

> Indeed, I wish to correct. The other codecvt converts from ucs2
> externally to ucs2 internally -- i.e. does no conversion.
Thanks for the clarification. I did forget to mention that the
codecvt I wrote converts from utf8-external to ucs4-internal.
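
To make the utf8-external to ucs4-internal direction concrete, here is a
minimal sketch of the decoding step. It is not the facet code discussed in
this thread: the name utf8_to_ucs4 and its interface are hypothetical, it
works on whole strings rather than through the codecvt do_in/do_out
interface, and it does not reject overlong or surrogate encodings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the posted facet): decodes UTF-8 bytes into
// code points stored in InternT. Returns false on truncated or malformed
// input. Overlong forms and surrogates are NOT rejected in this sketch.
template <typename InternT>
bool utf8_to_ucs4(const std::string& in, std::vector<InternT>& out)
{
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;              // invalid lead byte

        if (i + extra >= in.size()) return false;  // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(static_cast<InternT>(cp));
        i += extra + 1;
    }
    return true;
}
```

Leaving InternT as a template parameter mirrors the point made further down:
the caller picks wchar_t where it is wide enough and some other integer type
where it is not.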

> As far as I can tell, the C++ standard does not require the default
> conversion facet to use any particular encoding, and under bcc
> external files are considered to be something called a "multibyte
> string". I have no idea what it is, but it does not seem to be ucs2
> at all.

I can definitely see the need for ucs-2 to ucs-2 codecvt facets. In
fact, there could even be a need for facets that differ in endianness
of the external format. Dietmar mentioned a need for this (as well as
auto-detection of endianness in XML files, which is another can of
worms) to parse XML.
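
As a rough illustration of what endianness-aware ucs-2 input involves, the
sketch below reads 16-bit code units from raw bytes and picks the byte order
from a leading byte order mark. The names detect_ucs2_order and read_ucs2 are
hypothetical, falling back to big-endian when no mark is present is an
assumption, and the fuller XML auto-detection (which also looks at the
"<?xml" bytes) is deliberately left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum ByteOrder { big_endian, little_endian };

// Hypothetical helper: inspects the first two bytes for a byte order mark.
// bom_len is set to the number of bytes to skip (0 or 2).
// Assumption: input without a mark is treated as big-endian.
inline ByteOrder detect_ucs2_order(const std::string& bytes, std::size_t& bom_len)
{
    bom_len = 0;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) { bom_len = 2; return big_endian; }
        if (b0 == 0xFF && b1 == 0xFE) { bom_len = 2; return little_endian; }
    }
    return big_endian;
}

// Hypothetical helper: assembles 16-bit code units in the detected order.
// A trailing odd byte is ignored.
inline std::vector<unsigned short> read_ucs2(const std::string& bytes)
{
    std::size_t pos = 0;
    const ByteOrder order = detect_ucs2_order(bytes, pos);
    std::vector<unsigned short> units;
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned char first  = static_cast<unsigned char>(bytes[pos]);
        const unsigned char second = static_cast<unsigned char>(bytes[pos + 1]);
        const unsigned hi = (order == big_endian) ? first : second;
        const unsigned lo = (order == big_endian) ? second : first;
        units.push_back(static_cast<unsigned short>((hi << 8) | lo));
    }
    return units;
}
```

In a facet, the detected order would presumably live in the conversion state
or in the facet object itself, so that two facets differing only in external
byte order could share the rest of the code.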

> > MA> is there a reason not to introduce a fixed typedef
> > MA> boost::ucs4_t, as a uint32_t? then there could be a version
> > MA> of this that would work on any platform. as you know, on
> > MA> win32 (and elsewhere?) wchar_t is 16bits, so you are currently
> > MA> forcing platform-specific specialization.
> >
> > I chose to implement the facet as a template to avoid making solid
> > decisions about the types used to represent utf-8 elements and
> > ucs-4 elements. It makes sense that compilers with large enough
> > wchar_t should use std::codecvt<wchar_t,char,std::mbstate_t>,
> > wofstream, and wifstream for file streaming, but you
> > are correct that for windows one would have to provide
> > specializations. I'm pretty new to this area of the C++ library and
> > so I'm trying to get a feel for what works best.

> Correction again -- wchar_t is 16 bit for *some* windows compilers.
Thanks, I meant to say "VC++", not "windows".

> But in principle, ability to use any type for internal character
> would be desirable. (and it costs nothing to have it)
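
One way to see why the internal type is worth leaving open: whether a given
type can hold every ucs-4 value differs between compilers. The check below is
only a sketch. The name can_hold_ucs4 is hypothetical, and it assumes code
points no larger than 0x10FFFF, which need 21 value bits.

```cpp
#include <limits>

// Hypothetical check: true if CharT has at least 21 value bits, enough for
// code points up to 0x10FFFF (an assumption about the range needed).
// numeric_limits<T>::digits excludes the sign bit for signed types.
template <typename CharT>
bool can_hold_ucs4()
{
    return std::numeric_limits<CharT>::digits >= 21;
}
```

With a 16-bit wchar_t, can_hold_ucs4<wchar_t>() would be false and a wider
type such as unsigned long would have to serve as the internal character.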

> > MA> even on systems where wchar_t is 32bits, there are no
> > MA> guarantees that the implementation character set is unicode.
> > MA> even if __STDC_ISO_10646__ is defined, i'm not sure if that
> > MA> strictly guarantees that the values are comparable with cast
> > MA> ints, because it (i think) is still implementation defined
> > MA> what the signedness and endianness is of wchar_t storage, even
> > MA> if the code value space is unicode.
> >
> > I'm not sure what you are referring to here. Could you run that by me
> > again?

> I also think that performance aspects of ucs2 codecvt should be considered.
Could you go into more detail about these performance aspects?

