From: Ronald Garcia (garcia_at_[hidden])
Date: 2001-11-01 13:12:25


Hi,

Sorry, I posted this previously under a very poor subject line (I recently
switched to digest mode). I'm reposting it so that interested parties
notice it.

Vladimir Prus wrote:
> Ronald Garcia wrote:
>
> > I have taken a look at the above message and the code that it refers
> > to. I can't quite grasp what the code is doing,
> > but according to descriptions it appears to provide two codecvt
> > facets: one converting from a utf-8 external (file) representation to
> > ucs2 internally (memory), and back, while the other converts from ucs2
> > externally to utf8 internally. I may be wrong and so the author may
> > wish to correct me here.
>
> Indeed, I wish to correct. The other codecvt converts from ucs2
> externally to ucs2 internally -- i.e. does no conversion.
Thanks for the clarification. I forgot to mention that the
codecvt I wrote converts from utf8 externally to ucs4 internally.
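
For readers unfamiliar with the conversion, here is a minimal sketch of
the utf8-to-ucs4 direction. It is not the facet discussed in this
thread: the function name utf8_to_ucs4 is hypothetical, it works on a
whole buffer rather than through the codecvt interface, it assumes a
C++11 compiler for the fixed-width integer types, and it does not
reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not the facet from this thread): decodes a complete
// UTF-8 byte string into UCS-4 values appended to `out`.
// Returns false on a stray continuation byte, an unsupported lead byte,
// or a truncated sequence. Overlong encodings are NOT rejected.
bool utf8_to_ucs4(const std::string& in, std::vector<std::uint32_t>& out)
{
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;

        if (in.size() - i <= extra) return false;  // sequence is cut short

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);           // append six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

A real codecvt facet would also have to handle a multibyte sequence
that is split across two calls, which this whole-buffer sketch avoids.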

> As far as I can tell, C++ standard does not require default
> conversion facet to use any particular encoding, and under bcc
> external files are considered to be something called "multibyte
> string". I have no idea what it is, but it does not seem to be ucs2
> at all.

I can definitely see the need for ucs-2 to ucs-2 codecvt facets. In
fact, there could even be a need for facets that differ in endianness
of the external format. Dietmar mentioned a need for this (as well as
auto-detection of endianness in XML files, which is another can of
worms) to parse XML.
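
To make the endianness point concrete, here is a rough sketch of
reading ucs-2 external data whose byte order is taken from a leading
byte order mark (U+FEFF) when one is present. The names byte_order and
decode_ucs2 are hypothetical, the code assumes a C++11 compiler, and it
only looks at the byte order mark; the fuller auto-detection that XML
parsers perform is not attempted.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class byte_order { big, little };  // hypothetical name

// Hypothetical helper: decodes external UCS-2 bytes into 16-bit units.
// A leading byte order mark selects the order and is consumed;
// without one, `fallback` is used. A trailing odd byte is ignored.
std::vector<std::uint16_t> decode_ucs2(const std::string& bytes,
                                       byte_order fallback = byte_order::big)
{
    byte_order order = fallback;
    std::size_t pos = 0;

    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF)      { order = byte_order::big;    pos = 2; }
        else if (b0 == 0xFF && b1 == 0xFE) { order = byte_order::little; pos = 2; }
    }

    std::vector<std::uint16_t> units;
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned b0 = static_cast<unsigned char>(bytes[pos]);
        const unsigned b1 = static_cast<unsigned char>(bytes[pos + 1]);
        const unsigned unit = (order == byte_order::big) ? ((b0 << 8) | b1)
                                                         : ((b1 << 8) | b0);
        units.push_back(static_cast<std::uint16_t>(unit));
    }
    return units;
}
```

Two facets that differ only in the fallback order would cover the case
where the external data carries no byte order mark at all.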

>
> > MA> is there a reason not to introduce a fixed typedef
> > MA> boost::ucs4_t, as a uint32_t? then there could be a version
> > MA> of this that would work on any platform. as you know, on
> > MA> win32 (and elsewhere?) wchar_t is 16bits, so you are currently
> > MA> forcing platform-specific specialization.
> >
> > I chose to implement the facet as a template to avoid making solid
> > decisions about the types used to represent utf-8 elements and
> > ucs-4 elements. It makes sense that compilers with large enough
> > wchar_t should use std::codecvt<wchar_t,char,std::mbstate_t>,
> > wofstream, and wifstream for file streaming, but you
> > are correct that for windows one would have to provide
> > specializations. I'm pretty new to this area of the C++ library and
> > so I'm trying to get a feel for what works best.
>
> Correction again -- wchar_t is 16 bit for *some* windows compilers.
Thanks, I meant to say "VC++", not "windows".

> But in principle, ability to use any type for internal character
> would be desirable. (and it costs nothing to have it)
Agreed.
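
As an illustration of leaving the internal character type open, here is
a small sketch of the ucs4-to-utf8 direction written as a template. It
is not the facet from this thread: the name ucs4_to_utf8 is
hypothetical, it takes a std::vector rather than going through the
codecvt interface, and it assumes a C++11 compiler and an integral
InternT.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encodes character values held in any integral
// type InternT (wchar_t, std::uint32_t, ...) as UTF-8 appended to `out`.
// Returns false if a value is above 0x10FFFF. Surrogate values are not
// treated specially.
template <class InternT>
bool ucs4_to_utf8(const std::vector<InternT>& in, std::string& out)
{
    for (InternT ch : in) {
        const std::uint32_t cp = static_cast<std::uint32_t>(ch);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x110000) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            return false;  // outside the range UTF-8 can encode in 4 bytes
        }
    }
    return true;
}
```

The same code then serves a 32-bit wchar_t, a 16-bit wchar_t (limited
to values below 0x10000), or a fixed 32-bit integer type, with no
platform-specific specialization.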

>
> > MA> even on systems where wchar_t is 32bits, there are no
> > MA> guarantees that the implementation character set is unicode.
> > MA> even if __STDC_ISO_10646__ is defined, i'm not sure if that
> > MA> strictly guarantees that the values are comparable with cast
> > MA> ints, because it (i think) is still implementation defined
> > MA> what the signedness and endianness is of wchar_t storage, even
> > MA> if the code value space is unicode.
> >
> > I'm not sure what you are referring to here. Could you run that by me
> > again?
>
> I also think that performance aspects of ucs2 codecvt should be considered.
>
Could you go into more detail about these performance aspects?

Ron

