From: Vladimir Prus (ghost_at_[hidden])
Date: 2001-11-02 04:30:42


Ronald Garcia wrote:

> Vladimir Prus wrote:
> > Ronald Garcia wrote:
> > > I have taken a look at the above message and the code that it refers
> > > to. I can't quite grasp what the code is doing,
> > > but according to descriptions it appears to provide two codecvt
> > > facets: one converting from a utf-8 external (file) representation to
> > > ucs2 internally (memory), and back, while the other converts from ucs2
> > > externally to utf8 internally. I may be wrong and so the author may
> > > wish to correct me here.
> >
> > Indeed, I wish to correct. The other codecvt converts from ucs2
> > externally to ucs2 internally -- i.e. does no conversion.
>
> Thanks for the clarification. I did forget to mention that the
> codecvt I wrote converts from utf8-external to ucs4-internal.
Looking at the code, I see this is the case with my codecvt as well :-)

> > As far as I can tell, C++ standard does not require default
> > conversion facet to use any particular encoding, and under bcc
> > external files are considered to be something called "multibyte
> > string". I have no idea what it is, but it does not seem to be ucs2
> > at all.
>
> I can definitely see the need for ucs-2 to ucs-2 codecvt facets. In
> fact, there could even be a need for facets that differ in endianness
> of the external format. Dietmar mentioned a need for this (as well as
> auto-detection of endianness in XML files, which is another can of
> worms) to parse XML.

Auto-detection of endianness must be considered, of course. A straightforward
solution would be to have codecvt facets for all cases. You'd then peek at
enough input to guess the encoding, rewind the stream, and change the codecvt.
However, this requires logic beyond wi(f)stream. If one just wants to read a
ucs2 file, mbstate_t can be used -- initially, it's in a "just started" state.
Upon reading the first byte-order mark (0xFEFF, which shows up as 0xFFFE when
the byte order is swapped), endianness is detected and the mbstate is changed
accordingly. What would you say about these options?
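
To make the second option concrete, here is a minimal sketch of the state
machine it implies. It is not taken from either codecvt under discussion:
Ucs2Order and decode_ucs2 are hypothetical names, the state is an explicit
enum rather than a real mbstate_t (whose contents are implementation-defined),
and input without a byte-order mark is assumed to be big-endian.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for what a codecvt would remember in mbstate_t.
enum class Ucs2Order { Unknown, Big, Little };

// Decodes raw UCS-2 bytes into 16-bit code units.
// `order` is the conversion state: it starts as Unknown ("just started")
// and is fixed once the first pair of bytes has been examined.
// A trailing odd byte is ignored.
std::vector<std::uint16_t> decode_ucs2(const std::string& bytes,
                                       Ucs2Order& order)
{
    std::vector<std::uint16_t> out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned b0 = static_cast<unsigned char>(bytes[i]);
        const unsigned b1 = static_cast<unsigned char>(bytes[i + 1]);
        if (order == Ucs2Order::Unknown) {
            // Bytes FE FF are U+FEFF in big-endian order; FF FE is the
            // same mark written little-endian. The mark itself is dropped.
            if (b0 == 0xFE && b1 == 0xFF) { order = Ucs2Order::Big; continue; }
            if (b0 == 0xFF && b1 == 0xFE) { order = Ucs2Order::Little; continue; }
            order = Ucs2Order::Big;  // assumption: no mark means big-endian
        }
        const unsigned unit = (order == Ucs2Order::Big) ? (b0 << 8 | b1)
                                                        : (b1 << 8 | b0);
        out.push_back(static_cast<std::uint16_t>(unit));
    }
    return out;
}
```

Because the state is passed in by reference, a caller can feed the input in
several chunks and the byte order detected in the first chunk carries over to
the rest, which is the role mbstate_t would play inside a facet.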

> > But in principle, ability to use any type for internal character
> > would be desirable. (and it costs nothing to have it)
>
> Agreed.

Also, I'm not sure whether the ability to perform ucsX <-> utf-8 conversion
outside of streams is needed. If so, either your recoding iterators or a
stringstream may be employed.
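
For comparison, the conversion itself is small enough to write as a free
function. The following is only a sketch of the ucs4 -> utf-8 direction and is
independent of both codecvts discussed here; append_utf8 and to_utf8 are
hypothetical names, and the sketch assumes code points are limited to
0x10FFFF with surrogates rejected.

```cpp
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
// Returns false and appends nothing for surrogates (0xD800..0xDFFF)
// and for values above 0x10FFFF.
bool append_utf8(char32_t cp, std::string& out)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Converts a whole string, skipping code points that cannot be encoded.
std::string to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in)
        append_utf8(cp, out);
    return out;
}
```

Whether invalid input should be skipped, replaced, or reported is a design
choice; skipping is used here only to keep the sketch short.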

> > I also think that performance aspects of ucs2 codecvt should be
> > considered.
>
> Could you go into more detail about these performance aspects?

I mean that some profiling should be done -- the stream buffer layer is quite
fast in itself, and it would be good if the performance of the codecvt were
stated in the docs. Not that I noticed any apparent problems with your code,
but who knows what a compiler will do?
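
As a starting point for such measurements, a small timing helper might look
like the sketch below. average_seconds is a hypothetical name, and the
callable passed to it stands for whatever conversion is being measured, for
example reading a fixed buffer through a stream with the facet imbued.

```cpp
#include <chrono>
#include <cstddef>

// Calls `convert()` `reps` times and returns the average wall-clock time
// per call in seconds. Returns 0.0 when `reps` is zero.
template <class F>
double average_seconds(F convert, std::size_t reps)
{
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();
    for (std::size_t i = 0; i < reps; ++i)
        convert();
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return reps ? elapsed.count() / static_cast<double>(reps) : 0.0;
}
```

The callable should produce a result that is actually used afterwards (a
checksum or output length, say); otherwise an optimizing compiler may remove
the work being timed.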

Regards,
Vladimir

