From: Vladimir Prus (ghost_at_[hidden])
Date: 2001-11-01 04:36:55


Ronald Garcia wrote:

> MA> is there any relation between your work and vladimir prus, who
> MA> uploaded some codecvt code about a month ago?
> MA> http://groups.yahoo.com/group/boost/message/17772
>
> I have taken a look at the above message and the code that it refers
> to. I can't quite grasp what the code is doing,
> but according to descriptions it appears to provide two codecvt
> facets: one converting from a utf-8 external (file) representation to
> ucs2 internally (memory), and back, while the other converts from ucs2
> externally to utf8 internally. I may be wrong and so the author may
> wish to correct me here.

Indeed, I do wish to correct that. The other codecvt converts from ucs2
externally to ucs2 internally, i.e. it does no conversion. As far as I can
tell, the C++ standard does not require the default conversion facet to use
any particular encoding, and under bcc external files are treated as
something called a "multibyte string". I have no idea what that is, but it
does not seem to be ucs2 at all.
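
To see what the default facet does on a given compiler, one can simply ask
it. The following is only a sketch: codecvt_info and
describe_default_codecvt are hypothetical names of mine, not part of the
uploaded code, and the values reported depend on the implementation and the
locale.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

// Hypothetical summary of what a locale's default wchar_t/char facet reports.
struct codecvt_info {
    bool always_noconv; // true if the facet never converts anything
    int encoding;       // -1: state-dependent, 0: variable width, n > 0: fixed width
    int max_length;     // max external chars needed for one internal char
};

// Query the standard codecvt<wchar_t, char, mbstate_t> facet of a locale.
inline codecvt_info describe_default_codecvt(const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& f = std::use_facet<facet_type>(loc);
    codecvt_info info = { f.always_noconv(), f.encoding(), f.max_length() };
    // encoding() is documented to return -1, 0 or a positive width.
    assert(info.encoding >= -1);
    return info;
}
```

Calling describe_default_codecvt(std::locale::classic()) and printing the
three fields shows what "default" means for one implementation; nothing in
the standard says the answer must be the same elsewhere.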

> MA> is there a reason not to introduce a fixed typedef
> MA> boost::ucs4_t, as a uint32_t? then there could be a version
> MA> of this that would work on any platform. as you know, on
> MA> win32 (and elsewhere?) wchar_t is 16bits, so you are currently
> MA> forcing platform-specific specialization.
>
> I chose to implement the facet as a template to avoid making solid
> decisions about the types used to represent utf-8 elements and
> ucs-4 elements. It makes sense that compilers with large enough
> wchar_t should use std::codecvt<wchar_t,char,std::mbstate_t>,
> wofstream, and wifstream for file streaming, but you
> are correct that for windows one would have to provide
> specializations. I'm pretty new to this area of the C++ library and
> so I'm trying to get a feel for what works best.

Another correction: wchar_t is 16 bits only for *some* Windows compilers. But
in principle, the ability to use any type for the internal character would be
desirable (and it costs nothing to have it).
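
To illustrate why it costs nothing, here is a minimal sketch of the decoding
half with the internal type left as a template parameter. It is not the
posted facet: decode_utf8_bmp is a hypothetical helper, it handles only 1 to
3 byte sequences (the ucs2 range), and it does not reject overlong forms or
surrogate values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 into code values stored as any integral InternT.
// Returns false on truncated input, stray continuation bytes,
// or lead bytes of sequences longer than 3 bytes.
template <typename InternT>
bool decode_utf8_bmp(const std::string& in, std::vector<InternT>& out)
{
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        unsigned long cp;   // code value being assembled
        std::size_t extra;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else return false;
        if (in.size() - i <= extra) return false;  // sequence is cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(static_cast<InternT>(cp));
        i += extra + 1;
    }
    return true;
}

// Usage: the same code serves a 16-bit and a wider internal type.
inline void decode_utf8_bmp_example()
{
    std::vector<unsigned short> narrow;
    std::vector<unsigned long> wide;
    const std::string euro("\xE2\x82\xAC");  // U+20AC
    assert(decode_utf8_bmp(euro, narrow) && narrow[0] == 0x20AC);
    assert(decode_utf8_bmp(euro, wide) && wide[0] == 0x20AC);
}
```

The only place the internal type appears is the final static_cast, so
choosing wchar_t, unsigned short or a 32-bit integer is left to the user.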

> MA> even on systems where wchar_t is 32bits, there are no
> MA> guarantees that the implementation character set is unicode.
> MA> even if __STDC_ISO_10646__ is defined, i'm not sure if that
> MA> strictly guarantees that the values are comparable with cast
> MA> ints, because it (i think) is still implementation defined
> MA> what the signedness and endianness is of wchar_t storage, even
> MA> if the code value space is unicode.
>
> I'm not sure what you are referring to here. Could you run that by me
> again?

I'm not sure either. The signedness of wchar_t is not important if it is
32 bits wide, since, IIRC, Unicode requires only 31 bits. And I don't
understand how the endianness of wchar_t storage can matter at all. Regarding
the relation between wchar_t and Unicode, we have:
std::2.13.2/2:
The value of a wide-character literal containing a single c-char has value
equal to the numerical value of the encoding of the c-char in the execution
wide-character set.
std::2.2/3:
The values of the members of the execution character sets are
implementation-defined...

This, in theory, seems to mean that wide literals can use an arbitrary
encoding, but I really doubt this is ever the case in practice. So I think we
can safely assume that it's ok to use wchar_t for Unicode, provided it's wide
enough.
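
"Wide enough" can be checked in code rather than assumed. A sketch, with
hypothetical helper names of my own (has_value_bits, wchar_holds_ucs2,
wchar_holds_ucs4, to_wchar): numeric_limits<T>::digits counts non-sign bits,
so a signed 32-bit wchar_t still reports 31, which is the point about
signedness made above.

```cpp
#include <cassert>
#include <cwchar>
#include <limits>

// True when CharT has at least `bits` non-sign bits, i.e. every code
// value below 2^bits fits in CharT as a non-negative number.
template <typename CharT>
bool has_value_bits(int bits)
{
    return std::numeric_limits<CharT>::digits >= bits;
}

// 16 bits cover ucs2; 31 bits cover the full range mentioned above.
inline bool wchar_holds_ucs2() { return has_value_bits<wchar_t>(16); }
inline bool wchar_holds_ucs4() { return has_value_bits<wchar_t>(31); }

// Store a code value in a wchar_t; the assert documents the width assumption.
inline wchar_t to_wchar(unsigned long code)
{
    assert(code <= static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()));
    return static_cast<wchar_t>(code);
}
```

A facet could use wchar_holds_ucs4() to decide whether wchar_t is usable as
the internal type, and fall back to a user-supplied type otherwise. This says
nothing about which encoding wide literals use; it only checks the width.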

I also think that the performance aspects of the ucs2 codecvt should be
considered.

Regards,
Vladimir

