From: Damien Fisher (dfisher_at_[hidden])
Date: 2001-09-26 07:32:58
----- Original Message -----
Sent: Wednesday, September 26, 2001 5:45 PM
Subject: [boost] Unicode in C++; Was: New file uploaded to boost
> > File : /dlw_uc.zip
> > Uploaded by : darylew_at_m...
> > Description : Version 1; Prototype Unicode library
> Obviously the obvious isn't obvious to everybody... :-(
> Just for the record: 16 bits aren't sufficient to represent Unicode
> in the first place! (I'm feeling really sorry for the poor Java
> guys with their required 16-bit wide character type; in C++,
> implementers at least have the possibility to use a reasonable
> wide character type, although some vendors successfully managed to
> choose the wrong number of bits...) Jeremy's article indicates that
> Unicode has 21 bits (I think it is only 20 bits), but in any
> case it is more than 16 bits.
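[Just to make the sizes concrete: the mail below uses the names 'ucchar_t'
and 'ucint_t'. On a platform whose 'wchar_t' is only 16 bits wide, one
plausible (and by no means the only) choice would be a plain typedef:

    // Sketch only: assumes 'unsigned long' has at least 32 bits on the
    // target platform, which is enough for code points 0x0 .. 0x10FFFF.
    typedef unsigned long ucchar_t;   // character type
    typedef unsigned long ucint_t;    // int_type; leaves room for an EOF value

The sketches further down use these two names.]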
> The approach to adding Unicode characters to the standard C++ library
> is to provide a typedef on platforms that have a "reasonable"
> choice for 'wchar_t' (to be fair: at the time some vendors decided
> to use 16 bits for 'wchar_t', Unicode indeed *was* 16 bits only;
> this changed relatively recently). For other platforms another
> character type has to be defined. A character type basically has
> to satisfy the following:
> - It has to be a POD (well, I think this isn't explicitly required,
> but taking it all together it seems to be the only viable choice).
> - There has to be an appropriate character traits class, say a
> specialization of 'std::char_traits<ucchar_t>' (which is basically
> what the uploaded file provides; see also the sketch after this
> list). A basic restriction is
> that 'pos_type' is just 'std::fpos<std::mbstate_t>' and
> 'state_type' is 'std::mbstate_t' (otherwise there are no
> guarantees for the underlying streams). Note that 'ucint_t' can
> be identical to 'ucchar_t' if it has more bits than are necessary
> for the representation of all valid characters. That is, the
> 'char_type' and 'int_type' can be identical. The only requirement
> is that 'int_type' has one value which can be used to indicate
> various conditions, mostly EOF (hence its name) but also various
> other errors.
> - For convenient processing, an appropriate set of
> 'std::codecvt<ucchar_t, char, std::mbstate_t>' facets is used, e.g.
> one for UTF-8, one for UTF-16, one for ... This should enable
> correct reading and writing of Unicode files.
> - To use the standard library formatting facilities, which include
> writing simple strings (which still have to be null terminated,
> BTW; I don't know whether Unicode has assigned a character to the
> value '0', but I think it has not; since the "null character" is
> constructed by the default constructor, this can be taken care of
> anyway), a
> 'std::ctype<ucchar_t>' facet is necessary. For more advanced
> formatting, in particular for numeric formatting, a
> 'std::numpunct<ucchar_t>' facet also has to be provided.
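[A minimal sketch of the traits specialization described in the second
point of that list, using the 'ucchar_t'/'ucint_t' typedefs from above.
This is an illustration, not the code from the uploaded file, and strictly
speaking the specialization should be keyed on a dedicated user-defined
type rather than a typedef for a built-in one:

    #include <cstddef>   // std::size_t
    #include <cstring>   // std::memcpy, std::memmove
    #include <cwchar>    // std::mbstate_t
    #include <ios>       // std::streamoff, std::fpos
    #include <string>    // std::char_traits

    namespace std {
        template <>
        struct char_traits<ucchar_t> {
            typedef ucchar_t                  char_type;
            typedef ucint_t                   int_type;   // may equal char_type
            typedef std::streamoff            off_type;
            typedef std::fpos<std::mbstate_t> pos_type;   // required for filebuf use
            typedef std::mbstate_t            state_type; // ditto

            static void assign(char_type& c1, const char_type& c2) { c1 = c2; }
            static bool eq(const char_type& a, const char_type& b) { return a == b; }
            static bool lt(const char_type& a, const char_type& b) { return a < b; }

            static int compare(const char_type* s1, const char_type* s2, std::size_t n) {
                for (std::size_t i = 0; i != n; ++i) {
                    if (lt(s1[i], s2[i])) return -1;
                    if (lt(s2[i], s1[i])) return 1;
                }
                return 0;
            }
            static std::size_t length(const char_type* s) {
                std::size_t n = 0;
                while (!eq(s[n], char_type()))   // char_type() is the "null character"
                    ++n;
                return n;
            }
            static const char_type* find(const char_type* s, std::size_t n, const char_type& c) {
                for (std::size_t i = 0; i != n; ++i)
                    if (eq(s[i], c)) return s + i;
                return 0;
            }
            static char_type* move(char_type* d, const char_type* s, std::size_t n) {
                return static_cast<char_type*>(std::memmove(d, s, n * sizeof(char_type)));
            }
            static char_type* copy(char_type* d, const char_type* s, std::size_t n) {
                return static_cast<char_type*>(std::memcpy(d, s, n * sizeof(char_type)));
            }
            static char_type* assign(char_type* s, std::size_t n, char_type c) {
                for (std::size_t i = 0; i != n; ++i) s[i] = c;
                return s;
            }

            // One value reserved for EOF and other conditions: anything outside
            // the range of valid code points (0x0 .. 0x10FFFF) will do.
            static int_type eof() { return static_cast<int_type>(0xFFFFFFFFul); }
            static int_type not_eof(const int_type& c) { return eq_int_type(c, eof()) ? int_type(0) : c; }
            static char_type to_char_type(const int_type& c) { return static_cast<char_type>(c); }
            static int_type to_int_type(const char_type& c) { return static_cast<int_type>(c); }
            static bool eq_int_type(const int_type& a, const int_type& b) { return a == b; }
        };
    }
]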
> Once all of this is in place, there should be no problem defining
> 'std::basic_string<ucchar_t>' and a family of stream classes using
> 'ucchar_t'. For successful use of the stream classes, these have to
> use an appropriate 'std::locale' object with all those facets
> included, which is, of course, no problem when 'ucchar_t' is just a
> typedef for 'wchar_t'. Otherwise a locale object with the facets
> added has to be created and either 'imbue()'ed into all 'ucchar_t'
> streams or installed as the global locale.
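[Roughly how those pieces might be plugged together. 'utf8_codecvt' below
is only a placeholder: its do_in()/do_out() do not actually convert
anything yet, and the sketch assumes the library implementation allows
'std::codecvt' to be instantiated for a user-defined internal character
type (the standard only guarantees the char and wchar_t specializations).
A 'std::ctype<ucchar_t>' facet (and 'std::numpunct<ucchar_t>' for numeric
formatting) would be written and registered the same way:

    #include <cstddef>
    #include <cwchar>
    #include <fstream>
    #include <locale>
    #include <string>

    // Placeholder facet: the do_...() overrides only mark where the real
    // UTF-8 <-> ucchar_t conversion would go.
    class utf8_codecvt : public std::codecvt<ucchar_t, char, std::mbstate_t> {
    public:
        explicit utf8_codecvt(std::size_t refs = 0)
            : std::codecvt<ucchar_t, char, std::mbstate_t>(refs) {}

    protected:
        virtual result do_in(state_type&, const char* from, const char*,
                             const char*& from_next, ucchar_t* to, ucchar_t*,
                             ucchar_t*& to_next) const
        { from_next = from; to_next = to; return error; }   // decode UTF-8 here

        virtual result do_out(state_type&, const ucchar_t* from, const ucchar_t*,
                              const ucchar_t*& from_next, char* to, char*,
                              char*& to_next) const
        { from_next = from; to_next = to; return error; }   // encode UTF-8 here

        virtual result do_unshift(state_type&, char* to, char*, char*& to_next) const
        { to_next = to; return noconv; }

        virtual int  do_encoding() const throw()      { return 0; }   // variable width
        virtual bool do_always_noconv() const throw() { return false; }
        virtual int  do_max_length() const throw()    { return 4; }   // longest UTF-8 sequence
        virtual int  do_length(state_type&, const char* from, const char* end,
                               std::size_t max) const
        {
            std::size_t n = static_cast<std::size_t>(end - from);
            return static_cast<int>(n < max ? n : max);
        }
    };

    typedef std::basic_string<ucchar_t>   ucstring;     // Unicode string
    typedef std::basic_ifstream<ucchar_t> ucifstream;   // Unicode input file stream

    // The facets must be visible before any 'ucchar_t' stream is created,
    // e.g. by installing them in the global locale.  The ctype facet is not
    // optional: without it the streams fail already at construction time.
    //
    //     std::locale uc_loc(std::locale(), new utf8_codecvt);
    //     uc_loc = std::locale(uc_loc, new uc_ctype);   // user-written ctype<ucchar_t>
    //     std::locale::global(uc_loc);
    //
    //     ucifstream in("data.txt");   // the filebuf now converts via utf8_codecvt
    //     ucstring line;
    //     std::getline(in, line);
]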
> With this approach nearly everything is addressed. What is not
> addressed are encodings leaking into the core via, e.g., a socket
> stream which bypasses file streams: Only file streams (more
> precisely 'std::basic_filebuf') use the code conversion facets by
> default. Of course, e.g. a socket stream can also use a code
> conversion facet but it is pretty hard to implement a reasonable
> stream buffer using code conversions. Actually, when you look at
> typical implementations shipping with commercial compilers, you will
> see that characters are processed individually (at least under
> certain conditions). This isn't really reasonable! What is needed
> here is probably some sort of filtering stream buffer that changes
> the character type by applying a code conversion: The interface to
> external sources using 'char' (or, actually, any other character
> type) underneath would be just 'std::basic_streambuf<char>', but potential
> encodings would be transformed by the corresponding filtering
> stream buffer. I have only a partial implementation of such a
> filtering stream buffer ready (as part of my standard C++ library
> implementation). Otherwise I would contribute it...
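[I have not seen that partial implementation, but the idea can be sketched
roughly as follows. The sketch only handles the reading direction, relies
on the 'char_traits<ucchar_t>' specialization and a
'std::codecvt<ucchar_t, char, std::mbstate_t>' facet (such as the
placeholder above) being present in the locale passed in, and glosses over
several corner cases (bytes left over at buffer boundaries, seeking,
output):

    #include <cwchar>     // std::mbstate_t
    #include <locale>     // std::codecvt, std::use_facet
    #include <streambuf>  // std::basic_streambuf, std::streambuf

    class uc_filter_buf : public std::basic_streambuf<ucchar_t> {
    public:
        typedef std::codecvt<ucchar_t, char, std::mbstate_t> converter;

        // 'src' is the narrow stream buffer sitting on top of the external
        // source (file, socket, ...); 'loc' must contain the converter facet.
        uc_filter_buf(std::streambuf* src, const std::locale& loc)
            : src_(src), loc_(loc),
              cvt_(&std::use_facet<converter>(loc_)), state_()
        { setg(buf_, buf_, buf_); }            // start with an empty get area

    protected:
        virtual int_type underflow()
        {
            if (gptr() < egptr())
                return traits_type::to_int_type(*gptr());

            // Refill the narrow buffer from the wrapped stream buffer ...
            std::streamsize n = src_->sgetn(ext_, ext_size);
            if (n <= 0)
                return traits_type::eof();

            // ... and convert a whole chunk at once instead of going through
            // the external source character by character.
            const char* from_next = ext_;
            ucchar_t*   to_next   = buf_;
            std::codecvt_base::result r =
                cvt_->in(state_, ext_, ext_ + n, from_next,
                         buf_, buf_ + buf_size, to_next);
            if (r == std::codecvt_base::error || to_next == buf_)
                return traits_type::eof();     // nothing usable decoded

            // Bytes not consumed here (from_next < ext_ + n) would have to be
            // carried over to the next refill; omitted to keep the sketch short.
            setg(buf_, buf_, to_next);
            return traits_type::to_int_type(*gptr());
        }

    private:
        static const int ext_size = 1024;      // external (narrow) buffer
        static const int buf_size = 1024;      // internal (ucchar_t) buffer

        std::streambuf*  src_;
        std::locale      loc_;                 // keeps the facet alive
        const converter* cvt_;
        std::mbstate_t   state_;
        char             ext_[ext_size];
        ucchar_t         buf_[buf_size];
    };

    // Usage idea (names hypothetical), once the ucchar_t facets are installed
    // in the global locale as described above:
    //
    //     uc_filter_buf in_buf(socket_stream.rdbuf(), uc_loc);
    //     std::basic_istream<ucchar_t> in(&in_buf);
]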
AFAICT (from the spec at
http://www.w3.org/TR/2000/REC-xml-20001006#sec-internal-ent), the ability to
switch between encodings for a given stream is vital for a conformant XML
parser, as it must support at least UTF-8 and UTF-16 encodings, and these
can change on an entity-by-entity basis.
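[To make that concrete, one rough sketch: real in-band detection follows the
non-normative Appendix F of the XML spec (byte order mark first, then the
encoding= pseudo-attribute of the XML or text declaration), but even a
simplified probe shows how the result would pick the conversion facet for
that particular entity's stream:

    #include <fstream>

    enum entity_encoding { enc_utf8, enc_utf16 };

    // Look at the first two bytes of an external entity; a UTF-16 byte order
    // mark is FE FF or FF FE.  Everything else is treated as UTF-8 here,
    // which is the default an XML processor has to assume.
    entity_encoding detect_encoding(const char* filename)
    {
        std::ifstream probe(filename, std::ios::in | std::ios::binary);
        int b0 = probe.get();
        int b1 = probe.get();
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE))
            return enc_utf16;
        return enc_utf8;
    }

    // The result selects the facet for *that* entity only, e.g. (using the
    // names sketched earlier; 'utf16_codecvt' would be written along the
    // same lines as the 'utf8_codecvt' placeholder):
    //
    //     ucifstream in;
    //     if (detect_encoding(name) == enc_utf16)
    //         in.imbue(std::locale(std::locale(), new utf16_codecvt));
    //     else
    //         in.imbue(std::locale(std::locale(), new utf8_codecvt));
    //     in.open(name);
]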
I have to admit I never really use the C++ stream libraries (never found a
need; never fast enough for my needs; portability not needed; etc.), so I
don't know what I'm talking about here, but from a cursory inspection of the
documentation for my STL implementation it seems that such an ability is
lacking. It also seems that the last paragraph in the above e-mail attempts
to address this issue. So I would guess that such a library would be
required before we could really expect to develop any useful XML parser.