From: Damien Fisher (dfisher_at_[hidden])
Date: 2001-09-26 07:32:58
----- Original Message -----
From: <dietmar_kuehl_at_[hidden]>
To: <boost_at_[hidden]>
Sent: Wednesday, September 26, 2001 5:45 PM
Subject: [boost] Unicode in C++; Was: New file uploaded to boost
> Hi,
>
> > File : /dlw_uc.zip
> > Uploaded by : darylew_at_m...
> > Description : Version 1; Prototype Unicode library
>
> Obviously the obvious isn't obvious to everybody... :-(
>
> Just for the record: 16 bits aren't sufficient to represent Unicode
> in the first place! (I'm feeling really sorry for the poor Java
> guys with their required 16 bit wide character type; in C++
> implementers at least have the possibility to use a reasonable
> wide character type although some vendors successfully managed to
> choose the wrong number of bits...) Jeremy's article indicates that
> Unicode has 21 bits, which is indeed what code points up to
> U+10FFFF require; in any case it is more than 16 bits.
>
> The approach to adding Unicode characters to the standard C++ library
> is to provide a typedef for the platforms having a "reasonable"
> choice for 'wchar_t' (to be fair: at the time some vendors decided
> to use 16 bits for 'wchar_t' Unicode indeed *was* 16 bits only;
> this changed relatively recently). For other platforms another
> character type has to be defined. A character type basically has
> to satisfy the following:
>
> - It has to be a POD (well, I think it isn't explicit but taking
> it all together it seems that this is the only viable choice).
> - There has to be an appropriate character traits class, say a
> specialization for 'std::char_traits<ucchar_t>' (which is
> basically what is in the uploaded file). A basic restriction is
> that 'pos_type' is just 'std::fpos<std::mbstate_t>' and
> 'state_type' is 'std::mbstate_t' (otherwise there are no
> guarantees for the underlying streams). Note that 'ucint_t' can
> be identical to 'ucchar_t' if the latter has more bits than are
> necessary for the representation of all valid characters. That
> is, 'char_type' and 'int_type' can be identical. The only
> requirement is that 'int_type' has one spare value which can be
> used to indicate various conditions, mostly EOF (hence its name)
> but also various other errors. (A sketch of such a traits
> specialization follows this list.)
> - For convenient processing, an appropriate set of
> 'std::codecvt<ucchar_t, char, std::mbstate_t>' facets is used,
> e.g. one for UTF-8, one for UTF-16, one for ... This should enable
> correct reading and writing of Unicode files.
> - To use the standard library formatting facilities, which includes
> writing simple strings (which still have to be null-terminated,
> BTW; Unicode does assign U+0000 to the null character, and since
> the "null character" is constructed by the default constructor
> this is taken care of anyway), a 'std::ctype<ucchar_t>' facet is
> necessary. For more advanced formatting, in particular for
> numeric formatting, a 'std::numpunct<ucchar_t>' facet also has to
> be provided.
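>
> A minimal sketch of what such a character type and its traits
> specialization might look like (the names 'ucchar_t' and 'ucint_t'
> are the placeholders used above; the 'unsigned long' representation
> and the elided members are assumptions of this sketch, not part of
> the uploaded library):
>
>   #include <cstddef>  // std::size_t
>   #include <cwchar>   // std::mbstate_t
>   #include <ios>      // std::fpos, std::streamoff
>   #include <string>   // primary std::char_traits template
>
>   // POD character type wide enough for all Unicode code points:
>   struct ucchar_t { unsigned long code; };
>
>   typedef unsigned long ucint_t;  // leaves room for an eof() value
>
>   namespace std {
>       template <> struct char_traits<ucchar_t> {
>           typedef ucchar_t        char_type;
>           typedef ucint_t         int_type;
>           typedef streamoff       off_type;
>           typedef fpos<mbstate_t> pos_type;    // the only safe choice
>           typedef mbstate_t       state_type;  // ditto
>
>           static void assign(char_type& d, const char_type& s)
>               { d = s; }
>           static bool eq(const char_type& a, const char_type& b)
>               { return a.code == b.code; }
>           static bool lt(const char_type& a, const char_type& b)
>               { return a.code < b.code; }
>
>           static size_t length(const char_type* s) {
>               // a value-initialized ucchar_t is the null character
>               size_t n = 0;
>               while (s[n].code != 0)
>                   ++n;
>               return n;
>           }
>           static int compare(const char_type* a, const char_type* b,
>                              size_t n) {
>               for (size_t i = 0; i != n; ++i) {
>                   if (lt(a[i], b[i])) return -1;
>                   if (lt(b[i], a[i])) return 1;
>               }
>               return 0;
>           }
>           static const char_type* find(const char_type* s, size_t n,
>                                        const char_type& c) {
>               for (size_t i = 0; i != n; ++i)
>                   if (eq(s[i], c)) return s + i;
>               return 0;
>           }
>           // copy(), move(), and the fill form of assign() are
>           // analogous loops and omitted here
>
>           static int_type to_int_type(const char_type& c)
>               { return c.code; }
>           static char_type to_char_type(const int_type& i)
>               { char_type c = { i }; return c; }
>           static bool eq_int_type(const int_type& a, const int_type& b)
>               { return a == b; }
>           static int_type eof()
>               { return 0xFFFFFFFFul; }  // not a valid code point
>           static int_type not_eof(const int_type& i)
>               { return i == eof() ? 0 : i; }
>       };
>   }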
>
> Once all of this is in place, there should be no problem defining
> 'std::basic_string<ucchar_t>' and a family of stream classes using
> 'ucchar_t'. For successful use of the stream classes, these have to
> use an appropriate 'std::locale' object with all those facets
> included, which is, of course, no problem when 'ucchar_t' is just a
> typedef for 'wchar_t'. Otherwise a locale object with the facets
> added has to be created and either 'imbue()'ed into all 'ucchar_t'
> streams or installed as the global locale.
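>
> For illustration, creating and using such a locale might look like
> this ('ucchar_t' and its traits as in the sketch above;
> 'utf8_codecvt' is a hypothetical facet whose 'do_in()'/'do_out()'
> overrides would implement the actual UTF-8 conversion):
>
>   #include <cstddef>
>   #include <fstream>
>   #include <locale>
>
>   class utf8_codecvt
>       : public std::codecvt<ucchar_t, char, std::mbstate_t> {
>   public:
>       explicit utf8_codecvt(std::size_t refs = 0)
>           : std::codecvt<ucchar_t, char, std::mbstate_t>(refs) {}
>   protected:
>       // do_in()/do_out() overrides implementing UTF-8 elided
>   };
>
>   int main() {
>       // the locale object takes ownership of the facet
>       std::locale uc(std::locale(), new utf8_codecvt);
>       // std::ctype<ucchar_t> and std::numpunct<ucchar_t> facets
>       // would be added the same way
>
>       std::basic_ifstream<ucchar_t> in;
>       in.imbue(uc);           // before open(), so the facet is used
>       in.open("data.utf8");   // reads UTF-8, produces ucchar_t
>
>       // alternatively: std::locale::global(uc);
>   }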
>
> With this approach nearly everything is addressed. What is not
> addressed are encodings leaking into the core via e.g. a socket
> stream which bypasses file streams: only file streams (more
> precisely 'std::basic_filebuf') use the code conversion facets by
> default. Of course, a socket stream, for example, can also use a
> code conversion facet, but it is pretty hard to implement a
> reasonable stream buffer using code conversions. Actually, when
> you look at typical implementations shipping with commercial
> compilers you will see that characters are processed individually
> (at least under certain conditions). This isn't really reasonable!
> What is needed here is probably some sort of filtering stream
> buffer changing the character type by applying a code conversion:
> the interface to external sources using 'char' (or, actually, any
> other character type) underneath would be just
> 'std::basic_streambuf<char>', but potential encodings would be
> transformed by the corresponding filtering stream buffer. I have
> only a partial implementation of such a filtering stream buffer
> ready (as part of my standard C++ library implementation);
> otherwise I would contribute it...
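>
> To make the idea concrete, here is a rough sketch of the read side
> of such a filtering stream buffer ('ucchar_t' as in the sketch
> above; the write side, error handling, and carrying partially
> converted byte sequences over to the next refill are all omitted):
>
>   #include <locale>
>   #include <streambuf>
>
>   class uc_filterbuf : public std::basic_streambuf<ucchar_t> {
>   public:
>       uc_filterbuf(std::streambuf* src, const std::locale& loc)
>           : src_(src), loc_(loc), state_() {}
>   private:
>       typedef std::codecvt<ucchar_t, char, std::mbstate_t> cvt_type;
>
>       virtual int_type underflow() {
>           if (gptr() == egptr()) {
>               // refill from the underlying byte stream buffer
>               std::streamsize n = src_->sgetn(ext_, BUF_SIZE);
>               if (n <= 0)
>                   return traits_type::eof();
>               // convert a whole block of external characters at
>               // once rather than one character at a time
>               const cvt_type& cvt = std::use_facet<cvt_type>(loc_);
>               const char* from_next;
>               ucchar_t*   to_next;
>               cvt.in(state_, ext_, ext_ + n, from_next,
>                      buf_, buf_ + BUF_SIZE, to_next);
>               // bytes in [from_next, ext_ + n) belong to the next
>               // block; a real implementation has to keep them
>               setg(buf_, buf_, to_next);
>               if (gptr() == egptr())
>                   return traits_type::eof();
>           }
>           return traits_type::to_int_type(*gptr());
>       }
>
>       static const int BUF_SIZE = 1024;
>       std::streambuf* src_;
>       std::locale     loc_;
>       std::mbstate_t  state_;
>       char            ext_[BUF_SIZE];  // external (encoded) data
>       ucchar_t        buf_[BUF_SIZE];  // internal characters
>   };
>
> Such a filter could sit on top of e.g. a socket's 'char' stream
> buffer; switching the encoding mid-stream would, in principle,
> amount to continuing with a different 'codecvt' facet.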
AFAICT (from the spec at
http://www.w3.org/TR/2000/REC-xml-20001006#sec-internal-ent), the ability to
switch between encodings for a given stream is vital for a conformant XML
parser, as it must support at least UTF-8 and UTF-16 encodings, and these
can be changed at an entity-by-entity level.
I have to admit that I never really use the C++ stream libraries
(never found a need; never fast enough for my needs; portability
not needed; etc.), so I don't know what I'm talking about. Still,
it seems from a cursory inspection of the documentation for my
implementation of the STL that such an ability is lacking. It also
seems that the last paragraph in the above e-mail attempts to
address this issue. So I would guess that such a library would be
required before we could really expect to develop any useful XML
parser.
Damien