From: dietmar_kuehl_at_[hidden]
Date: 2001-09-26 02:45:41


Hi,

> File : /dlw_uc.zip
> Uploaded by : darylew_at_m...
> Description : Version 1; Prototype Unicode library

Obviously the obvious isn't obvious to everybody... :-(

Just for the record: 16 bits aren't sufficient to represent Unicode
in the first place! (I'm feeling really sorry for the poor Java
guys with their required 16 bit wide character type; in C++
implementers at least have the possibility to use a reasonable
wide character type although some vendors successfully managed to
choose the wrong number of bits...) Jeremy's article indicates that
Unicode has 21 bits although I think it is only 20 bits but in any
case it is more than 16 bits.
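
To make the bit counts concrete, here is a minimal sketch. It is not
taken from the uploaded file, and the helper names 'bits_needed' and
'to_utf16' are made up for illustration. It assumes today's Unicode
code space, whose largest code point is U+10FFFF, and shows that such
a value does not fit into a single 16 bit unit:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of bits needed to store the value v (the value 0 counts as 1 bit).
unsigned bits_needed(std::uint32_t v) {
    unsigned n = 1;
    while (v >>= 1) ++n;
    return n;
}

// Encode one code point as UTF-16: one unit inside the 16 bit range,
// a surrogate pair (two units) for everything above it.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Assumption: the caller passes a valid, non-surrogate code point.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

With these definitions 'bits_needed(0x10FFFF)' yields 21, and
'to_utf16(0x1F600)' yields the two units 0xD83D and 0xDE00, so a
16 bit character type cannot hold every character in one object.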

The approach to add Unicode characters to the standard C++ library
is to provide a typedef for the platforms having a "reasonable"
choice for 'wchar_t' (to be fair: at the time some vendors decided
to use 16 bits for 'wchar_t' Unicode indeed *was* 16 bits only;
this changed relatively recently). For other platforms another
character type has to be defined. A character type basically has
to satisfy the following:

- It has to be a POD (well, I think it isn't explicit but taking
  it all together it seems that this is the only viable choice).
- There has to be an appropriate character traits class, say a
  specialization for 'std::char_traits<ucchar_t>' (which is what
  is basically there in the uploaded file). A basic restriction is
  that 'pos_type' is just 'std::fpos<std::mbstate_t>' and
  'state_type' is 'std::mbstate_t' (otherwise there are no
  guarantees for the underlying streams). Note that 'ucint_t' can
  be identical to 'ucchar_t' if it has more bits than are necessary
  for the representation of all valid characters. That is, the
  'char_type' and 'int_type' can be identical. The only requirement
  is that 'int_type' has one value which can be used to indicate
  various conditions, mostly EOF (hence its name) but also various
  other errors.
- For convenient processing, an appropriate set of
  'std::codecvt<ucchar_t, char, std::mbstate_t>' facets is used, e.g.
  one for UTF-8, one for UTF-16, one for ... This should enable
  correct reading and writing of Unicode files.
- To use the standard library formatting facilities, which include
  writing simple strings (which still have to be null terminated,
  BTW; I don't know whether Unicode has assigned a value to '0' but
  I think it has not; since the "null character" is constructed by
  the default constructor, this can be taken care of anyway), a
  'std::ctype<ucchar_t>' facet is necessary. For more advanced
  formatting, in particular for numeric formatting, a
  'std::numpunct<ucchar_t>' facet also has to be provided.
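
The first two requirements in the list above can be sketched as
follows. This is only an illustration under stated assumptions, not
the code from the uploaded file: 'ucchar_t' is a hypothetical POD
struct, 'make_uc' is a made-up helper, and the traits live in a
separate class 'uc_traits' rather than in a specialization of
'std::char_traits'. The 'int_type' reserves one value outside the
code space for EOF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

// Hypothetical POD character type, wide enough for every code point.
struct ucchar_t { std::uint32_t value; };

// Hypothetical helper: build a ucchar_t from a code point.
inline ucchar_t make_uc(std::uint32_t code_point) {
    assert(code_point <= 0x10FFFF);
    ucchar_t c = { code_point };
    return c;
}

struct uc_traits {
    typedef ucchar_t                  char_type;
    typedef std::uint32_t             int_type;   // has room for an EOF value
    typedef std::streamoff            off_type;
    typedef std::fpos<std::mbstate_t> pos_type;   // as required for streams
    typedef std::mbstate_t            state_type;

    static void assign(char_type& d, const char_type& s) { d = s; }
    static bool eq(const char_type& a, const char_type& b) { return a.value == b.value; }
    static bool lt(const char_type& a, const char_type& b) { return a.value < b.value; }

    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    // Length of a sequence terminated by the default-constructed character.
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].value != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
    static char_type* move(char_type* d, const char_type* s, std::size_t n) {
        if (n != 0) std::memmove(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* copy(char_type* d, const char_type* s, std::size_t n) {
        if (n != 0) std::memcpy(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* assign(char_type* d, std::size_t n, char_type c) {
        for (std::size_t i = 0; i < n; ++i) d[i] = c;
        return d;
    }

    static int_type eof() { return 0xFFFFFFFFu; }  // not a valid code point
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
    static char_type to_char_type(int_type i) { char_type c = { i }; return c; }
    static int_type to_int_type(const char_type& c) { return c.value; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
};

typedef std::basic_string<ucchar_t, uc_traits> ucstring;
```

A 'ucstring' can then be filled with 'push_back(make_uc(0x1F600))'
and searched with 'find()' like any other string. The code conversion,
'ctype' and 'numpunct' facets from the list are not covered by this
sketch.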

Once all of this is in place, there should be no problem to define
'std::basic_string<ucchar_t>' and a family of stream classes using
'ucchar_t'. For successful use of the stream classes, these have to
use an appropriate 'std::locale' object with all those facets
included which is, of course, no problem when 'ucchar_t' is just a
typedef for 'wchar_t'. Otherwise a locale object with the facets
added has to be created and either 'imbue()'ed into all 'ucchar_t'
streams or installed as the global locale.
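
Creating such a locale and imbuing it can be sketched like this. The
example assumes the easy case where the character type is 'wchar_t',
and the facet 'apostrophe_numpunct' and the function
'format_with_facet' are made up for illustration. For a separate
'ucchar_t' the same pattern would have to be repeated for every facet
mentioned above.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: groups digits in threes, separated by an apostrophe.
class apostrophe_numpunct : public std::numpunct<wchar_t> {
protected:
    wchar_t do_thousands_sep() const override { return L'\''; }
    std::string do_grouping() const override { return "\3"; }
};

std::wstring format_with_facet(long value) {
    // Copy the classic locale and add (or replace) one facet.
    // The locale object takes ownership of the facet.
    const std::locale loc(std::locale::classic(), new apostrophe_numpunct);
    assert(std::has_facet<std::numpunct<wchar_t> >(loc));

    std::wostringstream out;
    out.imbue(loc);   // alternatively: std::locale::global(loc) before
                      // the stream is constructed
    out << value;
    return out.str();
}
```

Here 'format_with_facet(1234567)' produces the wide string
"1'234'567". Streams constructed after a call to
'std::locale::global(loc)' would pick up the facet without an
explicit 'imbue()'.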

With this approach nearly everything is addressed. What is not
addressed are encodings leaking into the core via e.g. a socket
stream which bypasses file streams: Only file streams (more
precisely 'std::basic_filebuf') use the code conversion facets by
default. Of course, e.g. a socket stream can also use a code
conversion facet but it is pretty hard to implement a reasonable
stream buffer using code conversions. Actually when you look at
typical implementations shipping with commercial compilers you will
see that characters are processed individually (at least under
certain conditions). This isn't really reasonable! What is needed
here is probably some sort of a filtering stream buffer changing
the character type by applying a code conversion: The interface to
external sources using 'char' (or, actually, any other character
type) underneath would be just 'std::basic_streambuf<char>' but potential
encodings would be transformed by the corresponding filtering
stream buffer. I have only a partial implementation of such a
filtering stream buffer ready (as part of my standard C++ library
implementation). Otherwise I would contribute it...
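
To illustrate the idea only (this is not the partial implementation
mentioned above), a filtering stream buffer for input could look
roughly like the sketch below. The assumptions: the wide side uses
'char32_t', the external encoding is UTF-8, and the decoding is
hand-written as a stand-in for a real 'std::codecvt' facet. The names
'utf8_input_filter' and 'decode_all' are invented. Note that it still
converts one character at a time, so it shows the interface rather
than an efficient design.

```cpp
#include <cassert>
#include <sstream>
#include <streambuf>
#include <string>

// Presents the bytes of an underlying char stream buffer as code points.
class utf8_input_filter : public std::basic_streambuf<char32_t> {
public:
    explicit utf8_input_filter(std::streambuf* source) : source_(source), current_(0) {
        assert(source != nullptr);
    }

protected:
    int_type underflow() override {
        if (gptr() < egptr()) return traits_type::to_int_type(*gptr());

        const int lead = source_->sbumpc();
        if (lead == std::char_traits<char>::eof()) return traits_type::eof();

        char32_t cp = 0xFFFD;  // replacement character for malformed input
        int extra = 0;         // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }

        for (; extra > 0; --extra) {
            const int next = source_->sgetc();
            if (next == std::char_traits<char>::eof() || (next & 0xC0) != 0x80) {
                cp = 0xFFFD;   // truncated or invalid sequence
                break;
            }
            source_->sbumpc();
            cp = (cp << 6) | static_cast<char32_t>(next & 0x3F);
        }

        current_ = cp;
        setg(&current_, &current_, &current_ + 1);  // one-character get area
        return traits_type::to_int_type(current_);
    }

private:
    std::streambuf* source_;
    char32_t current_;
};

// Convenience wrapper: decode a whole byte string through the filter.
std::u32string decode_all(const std::string& bytes) {
    std::stringbuf source(bytes);
    utf8_input_filter filter(&source);
    std::u32string result;
    typedef std::char_traits<char32_t> traits;
    for (traits::int_type c = filter.sbumpc(); c != traits::eof(); c = filter.sbumpc())
        result.push_back(traits::to_char_type(c));
    return result;
}
```

Any 'char' based stream buffer, for example one wrapping a socket,
could be passed in place of the 'std::stringbuf' used in
'decode_all'. A serious version would convert whole blocks through a
code conversion facet and also handle output and seeking.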

--
<mailto:dietmar_kuehl_at_[hidden]> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>
