From: Alberto Barbati (abarbati_at_[hidden])
Date: 2002-11-04 19:14:24


Hi Boosters,

I read in the list archives that there was a proposal by Vladimir Prus
about codecvt facets for utf8-ucs2 and ucs2-ucs2 conversion. The
proposal dates back to last year. I wonder what became of it.

However, I would like to up the ante and propose a much wider choice of
codecvt facets that could be used to effectively process Unicode text
files. This new proposal aims to be fully conformant to Unicode 3.2
requirements. Therefore, I will refer to the utf-8, utf-16 and utf-32
encodings of Unicode code points, disregarding the ucs-2 and ucs-4
counterparts.

The proposal shall include facets to convert:

   external     internal
   utf-8    ->  utf-16*  (BMP only - no surrogates)
   utf-8    ->  utf-16*  (all planes - surrogates allowed)
   utf-16LE ->  utf-16
   utf-16BE ->  utf-16
   utf-16** ->  utf-16
   utf-8    ->  utf-32
   utf-16LE ->  utf-32
   utf-16BE ->  utf-32
   utf-16** ->  utf-32
   utf-32   ->  utf-32

Notes:
(*) There are two utf-8 -> utf-16 facets because a 4-byte utf-8 code
unit sequence is mapped to a utf-16 surrogate pair. If the application
won't handle surrogates anyway, it can opt for a more optimized facet.
(Such processing is probably not conformant. Is this "optimization"
really needed?)
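
To make the surrogate issue concrete, here is a minimal sketch of the
mapping described in note (*). The names decode_utf8_4 and
to_surrogates are hypothetical helpers invented for illustration, not
part of the proposal, and the code assumes a well-formed 4-byte input
and a compiler with <cstdint>. A real facet would also have to validate
the lead and continuation bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: decode a 4-byte utf-8 sequence into a code point.
// Assumes p points at four well-formed bytes (lead byte 0xF0..0xF4);
// no validation is performed in this sketch.
inline std::uint32_t decode_utf8_4(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0] & 0x07) << 18)
         | (static_cast<std::uint32_t>(p[1] & 0x3F) << 12)
         | (static_cast<std::uint32_t>(p[2] & 0x3F) << 6)
         |  static_cast<std::uint32_t>(p[3] & 0x3F);
}

// Hypothetical helper: split a supplementary-plane code point
// (U+10000..U+10FFFF) into a utf-16 high/low surrogate pair.
inline std::pair<std::uint16_t, std::uint16_t>
to_surrogates(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    return std::make_pair(
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
}
```

A BMP-only facet would never produce the second code unit, which is
where the possible saving (and the conformance question) comes from.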

(**) External utf-16 facets will detect an initial BOM (U+FEFF) to
select the endian-ness of the external stream. (What should happen if
there is no BOM? Default to the endian-ness of the platform?)
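
The BOM check in note (**) could look roughly like the sketch below.
The enum utf16_byte_order and the function detect_utf16_bom are
hypothetical names used only for illustration. The sketch deliberately
returns "unknown" when no BOM is present and leaves the choice of a
default to the caller, since that question is still open.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for the sketch.
enum class utf16_byte_order { little_endian, big_endian, unknown };

// Inspect the first two bytes of an external utf-16 stream.
// U+FEFF serialized big-endian is FE FF; little-endian is FF FE.
inline utf16_byte_order
detect_utf16_bom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF)
            return utf16_byte_order::big_endian;
        if (data[0] == 0xFF && data[1] == 0xFE)
            return utf16_byte_order::little_endian;
    }
    return utf16_byte_order::unknown;  // no BOM: caller picks a default
}
```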

All proposed facets will be implemented as class templates, in order to
avoid any explicit reference to wchar_t or any other fixed-size integral
type. A compile-time assertion will simply be used to ensure that the
supplied type is large enough to hold the internal characters. (On
platforms where wchar_t has fewer than 32 bits, an application that wants
to use utf-32 facets will thus be responsible for choosing a suitable
integral type, defining char_traits and specializing basic_*stream
accordingly.)
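
As a rough illustration of the compile-time check, the sketch below
rejects internal types that are too narrow. All names (holds_bits,
utf32_facet_sketch) are hypothetical, and the class is only a stand-in
that shows the assertion, not a working codecvt facet. It uses the
language-level static_assert, so it assumes a C++11 compiler;
BOOST_STATIC_ASSERT could play the same role on older compilers. The
figure of 21 bits is my assumption of what "large enough" means for
utf-32, since the largest code point is U+10FFFF.

```cpp
#include <limits>

// Hypothetical trait: true if InternT has at least RequiredBits value bits.
template <typename InternT, int RequiredBits>
struct holds_bits {
    static constexpr bool value =
        std::numeric_limits<InternT>::digits >= RequiredBits;
};

// Stand-in for a facet with a utf-32 internal encoding. Instantiating it
// with a type that is too narrow fails to compile.
template <typename InternT>
class utf32_facet_sketch {
    static_assert(holds_bits<InternT, 21>::value,
                  "InternT is too small to hold a utf-32 code unit");
public:
    typedef InternT intern_type;
    typedef char    extern_type;

    // Largest Unicode code point, expressed in the internal type.
    static InternT max_code_point()
    {
        return static_cast<InternT>(0x10FFFF);
    }
};
```

With this, utf32_facet_sketch<char32_t> compiles, while
utf32_facet_sketch<unsigned char> is rejected at compile time.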

(future directions) The facets could use template policies, for example
to customize error handling (for instance, if a non-character is
encountered the conversion may either signal an error or ignore the
non-character).
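
One possible shape for such a policy is sketched below. Everything here
is hypothetical (the policy names, the handle() signature, and the
copy_code_points function, which stands in for the real conversion
loop); it only shows how the two behaviours mentioned above could be
selected at compile time.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Unicode non-characters: U+FDD0..U+FDEF and the last two code points
// of every plane (U+xxFFFE and U+xxFFFF).
inline bool is_noncharacter(std::uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

// Hypothetical policy: report an error and stop the conversion.
struct fail_on_noncharacter {
    static bool handle(std::uint32_t) { return false; }
};

// Hypothetical policy: silently drop the non-character and continue.
struct ignore_noncharacter {
    static bool handle(std::uint32_t) { return true; }
};

// Stand-in for a conversion loop. Returns false if the policy signals
// an error; code points accepted before the error remain in out.
template <typename ErrorPolicy>
bool copy_code_points(const std::vector<std::uint32_t>& in,
                      std::vector<std::uint32_t>& out)
{
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (is_noncharacter(in[i])) {
            if (!ErrorPolicy::handle(in[i]))
                return false;   // policy asked to signal an error
            continue;           // policy asked to skip this code point
        }
        out.push_back(in[i]);
    }
    return true;
}
```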

The library is explicitly directed to platforms where char is an 8-bit
type. Support for other platforms can be included in subsequent
revisions, according to interest.

Is there any interest in this proposal? Any feedback is appreciated.

Alberto Barbati

