From: Vladimir Prus (ghost_at_[hidden])
Date: 2002-11-05 00:44:21


Alberto Barbati wrote:
> Hi Boosters,

Hi Alberto,

> I read in the list archives that there was a proposal by Vladimir Prus
> about codecvt facets for utf8-ucs2 and ucs2-ucs2 conversion. The
> proposal dates back to last year. I wonder what happened to it.

Nothing. AFAIK it was used in an XML parser draft but nowhere else. I did
not have the time or inclination to submit it. So, if you want to pick it
up or write your own, then go ahead! Of course, I'll try to provide some
feedback.

> However, I would like to up the ante and propose a much wider choice of
> codecvt facets that could be used to effectively process Unicode text
> files. This new proposal aims to be fully conformant to Unicode 3.2
> requirements. Therefore, I will refer to the utf-8, utf-16 and utf-32
> encodings of Unicode code points, disregarding the ucs-2 and ucs-4
> counterparts.

I suspect you know more about Unicode than I. What's the difference
between utf-32 and ucs-4? Between utf-16 and ucs-2?

> The proposal shall include facets to convert:
>
> external      internal
> utf-8     ->  utf-16* (BMP only - no surrogates)
> utf-8     ->  utf-16* (all planes - surrogates allowed)
> utf-16LE  ->  utf-16
> utf-16BE  ->  utf-16
> utf-16**  ->  utf-16
> utf-8     ->  utf-32
> utf-16LE  ->  utf-32
> utf-16BE  ->  utf-32
> utf-16**  ->  utf-32
> utf-32    ->  utf-32

1. What is desirable is a "universal" unicode->wchar_t facet that
    would use some magic to detect the encoding of the input stream. Do
    you have any idea how we can achieve this?

2. If wchar_t is 32 bits wide, then to use a "* -> utf-16" facet one would
    need to use streams with a custom character type, right? This feels bad
    to me -- either we should use different stream types depending on the
    compiler, or have one 32-bit typedef -- and use all stream types
    with that typedef for char. Personally, I will not care about portability
    and use "wchar_t" everywhere.
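
To make the two options in point 2 concrete, here is a minimal sketch. It assumes a C++11 compiler (for char32_t and constexpr), and the names wchar_holds_any_code_point, ucs4_char and to_internal are hypothetical, made up for illustration only. It shows a check for whether wchar_t is wide enough, plus the "one 32-bit typedef" alternative.

```cpp
#include <cassert>
#include <climits>
#include <cwchar>

// Hypothetical names for illustration only.

// True when wchar_t can represent every Unicode code point (up to U+10FFFF).
// Typically true where wchar_t is 32 bits wide, false where it is 16 bits.
constexpr bool wchar_holds_any_code_point()
{
    return WCHAR_MAX >= 0x10FFFF;
}

// The "one 32-bit typedef" option: a single internal character type with
// the same meaning on every compiler. char32_t requires C++11.
typedef char32_t ucs4_char;

static_assert(sizeof(ucs4_char) * CHAR_BIT >= 32,
              "ucs4_char must be at least 32 bits wide");

// Convert a numeric code point to the internal character type.
inline ucs4_char to_internal(unsigned long cp)
{
    assert(cp <= 0x10FFFFul);  // outside the Unicode code space otherwise
    return static_cast<ucs4_char>(cp);
}
```

Note that streams over a character type other than char or wchar_t generally need additional facets to be supplied, which is part of why the typedef route is more work than simply using wchar_t.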

> Notes:
> (*) There are two utf-8 -> utf-16 facets because a 4-byte utf-8 code
> unit sequence is mapped to a utf-16 surrogate pair. If the application
> won't handle surrogates anyway, it can opt for a more optimized facet
> (such processing is probably not conformant, is this "optimization"
> really needed?)

Sorry for asking such questions, but you're the best person to answer them.
Does any application use utf-16 and surrogates? Wouldn't 32-bit values be
much easier to handle?
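
For reference, the surrogate mechanism in question is just arithmetic: a code point above U+FFFF is reduced by 0x10000 and the remaining 20 bits are split across two 16-bit units. The sketch below is only an illustration of that mapping, with hypothetical helper names (is_high_surrogate, is_low_surrogate, to_surrogates, from_surrogates); it is not taken from the proposed facets.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper names, for illustration only.

inline bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Split a supplementary-plane code point into a (high, low) surrogate pair.
inline std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    return std::make_pair(static_cast<std::uint16_t>(0xD800 + (v >> 10)),
                          static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
}

// Combine a (high, low) surrogate pair back into one 32-bit code point.
inline std::uint32_t from_surrogates(std::uint16_t hi, std::uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000
         + ((static_cast<std::uint32_t>(hi) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00);
}
```

For example, to_surrogates(0x1D11E) gives the pair 0xD834, 0xDD1E, and from_surrogates(0xD834, 0xDD1E) gives back 0x1D11E. With a 32-bit internal type none of this is needed, which is the point of the question.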

> (**) external utf-16 facets will detect an initial BOM (U+FEFF) to
> select the endian-ness of the external stream (what to do if there is no
> BOM? default to the endian-ness of the platform?)

Ah... that's what I was talking about above. But what if we want to handle
XML, for example? It can come in utf8 or utf16, and some hooks are desirable
for detecting it. I would not like to ask the programmer to imbue the
appropriate facet manually.
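
One possible shape for such a hook is a function that looks at the first few bytes for a byte order mark and reports which encoding it implies. This is a rough sketch only: the Encoding enum and detect_bom are hypothetical names, and a real XML reader would also need to handle input with no BOM at all, which this does not attempt.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names; not part of any Boost or standard library API.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect up to the first four bytes of a buffer for a byte order mark.
// Returns Encoding::Unknown when no BOM is recognised.
inline Encoding detect_bom(const unsigned char* p, std::size_t n)
{
    assert(p != nullptr || n == 0);
    // UTF-32 is tested before UTF-16 because FF FE 00 00 starts with FF FE.
    // (That sequence could also be a UTF-16LE BOM followed by U+0000; this
    // sketch resolves the ambiguity in favour of UTF-32LE.)
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Encoding::Utf32LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Encoding::Utf32BE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Encoding::Utf8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Encoding::Utf16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Encoding::Utf16BE;
    return Encoding::Unknown;
}
```

The caller would read the first bytes of the stream, call detect_bom, and imbue the matching facet, so the choice is made once by the library rather than by every programmer.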

> The library is explicitly directed to platforms where char is an 8-bit
> type. Support for other platforms can be included in subsequent
> revisions, according to interest.
>
> Is there any interest in this proposal? Any feedback is appreciated.

Yes, there is.

- Volodya

