From: Alberto Barbati (abarbati_at_[hidden])
Date: 2002-11-05 16:22:47


Vladimir Prus wrote:

> Nothing. AFAIK it was used in XML parser draft but nowhere else. I did
> not have time/inclination to submit it. So, if you want to pick it up/
> write your own, then go ahead! Of course, I'll try to provide some
> feedback.

Thanks for your support!

> I suspect you know more about Unicode than I. What's the difference
> between utf-32 and ucs-4? Between utf-16 and ucs-2?

ucs-2 and ucs-4 are defined by ISO/IEC 10646, while utf-16 and utf-32
are defined by Unicode. There is an ongoing effort to harmonize the two
standards; for example, there is now a formal agreement to keep the same
code values. However, there are a few small semantic differences. The most
important one is that ucs-2 is a strict 16-bit encoding of the BMP
(basic multilingual plane, that is, all characters with a code point in the
range 0000-FFFF), while utf-16 encodes, through surrogate pairs, all
Unicode characters.
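
To make the surrogate mechanism concrete, here is a minimal sketch of how a single code point maps to utf-16 code units. The function name encode_utf16 is hypothetical and not part of any existing Boost library; the sketch assumes a compiler that provides <cstdint>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (illustration only): encodes one Unicode code point
// as utf-16 and returns the number of code units written (1 or 2).
// The caller must provide room for two units in `out`.
inline std::size_t encode_utf16(std::uint_least32_t cp, std::uint_least16_t* out)
{
    // Precondition: cp is in range and is not itself a surrogate value.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        // BMP character: a single code unit, same value as in ucs-2.
        out[0] = static_cast<std::uint_least16_t>(cp);
        return 1;
    }

    // Supplementary character: split the 20-bit offset into two halves.
    cp -= 0x10000;
    out[0] = static_cast<std::uint_least16_t>(0xD800 + (cp >> 10));   // high surrogate
    out[1] = static_cast<std::uint_least16_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
    return 2;
}
```

A BMP character such as U+20AC takes one unit, while U+1D11E becomes the pair D834 DD1E, which a strict ucs-2 encoder could not represent.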

> 1. What is desirable is "universal" unicode->wchar_t facet, that
> would use some magic to detect encoding of input stream. Do
> you have any idea how we can achieve this?

It can be done, as long as the external stream begins with a BOM
(U+FEFF). In that case, the first 4 bytes of the stream unambiguously
determine the external encoding. If the external stream has no BOM, the
encoding can only be determined heuristically...
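
A minimal sketch of the BOM check, assuming the first bytes of the stream have already been read into a buffer. The enum and the function name detect_bom are hypothetical. Note one assumption: the bytes FF FE 00 00 are read as a utf-32 little-endian BOM, which presumes that a utf-16 little-endian stream never starts with a BOM followed by U+0000.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type (illustration only).
enum encoding {
    enc_unknown, enc_utf8, enc_utf16_be, enc_utf16_le, enc_utf32_be, enc_utf32_le
};

// Inspects up to the first 4 bytes of a stream and reports the encoding
// implied by its BOM, or enc_unknown when no BOM is present.
inline encoding detect_bom(const unsigned char* p, std::size_t n)
{
    assert(p != nullptr || n == 0);

    // The 4-byte BOMs are tested first, because the utf-32 little-endian
    // BOM begins with the same two bytes as the utf-16 little-endian one.
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return enc_utf32_be;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return enc_utf32_le;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return enc_utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return enc_utf16_be;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return enc_utf16_le;

    // No BOM: the caller would have to fall back to heuristics.
    return enc_unknown;
}
```

A "universal" facet could run such a check once on the first read and then delegate to the matching conversion; that delegation is not shown here.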

> 2. If wchar_t is 32 bits wide, then to use "* -> utf-16" facet one would
> need to use streams with custom character type, right? This feels bad
> to me -- either we should use different stream types depending on
> compiler, or have one 32-bit typedef -- and use all stream types
> with that typedef for char. Personally, I will not care about
> portability and use "wchar_t" anywhere.

I do care about portability and would like to avoid mentioning wchar_t
explicitly. Templates make this possible. The application is free to use
wchar_t as it pleases, provided it has the required number of bits.

If wchar_t is 32 bits wide and you want to use a conversion to utf-16,
you don't need to define streams with a custom character type, as in this
case a wchar_t is perfectly capable of containing a utf-16 code unit.
The internal type need not have *exactly* the required number of bits;
it just needs to have *at least* that many.

The problem is actually the opposite: when your platform has a 16-bit
wchar_t and you want utf-32... In this case you need to define streams over
a custom character type, for example boost::uint_least32_t. I don't know
which platform you work on, but on the ones that I usually work
on, this is the most common case. That's why I don't want to mention
wchar_t explicitly.
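
As an illustration of the "at least that many bits" rule, here is a minimal sketch of how an application could choose its internal character type at compile time. The names wchar_bits, code_unit, utf16_unit and utf32_unit are hypothetical, and the sketch uses the standard <cstdint> and <type_traits> headers in place of their Boost counterparts.

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// Width of wchar_t on the current platform, in bits.
constexpr int wchar_bits = static_cast<int>(sizeof(wchar_t) * CHAR_BIT);

// Hypothetical trait (illustration only): yields wchar_t when it has at
// least `Bits` bits, and the supplied fallback type otherwise.
template <int Bits, typename Fallback>
struct code_unit
{
    typedef typename std::conditional<(wchar_bits >= Bits),
                                      wchar_t, Fallback>::type type;
};

// wchar_t is wide enough for utf-16 wherever it has 16 bits or more.
typedef code_unit<16, std::uint_least16_t>::type utf16_unit;

// With a 16-bit wchar_t this falls back to a 32-bit integer type.
typedef code_unit<32, std::uint_least32_t>::type utf32_unit;

static_assert(sizeof(utf16_unit) * CHAR_BIT >= 16, "utf-16 unit type too narrow");
static_assert(sizeof(utf32_unit) * CHAR_BIT >= 32, "utf-32 unit type too narrow");
```

The conversion code itself would then be a template over the character type and would never name wchar_t. Streams over a non-standard character type also need suitable character traits, which this sketch does not cover.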

> Sorry for asking such question, but you're the best person to answer them.
> Does any application use utf-16 and surrogates? Won't 32-bits values be
> much more easy to handle?

According to this document
http://www.unicode.org/iuc/iuc18/papers/a8.ppt all new Microsoft products
use surrogates. I guess there are a lot more, and that there will be
plenty as soon as the concept gets more widely accepted by the
community. Having libraries that support this concept is a step in that
direction. Remember that utf-32 is a real waste of memory and there are
platforms (for example embedded platforms and cellular phones) that may
not be able to afford such waste. Of course, in the absence of specific
constraints, I agree that using utf-32 is a better option.

>> Is there any interest in this proposal? Any feedback is appreciated.
>
> Yes, there is.

Thank you, I appreciate that.

Alberto


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk