From: Reece Dunn (msclrhd_at_[hidden])
Date: 2004-04-17 07:29:52

Marshall Clow wrote:
>>>"Reece Dunn" writes:
>>>> RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and
>>>> UTF-16 are multi-character encodings of UTF-32 (not considering
>>>> combining marks at this stage), whereas UTF-32 is a single character
>>>> encoding.

>I'm pretty sure that this is a bad assumption.

Why is this a bad assumption?

At the unicode_string level, we are talking about individual Unicode
characters as specified by the Unicode standard. As an example, U+0020
(space) can be represented in a single code unit in all encodings; U+2192
(rightwards arrow) requires 3 bytes in UTF-8; U+1D5xx (I think these are the
Fraktur characters) require 4 bytes in UTF-8, 2 code units in UTF-16 and 1
in UTF-32.
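To make the per-encoding sizes concrete, here is a minimal sketch that counts the code units one code point needs in each encoding. The function names (utf8_length, utf16_length, utf32_length) are made up for illustration and are not part of any proposed unicode_string interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of 8-bit code units (bytes) needed to encode one code point in UTF-8.
inline std::size_t utf8_length(std::uint32_t cp)
{
   assert(cp <= 0x10FFFF); // Unicode code points end at U+10FFFF
   if (cp < 0x80)    return 1;
   if (cp < 0x800)   return 2;
   if (cp < 0x10000) return 3;
   return 4;
}

// Number of 16-bit code units needed in UTF-16 (2 means a surrogate pair).
inline std::size_t utf16_length(std::uint32_t cp)
{
   assert(cp <= 0x10FFFF);
   return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one 32-bit code unit per code point.
inline std::size_t utf32_length(std::uint32_t)
{
   return 1;
}
```

With these, utf8_length(0x2192) is 3, and a code point outside the Basic Multilingual Plane such as U+1D504 needs 4 UTF-8 bytes, 2 UTF-16 code units and 1 UTF-32 code unit.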

Treating a Unicode string as a virtual UTF-32 string (no matter what the
underlying encoding is) makes it easier to use at a higher level, because
you are dealing with the characters as they are represented in the Unicode
tables. This makes it easier if there are mixed-width characters in the
string, for example:
   U+300A hello U+300B ==> [<<] hello [>>]
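As a rough illustration of the "virtual UTF-32" idea, the sketch below decodes a UTF-8 byte string into a sequence of code points, so the caller sees one entry per character regardless of how many bytes each one occupies. The name decode_utf8 is hypothetical, and the code assumes well-formed input: it does no validation of overlong forms, surrogates or stray continuation bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode well-formed UTF-8 into a vector of code points (a UTF-32 view).
// Assumption: the input is valid UTF-8; malformed input is not detected.
inline std::vector<std::uint32_t> decode_utf8(const std::string & s)
{
   std::vector<std::uint32_t> out;
   for (std::size_t i = 0; i < s.size(); )
   {
      unsigned char lead = static_cast<unsigned char>(s[i]);

      // The lead byte determines the length of the sequence.
      std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;

      // Keep only the payload bits of the lead byte.
      std::uint32_t cp = lead;
      if (len > 1)
         cp = lead & (0xFFu >> (len + 1));

      // Each continuation byte contributes its low 6 bits.
      for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
         cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3Fu);

      out.push_back(cp);
      i += len;
   }
   return out;
}
```

For the bracketed "hello" example, the 13-byte UTF-8 string decodes to 9 code points, with U+300A first and U+300B last. A real unicode_string would presumably expose this through an iterator rather than by copying into a vector.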

>You can't just ignore combining characters.

I am not ignoring combining characters. All I'm saying is that dealing with
grapheme clusters at this stage makes processing Unicode strings too
complex. They should be treated as a view *on top of the underlying
unicode_string representation*.

>I believe that Miro posted an example of how (even using UTF-32), you
>may not have a single character <<-->> single "entry" mapping.

I understand that now (see my other post), but dealing with it all at one
level would make the interface too complex and would become too difficult to
manage. You could have something like:

struct grapheme_cluster:
   public std::pair< unicode_string::utf32_iterator,
                     unicode_string::utf32_iterator >
{
   inline grapheme_cluster( unicode_string & us ):
      std::pair< unicode_string::utf32_iterator,
                 unicode_string::utf32_iterator >
      ( us.utf32_begin(), us.utf32_end())
   {
   }

   // true if the cluster is a single character with no combining marks
   inline bool is_single() const
   {
      return( first == second );
   }

   // the base (primary) character of the cluster
   inline unicode_string::utf32_t get_base() const
   {
      return( *first );
   }

   bool advance(); // implementation defined; false iff end of string
};

NOTE: if is_single() is true, then get_base() will be the value of the
Unicode character, otherwise it is the primary character with the combining
characters removed.
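To show how such a view could sit on top of a UTF-32 sequence, here is a simplified sketch that splits code points into clusters of one base character plus any following combining marks. Everything in it is an assumption for illustration: utf32_string is a stand-in for unicode_string, and is_combining only recognises the Combining Diacritical Marks block (U+0300 to U+036F), which is far less than real grapheme cluster segmentation requires.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a unicode_string seen through its UTF-32 view.
typedef std::vector<std::uint32_t> utf32_string;

// Simplification: only the Combining Diacritical Marks block is recognised.
inline bool is_combining(std::uint32_t cp)
{
   return cp >= 0x0300 && cp <= 0x036F;
}

// Returns one [begin, end) index range per cluster: a base character
// followed by zero or more combining marks.
inline std::vector< std::pair<std::size_t, std::size_t> >
split_clusters(const utf32_string & s)
{
   std::vector< std::pair<std::size_t, std::size_t> > out;
   std::size_t i = 0;
   while (i < s.size())
   {
      std::size_t j = i + 1;
      while (j < s.size() && is_combining(s[j]))
         ++j;
      out.push_back(std::make_pair(i, j));
      i = j;
   }
   return out;
}
```

For the sequence U+0065 U+0301 U+0061 ("e", combining acute, "a") this yields two clusters: the first covers the "e" and its accent, the second the "a" alone. The underlying string is never modified, which is the point of keeping clusters as a separate layer.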

