From: Daryle Walker (darylew_at_[hidden])
Date: 2004-04-25 13:14:30

On 4/13/04 3:27 PM, "Miro Jurisic" <macdev_at_[hidden]> wrote:

> In article <00c101c42167$8d0a7f40$1b440352_at_fuji>,
> "John Maddock" <john_at_[hidden]> wrote:
>> However I think we're getting ahead of ourselves here: I think a Unicode
>> library should be handled in stages:
>> 1) define the data types for 8/16/32 bit Unicode characters.
> The fact that you believe this is a reasonable first step leads me to believe
> that you have not given much thought to the fact that even if you use a 32-bit
> Unicode encoding, a character can take up more than 32 bits (and likewise for
> 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any
> encoding.

Unicode code-points fit in 31-bit values. The 8- and 16-bit encodings just
encode the 32-bit one. We could base Unicode strings only around the
code-points.
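
As a rough illustration of what "encode" means here, the sketch below turns one code-point into 16-bit code units using the UTF-16 surrogate scheme. The helper name to_utf16 is made up for this example, and it assumes the input is a valid code-point no larger than 0x10FFFF that is not itself a surrogate; no validation is done.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of any proposed library): encode one
// code-point as UTF-16 code units.
// Assumption: cp <= 0x10FFFF and cp is not in the surrogate range.
std::vector<std::uint_least16_t> to_utf16(std::uint_least32_t cp)
{
    std::vector<std::uint_least16_t> out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out.push_back(static_cast<std::uint_least16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        // Top 10 bits go into the high surrogate, low 10 bits into the low one.
        out.push_back(static_cast<std::uint_least16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<std::uint_least16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

So a single code-point already takes one or two units in the 16-bit form, which is why the encoded forms are not fixed-width.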

It may be better to use abstract Unicode characters instead. However, each
abstract character can be made up of a variable number of code-points. Worse,
there can be several ways of expressing the same abstract character (that's
why there are normalization standards).
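
To make the "several ways" point concrete, here is a small sketch using e-with-acute, which can be written either as the single code-point U+00E9 or as U+0065 followed by the combining acute accent U+0301. The type and function names are invented for illustration only; the comparison is a plain element-by-element one, with no normalization.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical alias: a sequence of code-points.
typedef std::vector<std::uint_least32_t> code_point_seq;

// Precomposed form: LATIN SMALL LETTER E WITH ACUTE.
code_point_seq e_acute_precomposed()
{
    return code_point_seq{0x00E9};
}

// Decomposed form: LATIN SMALL LETTER E + COMBINING ACUTE ACCENT.
code_point_seq e_acute_decomposed()
{
    return code_point_seq{0x0065, 0x0301};
}

// Naive equality: compares code-points one by one, no normalization.
bool same_code_points(const code_point_seq& a, const code_point_seq& b)
{
    return a == b;
}
```

The two sequences denote the same abstract character but differ in both length and content, so a code-point-level comparison reports them as different unless both sides are normalized first.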

Maybe we can have:

struct unicode_code_point { int_least32_t c; };

struct unicode_code_point_traits { /* like char_traits */ };

struct unicode_abstract_character {
    int_least32_t main_char; // can there be co-main characters?
    std::size_t helper_count; // length of following array
    int_least32_t *helper_chars; // dynamic array of combiners
};

struct unicode_abstract_character_traits { /* like char_traits, but much
more complicated */ };

Recall that character types must be POD, so all the smarts have to go into
the traits class.
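
A minimal sketch of that split, assuming a modern compiler for the static_assert check: the character type stays a bare struct, and comparison and length live in the traits. The three members shown (eq, lt, length) mirror the like-named std::char_traits members; which members a real Unicode traits class would need, and how they would behave, is an open question here.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// The character type itself: plain data, no constructors or member functions.
struct unicode_code_point { std::int_least32_t c; };

// Check the "must be POD" requirement at compile time.
static_assert(std::is_trivial<unicode_code_point>::value &&
              std::is_standard_layout<unicode_code_point>::value,
              "character types must be POD");

// All behaviour goes here. Only a few char_traits-like members are sketched.
struct unicode_code_point_traits {
    typedef unicode_code_point char_type;

    static bool eq(const char_type& a, const char_type& b) { return a.c == b.c; }
    static bool lt(const char_type& a, const char_type& b) { return a.c < b.c; }

    // Length of a sequence terminated by a zero code-point.
    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n].c != 0) ++n;
        return n;
    }
};
```

The code-point case is easy because each character is one fixed-size value. The abstract-character version is where "much more complicated" comes in, since eq would have to account for equivalent sequences.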

Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
