From: Daryle Walker (darylew_at_[hidden])
Date: 2004-04-25 13:14:30


On 4/13/04 3:27 PM, "Miro Jurisic" <macdev_at_[hidden]> wrote:

> In article <00c101c42167$8d0a7f40$1b440352_at_fuji>,
> "John Maddock" <john_at_[hidden]> wrote:
[SNIP]
>> However I think we're getting ahead of ourselves here: I think a Unicode
>> library should be handled in stages:
>>
>> 1) define the data types for 8/16/32 bit Unicode characters.
>
> The fact that you believe this is a reasonable first step leads me to believe
> that you have not given much thought to the fact that even if you use a 32-bit
> Unicode encoding, a character can take up more than 32 bits (and likewise for
> 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any
> encoding.
[TRUNCATE]

Unicode code points fit in 31-bit values (21 bits suffice for the defined
range, U+0000 through U+10FFFF). The 8- and 16-bit encodings just encode the
same code points as the 32-bit one. We could base a Unicode string solely on
code points.
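
To make "just encode" concrete, here is a minimal sketch of turning one code
point into UTF-8 bytes. The helper name encode_utf8 is hypothetical and not
part of any proposed interface, and the code assumes a compiler that provides
<cstdint>. Values that are not valid scalar values (surrogates, or anything
above U+10FFFF) yield an empty string in this sketch.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as a UTF-8 byte sequence.
// Returns an empty string for surrogates and for values above U+10FFFF.
inline std::string encode_utf8(std::uint_least32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // One byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;  // surrogate code points are not encodable
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The same code point value goes in regardless of how many bytes come out, which
is why a string type built on code points would not need to care about the
storage encoding.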

It may be better to use abstract Unicode characters instead. However, each
abstract character can be made up of a variable number of code points. Worse,
there can be several ways of expressing the same abstract character (that's
why there are normalization standards).
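
As a small illustration of the "several ways" problem, the sketch below spells
"e with acute accent" twice: once as the precomposed code point U+00E9, and
once as U+0065 followed by the combining acute accent U+0301. The type and
function names are made up for this example. A plain element-wise comparison
treats the two as different, which is the gap normalization is meant to close.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical alias: a sequence of code points.
typedef std::vector<std::uint_least32_t> code_point_seq;

// "e with acute" as a single precomposed code point, U+00E9.
inline code_point_seq e_acute_precomposed()
{
    code_point_seq s;
    s.push_back(0x00E9);
    return s;
}

// The same abstract character as base letter U+0065 plus combining
// acute accent U+0301.
inline code_point_seq e_acute_decomposed()
{
    code_point_seq s;
    s.push_back(0x0065);
    s.push_back(0x0301);
    return s;
}

// Naive comparison: element by element, with no normalization step.
inline bool same_code_points(const code_point_seq& a, const code_point_seq& b)
{
    return a == b;
}
```

Calling same_code_points on the two sequences returns false even though a
reader would see the same character, so any comparison at the abstract
character level has to normalize first.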

Maybe we can have:

struct unicode_code_point { int_least32_t c; };

struct unicode_code_point_traits { /* like char_traits */ };

struct unicode_abstract_character
{
    int_least32_t main_char; // can there be co-main characters?
    std::size_t helper_count; // length of following array
    int_least32_t *helper_chars; // dynamic array of combiners
};

struct unicode_abstract_character_traits { /* like char_traits, but much
more complicated */ };

Recall that character types must be POD, so all the smarts have to go into
the traits class.
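
A rough sketch of what that split could look like for the code-point case
follows. It only covers a handful of the char_traits members (assign, eq, lt,
length, compare), the member choices are my assumption rather than a worked-out
design, and the static_assert relies on C++11 <type_traits> to check the POD
requirement as "trivial and standard-layout".

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// The character type itself carries no behavior.
struct unicode_code_point { std::int_least32_t c; };

// C++11 spelling of "POD": trivial and standard-layout.
static_assert(std::is_trivial<unicode_code_point>::value &&
              std::is_standard_layout<unicode_code_point>::value,
              "unicode_code_point should stay a POD");

// Partial, hypothetical traits class modeled on std::char_traits.
struct unicode_code_point_traits
{
    typedef unicode_code_point char_type;
    typedef std::int_least32_t int_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(const char_type& a, const char_type& b) { return a.c == b.c; }
    static bool lt(const char_type& a, const char_type& b) { return a.c < b.c; }

    // Length of a sequence terminated by code point 0.
    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n].c != 0) ++n;
        return n;
    }

    // Three-way comparison of the first n elements, by code point value.
    static int compare(const char_type* a, const char_type* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
};
```

The abstract-character version would need far more than this, since its
comparison has to account for combining sequences and normalization.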

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
