From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2008-03-10 17:31:31
Graham wrote:
> As Unicode characters that are not in page zero can require more than
> 32 bits to encode them [yes really]
Unless you're talking about grapheme clusters or composite characters
(are they the same thing?), not in Unicode 5. No Unicode code point
needs more than one UTF-32 code unit, more than two UTF-16 code units (a
surrogate pair), or more than four UTF-8 code units (11110www 10xxxxxx
10yyyyyy 10zzzzzz, for a total of 21 bits).
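To make the counts concrete, here is a tiny stand-alone C++ sketch (just
the raw bit arithmetic, not code from any library under discussion) that
encodes the supplementary-plane code point U+10437 as one surrogate pair
and as four UTF-8 code units:

#include <boost/cstdint.hpp>
#include <cstdio>

int main()
{
    boost::uint32_t cp = 0x10437;  // still only one UTF-32 code unit

    // UTF-16: subtract 0x10000 and split the remaining 20 bits into a surrogate pair.
    boost::uint32_t v = cp - 0x10000;
    unsigned high = 0xD800 + (v >> 10);    // 0xD801
    unsigned low  = 0xDC00 + (v & 0x3FF);  // 0xDC37

    // UTF-8: 11110www 10xxxxxx 10yyyyyy 10zzzzzz carries the full 21 bits.
    unsigned u8[4] = {
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
    };

    std::printf("UTF-16: %04X %04X\n", high, low);                             // D801 DC37
    std::printf("UTF-8:  %02X %02X %02X %02X\n", u8[0], u8[1], u8[2], u8[3]);  // F0 90 90 B7
}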
> The only way I have found of handling this is to base the string
> functions on a proper Unicode character support library according to
> the Unicode spec.
>
> This means that you need character movement support, grapheme support,
> and sorting support.
There are several issues here. One is the ability to store text in some
encoding, and to convert it to Unicode code points or a different encoding.

The second issue is the ability to process this text. This brings in the
Unicode algorithms like Collation.

The third issue is the ability to display this text. We're talking BIDI
support and, if I understand the term correctly, character movement. (Is
this about moving the caret from grapheme to grapheme, taking into
account BIDI and ligatures?)

The nice thing is that the dependencies go strictly upwards. Storing
doesn't depend on processing, and processing doesn't depend on
displaying. So it's possible to take these one step at a time.
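To illustrate the layering, here is a rough, hypothetical sketch (the
names decode_utf8 and compare_text are mine, not an interface from
Graham's library): the storage layer turns bytes into code points, the
processing layer works purely on code points, and display would sit on
top of both. The comparison below is only a placeholder for where real
Collation would plug in.

#include <boost/cstdint.hpp>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Layer 1: storage/conversion -- a minimal UTF-8 -> code point decoder.
// A real one must also reject overlong forms, surrogates and bad
// continuation bytes.
std::vector<boost::uint32_t> decode_utf8(std::string const& s)
{
    std::vector<boost::uint32_t> out;
    for (std::string::size_type i = 0; i < s.size(); )
    {
        unsigned char b = static_cast<unsigned char>(s[i]);
        int extra = b < 0x80           ? 0
                  : (b & 0xE0) == 0xC0 ? 1
                  : (b & 0xF0) == 0xE0 ? 2
                  : (b & 0xF8) == 0xF0 ? 3 : -1;
        if (extra < 0 || i + extra >= s.size())
            throw std::runtime_error("malformed UTF-8");
        boost::uint32_t cp = extra == 0 ? b : (b & (0x3F >> extra));
        for (int k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Layer 2: processing -- plain code point order as a stand-in for the
// Unicode Collation Algorithm, which would replace it later.
int compare_text(std::vector<boost::uint32_t> const& a,
                 std::vector<boost::uint32_t> const& b)
{
    return a < b ? -1 : a == b ? 0 : 1;
}

// Layer 3: display (BIDI, caret movement) would build on the two layers
// above and is deliberately left out here.

int main()
{
    std::vector<boost::uint32_t> text = decode_utf8("caf\xC3\xA9"); // "café"
    std::printf("last code point: U+%04X\n", static_cast<unsigned>(text.back())); // U+00E9
    std::printf("compare with itself: %d\n", compare_text(text, text));           // 0
}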
> As I said to Phil, Rogier and I completed a Unicode character library
> for release under Boost, but never submitted it to Boost as we had
> intended to release it with a string library built on it, and never had
> time to do the second part of the work.
Post it, and we'll do the second part. It's open-source.
Sebastian