From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2004-04-13 15:09:17


Miro Jurisic <macdev_at_[hidden]> writes:

> [snip]

> You are forgetting that abstract Unicode characters are defined as sequences of
> code points (even if those code points are 32-bit) and string manipulation has
> to take this into account (there are numerous combinations of characters and
> combining marks that must be treated as single units for purpose of searching,
> collation, etc.) A single encoded character type may be 32 bits, but encoded
> characters are often not the level on which the clients need to manipulate
> strings.

Right, it will certainly be necessary to provide a
grapheme_cluster_iterator (with value_type = the Unicode string
type). ICU should help with this. Nonetheless, it is useful to
represent a single code point, for several reasons:

 - For the purpose of string construction, the Unicode specification
   explicitly states that any sequence of code points is well formed,
   so a code point is the smallest unit from which
   guaranteed-well-formed strings can be built.

 - It would be useful to provide functions for querying the Unicode
   properties of individual code points, and this code_point type
   would be the only suitable parameter type.
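
As a rough sketch of the two points above (not a proposed interface): the
code_point class, the code_point_string typedef, and both query functions
below are hypothetical names made up for illustration. The range checks
stand in for real property lookups, which would come from the Unicode
Character Database, for example through ICU.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical code_point type: holds exactly one Unicode code point,
// i.e. a value in the range U+0000..U+10FFFF.
class code_point {
public:
    code_point() : value_(0) {}
    explicit code_point(std::uint32_t v) : value_(v) {
        if (v > 0x10FFFF)
            throw std::out_of_range("not a Unicode code point");
    }
    std::uint32_t value() const { return value_; }

private:
    std::uint32_t value_;
};

// Hypothetical string type: any sequence of code_point values.
// Appending one code_point at a time can never produce a malformed
// sequence, because each element was validated on construction.
typedef std::vector<code_point> code_point_string;

// Stand-in property query: true for the surrogate code points
// U+D800..U+DFFF (General_Category = Cs).
inline bool is_surrogate(code_point c) {
    return c.value() >= 0xD800 && c.value() <= 0xDFFF;
}

// Stand-in property query: true if the code point lies in the
// Combining Diacritical Marks block, U+0300..U+036F.
inline bool in_combining_diacritical_marks_block(code_point c) {
    return c.value() >= 0x0300 && c.value() <= 0x036F;
}
```

With this, code_point(0x65) followed by code_point(0x301) can be appended
to a code_point_string one element at a time, even though the two together
form a single grapheme cluster ("e" with an acute accent). Searching and
collation would still go through the grapheme-cluster level.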

I do agree, however, that for almost any output formatting, the
locale-specific or user-specified fill text/symbols should be specified
as strings, rather than as individual characters.
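
To illustrate the fill-as-string idea, here is a minimal sketch. pad_left
is a hypothetical helper, not an existing Boost or standard function, and
it measures width in code points purely to keep the example short. Real
formatting would more likely measure grapheme clusters or display columns.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: left-pad `text` to at most `width` code points,
// repeating `fill` as an indivisible unit. `fill` is a string rather
// than a single character, so it may hold a base character plus
// combining marks, or several characters.
inline std::u32string pad_left(const std::u32string& text,
                               std::size_t width,
                               const std::u32string& fill) {
    if (fill.empty() || text.size() >= width)
        return text;
    std::u32string result;
    // Add only whole copies of `fill`, so a fill unit is never split.
    while (result.size() + fill.size() + text.size() <= width)
        result += fill;
    result += text;
    return result;
}
```

Because only whole copies of the fill are used, the result can come out
shorter than the requested width when the fill is longer than one code
point. That seems preferable to splitting the fill in the middle.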

-- 
Jeremy Maitin-Shepard
