From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2004-04-13 15:09:17
Miro Jurisic <macdev_at_[hidden]> writes:
> [snip]
> You are forgetting that abstract Unicode characters are defined as sequences of
> code points (even if those code points are 32-bit) and string manipulation has
> to take this into account (there are numerous combinations of characters and
> combining marks that must be treated as single units for purpose of searching,
> collation, etc.) A single encoded character type may be 32 bits, but encoded
> characters are often not the level on which the clients need to manipulate
> strings.
Right, it will certainly be necessary to provide a
grapheme_cluster_iterator (with value_type = the Unicode string
type). ICU should help with this. Nonetheless, it is useful to
represent a single code point, for several reasons:
 - For the purpose of string construction, the Unicode specification
   explicitly states that any sequence of code points is well formed,
   and so this provides the smallest unit by which
   guaranteed-well-formed strings can be formed.

 - It would be useful to provide functions for querying the Unicode
   properties of individual code points, and this code_point type
   would be the only suitable parameter type.
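
To make the two points above concrete, here is a rough C++ sketch. Everything in it is an assumption made for illustration, not a proposed interface: the code_point struct, the code_point_string alias, and the append, is_combining_mark and count_clusters helpers are hypothetical names. The property query only knows about the Combining Diacritical Marks block (U+0300..U+036F); a real implementation would consult the Unicode Character Database, for example through ICU, and would follow the full grapheme cluster rules.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical code_point type: holds a single Unicode code point
// (a value in the range 0..0x10FFFF).
struct code_point {
    char32_t value;
};

// Assumed representation for this sketch only: a string is simply a
// sequence of code points.
using code_point_string = std::vector<code_point>;

// Appends one code point to the string. The code point is the smallest
// unit of construction here; the assert only checks that the value lies
// inside the Unicode code space.
inline void append(code_point_string& s, code_point cp) {
    assert(cp.value <= 0x10FFFF);
    s.push_back(cp);
}

// Toy property query taking a code_point parameter. It returns true only
// for the Combining Diacritical Marks block (U+0300..U+036F); it is a
// stand-in for a real property lookup, not a complete one.
inline bool is_combining_mark(code_point cp) {
    return cp.value >= 0x0300 && cp.value <= 0x036F;
}

// Counts user-perceived units under a deliberately simplified rule:
// a code point that is not a combining mark starts a new unit, and any
// combining marks that follow it belong to that unit. A leading
// combining mark also counts as the start of a unit.
inline std::size_t count_clusters(const code_point_string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i == 0 || !is_combining_mark(s[i])) {
            ++n;
        }
    }
    return n;
}
```

With these definitions, appending U+0065, U+0301 and U+0078 gives a string of three code points that count_clusters treats as two units, which is the distinction between the code point level and the grapheme cluster level discussed above.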
I do agree, however, that for almost any output formatting, the
locale-specific or user-specified fill text/symbols should be specified
as strings, rather than as individual characters.
-- Jeremy Maitin-Shepard