From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2007-09-27 16:37:09


"James Porter" <porterj_at_[hidden]> writes:

> I see what you mean. Still, fixed-width-encoded strings are a lot easier to
> code, and I think we should focus on them first just to get something
> working and to have a platform to test code conversion on, which in my
> opinion is the most important part.

As others have said, a fixed-width encoding gains you very little in
practice, if anything. Random access to code points is, I think, an
extremely rare requirement. Replacing one code point with another is
likewise rare; in general you replace one substring (perhaps a grapheme
cluster) with another substring (which may also be a grapheme cluster).
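
To illustrate, here is a minimal sketch of substring replacement on a
UTF-8 string held in a plain std::string. The name replace_all is
hypothetical, not part of any proposed interface. It works purely on
code units and never indexes by code point; because UTF-8 is
self-synchronizing, a valid UTF-8 needle cannot match in the middle of
an encoded code point. It can still match inside a grapheme cluster,
which is a separate problem that a fixed-width encoding does not solve
either.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: replace every occurrence of `from` with `to`
// in a UTF-8 encoded string. Both arguments are assumed to be valid
// UTF-8. No code point indexing is involved.
std::string replace_all(std::string text,
                        const std::string& from,
                        const std::string& to) {
    if (from.empty()) return text;  // nothing to search for
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}
```

For example, replace_all("cafe\xCC\x81", "e\xCC\x81", "\xC3\xA9")
swaps "e" plus U+0301 (3 bytes) for the precomposed U+00E9 (2 bytes).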

[snip]

> That said, I think a good (general) roadmap for this project would be:
> 1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though
> string constants may pose a problem)

UCS-2 is bogus and should not be used at all: it cannot represent code
points outside the Basic Multilingual Plane. UCS-4 is conceivably
legitimate, but in practice it is unlikely to be used by anyone. Still,
it is probably important to support it. The primary Unicode encodings
to support should be UTF-8 and UTF-16.
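
For concreteness, a sketch of what UTF-16 does that UCS-2 cannot: code
points above U+FFFF are encoded as a surrogate pair of two 16-bit code
units. The function name encode_utf16 is hypothetical and only for
illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code
// units. Returns an empty vector for values that are not Unicode
// scalar values (surrogates, or anything above U+10FFFF).
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit, same as UCS-2.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a
        // high (lead) and a low (trail) surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

U+20AC fits in a single code unit, whereas U+1D11E becomes the pair
0xD834 0xDD1E, so UTF-16 is itself a variable-width encoding.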

> 2) Add code conversion to move between encodings, especially for I/O
> 3) Create VWE string class (fairly easy if immutable, hard if mutable)

I don't think the issues with a mutable UTF-8/UTF-16 representation are
very different from those with a mutable UTF-32 representation. In
practice, when handling non-ASCII text, all searching and replacement
will be in terms of substrings (most likely a single grapheme cluster
or a sequence of them).
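
A small sketch of why UTF-32 does not escape this, assuming a compiler
with std::u32string (C++11); replace_first is a hypothetical name.
Replacing one grapheme cluster with another changes the length of the
string even though every code point has the same width, so the mutable
container has to shift and resize just as a UTF-8 one would.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace the first occurrence of `from` with
// `to` in a UTF-32 string. The result may be longer or shorter than
// the input, exactly as with a variable-width encoding.
std::u32string replace_first(std::u32string text,
                             const std::u32string& from,
                             const std::u32string& to) {
    std::size_t pos =
        from.empty() ? std::u32string::npos : text.find(from);
    if (pos != std::u32string::npos) {
        text.replace(pos, from.size(), to);
    }
    return text;
}
```

Replacing U"e\u0301" with U"\u00E9" in U"cafe\u0301" shrinks the
string from 5 code points to 4.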

-- 
Jeremy Maitin-Shepard
