From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2007-09-27 16:37:09
"James Porter" <porterj_at_[hidden]> writes:
> I see what you mean. Still, fixed-width-encoded strings are a lot easier to
> code, and I think we should focus on them first just to get something
> working and to have a platform to test code conversion on, which in my
> opinion is the most important part.
As others have said, I think that in practice a fixed-width encoding
gains you very little or nothing at all. Needing random access to code
points is, I think, an extremely rare operation. Replacing one code
point with another code point is likewise rare; in
general you would replace one substring (perhaps a grapheme cluster)
with another substring (which may also be a grapheme cluster).
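
To illustrate why code-point indexing buys so little, here is a minimal
sketch (the helper name code_point_count is made up for this example, and
the input is assumed to be valid UTF-8). It counts code points by skipping
UTF-8 continuation bytes:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a UTF-8 string.
// Assumes the input is valid UTF-8. Continuation bytes have the bit
// pattern 10xxxxxx, so every other byte starts a new code point.
std::size_t code_point_count(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the decomposed form of "é" (the letter e followed by U+0301, bytes
65 CC 81), this returns 2, and the same text held as UTF-32 also has a
length of 2. So even with a fixed-width encoding, index i does not address
the i-th user-perceived character.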
[snip]
> That said, I think a good (general) roadmap for this project would be:
> 1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though
> string constants may pose a problem)
UCS-2 is bogus and should not be used at all, since it cannot represent
code points outside the Basic Multilingual Plane. Conceivably UCS-4 is
legitimate, but in practice it is not likely to be used by anyone. Still, it
is probably important to support it. The primary encodings of Unicode
to be supported should be UTF-8 and UTF-16.
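
As a rough sketch of the difference between UCS-2 and UTF-16 (the function
name to_utf16 is invented here, and the input is assumed to be a valid
Unicode scalar value): a code point above U+FFFF needs two 16-bit code
units in UTF-16, a surrogate pair, which a one-unit-per-character UCS-2
string has no way to hold.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-16.
// Code points below U+10000 take one code unit; everything above is
// split into a high and a low surrogate.
std::u16string to_utf16(char32_t cp) {
    // Precondition: in range and not itself a surrogate code point.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+20AC (the euro sign) comes out as the single unit 20AC,
while U+1D11E (musical symbol G clef) comes out as the pair D834 DD1E.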
> 2) Add code conversion to move between encodings, especially for I/O
> 3) Create VWE string class (fairly easy if immutable, hard if mutable)
I don't think the issues of a mutable UTF-8/UTF-16 representation are
very different from the issues of a mutable UTF-32 representation. In
practice, in handling non-ASCII text, all searching and replacement will
be in terms of substrings (likely single grapheme clusters or sequences
of them).
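
A small sketch of that point, assuming valid input and using a made-up
helper name (replace_first): the same substring replacement can be written
once and used unchanged on a UTF-8 std::string and on a UTF-32
std::u32string, because it never indexes individual code points.

```cpp
#include <string>

// Hypothetical helper: replace the first occurrence of `from` with `to`.
// Works on code units of any std::basic_string specialization; the
// replacement may have a different length than the text it replaces.
template <class String>
String replace_first(String text, const String& from, const String& to) {
    typename String::size_type pos = text.find(from);
    if (pos != String::npos) {
        text.replace(pos, from.size(), to);
    }
    return text;
}
```

Replacing the precomposed "ï" (U+00EF) in "naïve" with its decomposed form
(i followed by U+0308) grows the string by one code unit in both
representations: from 6 to 7 bytes in UTF-8 and from 5 to 6 units in
UTF-32. Mutation changes the length either way.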
-- Jeremy Maitin-Shepard