Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-20 04:00:34
On 01/19/2011 12:56 PM, Robert Ramey wrote:
> ... elision by patrick ...
> std::string - a sequence of bytes
> utf8_string - a sequence of "code points" implemented in terms of
> std::string.
With the ability to specify a conversion facet to convert from your
local encoding to UTF-8. The string would still validate the UTF-8
it receives from the conversion facet.
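
As a rough sketch of the kind of check I mean (the name is_valid_utf8 is
made up for illustration, not part of any proposed interface), validation
could reject stray continuation bytes, truncated sequences, overlong
forms, surrogates, and anything above U+10FFFF:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;   // total bytes in this sequence
        char32_t cp;       // code point being assembled
        char32_t min_cp;   // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;             // continuation byte or 0xF8..0xFF as lead
        if (n - i < len) return false; // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

A utf8_string constructor could run something like this on whatever the
facet hands back and throw (or otherwise refuse) on failure.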
What do you do about things that can validly be represented either by one
character or by a base character with one or more combining
characters? For example, Ü can be represented by U+00DC, a capital U
with diaeresis, or by the two code points U+0055 U+0308, a U and
a combining diaeresis. Ü <=- That one is done with a base character plus
a combining character, and the previous one is just one character. The
spec says that these must be considered absolutely equivalent (they are
canonically equivalent). Will our utf8_string class always choose one
representation over another? Certainly to make choices like this you'd
need the Unicode Character Database.
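
To make the problem concrete, here is a small sketch (the helper names are
mine, purely for illustration) that encodes both forms to UTF-8. The byte
sequences differ, so a plain std::string comparison would call them
unequal unless the class normalizes first:

```cpp
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Assumes `cp` is a valid Unicode scalar value.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// U+00DC, the precomposed form: bytes C3 9C.
std::string precomposed_u_diaeresis() {
    std::string s;
    append_utf8(s, 0x00DC);
    return s;
}

// U+0055 U+0308, the decomposed form: bytes 55 CC 88.
std::string decomposed_u_diaeresis() {
    std::string s;
    append_utf8(s, 0x0055);
    append_utf8(s, 0x0308);
    return s;
}
```

Actually composing or decomposing in general needs the Unicode data
tables, which this sketch deliberately leaves out.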
So, if you're iterating the utf8_string with an iterator iter, what type
does *iter return? A single dereference could _consume_ up to four bytes.
Is it a char32_t with the character in it, or is it another utf8_string
with only one character in it? I'd say char32_t because that can hold
anything in UCS.
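
Here is a minimal sketch of what the dereference might do underneath,
assuming the string has already been validated. The names decoded and
decode_at are hypothetical, just to show a char32_t result together with
the number of bytes it covers:

```cpp
#include <cstddef>
#include <string>

struct decoded {
    char32_t code_point;  // the value *iter would yield
    std::size_t length;   // how many bytes it consumed (1 to 4)
};

// Decode the code point starting at byte offset `pos`.
// Assumes `s` holds valid UTF-8 and `pos` is on a lead byte.
decoded decode_at(const std::string& s, std::size_t pos) {
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    std::size_t len = 1;
    char32_t cp = b0;  // the one-byte (ASCII) case
    if      ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    // Fold in six payload bits from each continuation byte.
    for (std::size_t k = 1; k < len; ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + k]) & 0x3F);
    }
    return decoded{cp, len};
}
```

An iterator's operator++ would then advance by length bytes rather than
by one.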
So then what about *iter = thechar? What type or types can thechar be:
char32_t, char16_t, wchar_t, char, unsigned char, int, int32_t, a
utf8_string with only one "character" to be copied in, or a longer
utf8_string from which we'll just take the first char?
I'd probably use char32_t in both those cases.
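
One wrinkle with assignment is that the new character may need a
different number of bytes than the one it replaces, so the write cannot
be a simple in-place store. A sketch of the underlying operation, with
made-up names (sequence_length, encode, assign_at) and assuming valid
UTF-8 and a valid code point:

```cpp
#include <cstddef>
#include <string>

// Bytes in the sequence whose lead byte is `b0` (valid UTF-8 assumed).
std::size_t sequence_length(unsigned char b0) {
    if ((b0 & 0xE0) == 0xC0) return 2;
    if ((b0 & 0xF0) == 0xE0) return 3;
    if ((b0 & 0xF8) == 0xF0) return 4;
    return 1;
}

// UTF-8 encoding of a single code point (valid scalar value assumed).
std::string encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Replace the code point whose lead byte is at `pos` with `cp`.
// The string may grow or shrink, shifting every byte after `pos`.
void assign_at(std::string& s, std::size_t pos, char32_t cp) {
    const std::size_t old_len =
        sequence_length(static_cast<unsigned char>(s[pos]));
    s.replace(pos, old_len, encode(cp));
}
```

Because the bytes after pos can move, any other iterators into the same
string would have to be treated as invalidated after such an assignment.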
Food for thought. I agree I'd like to see it be derived from
std::string so you can pass it to things that expect a std::string and
don't care so much about encoding.
Patrick