Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-20 04:00:34


On 01/19/2011 12:56 PM, Robert Ramey wrote:
> ... elision by patrick ...
> std::string - a sequence of bytes
> utf8_string - a sequence of "code points" implemented in terms of
> std::string.
With the ability to specify a conversion facet to convert from your
local encoding to utf-8. The string would still validate the utf-8
received from the conversion facet.
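
To illustrate the validation step only (the conversion facet itself is not modelled here), this is a minimal sketch of a check that a utf8_string could run on incoming bytes. The function name is_valid_utf8 is hypothetical, and the rules applied are the usual UTF-8 ones: well-formed lead and continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of any existing Boost or std API.
// Returns true if s is well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;      // total bytes in this sequence
        char32_t cp;          // code point being assembled
        char32_t min;         // smallest code point allowed for this length
        if (lead < 0x80) { ++i; continue; }  // ASCII, one byte
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;                   // stray continuation or invalid lead byte

        if (n - i < len) return false;       // sequence truncated at end of string
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}

// Small self-check showing typical accept and reject cases.
inline void check_validation_example() {
    assert(is_valid_utf8("\xC3\x9C"));    // U+00DC
    assert(!is_valid_utf8("\xC0\x80"));   // overlong encoding of U+0000
}
```

A real class would presumably throw or report the offending offset rather than return a bare bool; that choice is left open here.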

What do you do about things that can validly be represented either by
one character or by a base character followed by one or more combining
characters? For example, Ü can be represented by U+00DC, a capital U
with diaeresis, or by the two-code-point sequence U+0055 U+0308, a
capital U followed by a combining diaeresis. Ü <=- That one is done
with a base character plus a combining character, and the previous one
is just one character. The Unicode standard treats the two as
canonically equivalent. Will our utf8_string class always choose one
representation over another? Certainly to make choices like this you'd
need the Unicode Character Database.
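
To make the equivalence problem concrete, here is a small sketch showing that the two spellings of Ü are different byte sequences in UTF-8, so a plain std::string comparison treats them as unequal. The function compose_toy is a hypothetical stand-in for real normalization: it knows about this single pair only, whereas a real implementation would need the composition data from the Unicode Character Database.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+00DC (precomposed) encoded as UTF-8: 0xC3 0x9C.
inline std::string precomposed_u_diaeresis() { return "\xC3\x9C"; }

// U+0055 U+0308 (base letter + combining diaeresis) as UTF-8: 0x55 0xCC 0x88.
inline std::string decomposed_u_diaeresis() { return "U\xCC\x88"; }

// Toy "composition" that handles only U+0055 U+0308 -> U+00DC.
// Hypothetical and deliberately incomplete: real canonical composition
// needs the full Unicode data tables and reordering of combining marks.
inline std::string compose_toy(const std::string& s) {
    const std::string decomposed = decomposed_u_diaeresis();
    const std::string composed = precomposed_u_diaeresis();
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = s.find(decomposed, pos);
        if (hit == std::string::npos) {
            out.append(s, pos, std::string::npos);  // copy the remaining tail
            break;
        }
        out.append(s, pos, hit - pos);  // copy bytes before the match
        out += composed;                // substitute the precomposed form
        pos = hit + decomposed.size();
    }
    return out;
}

// Self-check: the raw bytes differ, the toy-normalized forms agree.
inline void check_equivalence_example() {
    assert(precomposed_u_diaeresis() != decomposed_u_diaeresis());
    assert(compose_toy(decomposed_u_diaeresis()) == precomposed_u_diaeresis());
}
```

Whether utf8_string should normalize on input, on comparison, or never is exactly the open design question; the sketch only shows why the question cannot be ignored.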

So, if you're iterating over the utf8_string with an iterator iter, what
type does *iter return? A single dereference could _consume_ several
bytes (up to four in UTF-8).

Is it a char32_t with the character in it, or is it another utf8_string
with only one character in it? I'd say char32_t because that can hold
any UCS code point.
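
A minimal sketch of the char32_t option, assuming the underlying bytes have already been validated as UTF-8. The names utf8_const_iterator and code_points are hypothetical, invented for illustration; the point is only that operator* decodes a variable number of bytes into one fixed-width value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical read-only iterator over the code points of a UTF-8 string.
// Assumes valid UTF-8; no error handling is attempted.
class utf8_const_iterator {
public:
    utf8_const_iterator(const std::string& s, std::size_t pos) : s_(&s), pos_(pos) {}

    // Decodes the code point at the current position (reads 1 to 4 bytes).
    char32_t operator*() const {
        static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
        const unsigned char lead = byte(pos_);
        const std::size_t n = sequence_length(lead);
        char32_t cp = lead & lead_mask[n];       // payload bits of the lead byte
        for (std::size_t i = 1; i < n; ++i)
            cp = (cp << 6) | (byte(pos_ + i) & 0x3F);  // 6 bits per continuation byte
        return cp;
    }

    // Advances by one code point, i.e. by 1 to 4 bytes.
    utf8_const_iterator& operator++() {
        pos_ += sequence_length(byte(pos_));
        return *this;
    }

    bool operator==(const utf8_const_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const utf8_const_iterator& o) const { return pos_ != o.pos_; }

private:
    unsigned char byte(std::size_t i) const { return static_cast<unsigned char>((*s_)[i]); }

    // Sequence length implied by the lead byte (input assumed valid).
    static std::size_t sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }

    const std::string* s_;
    std::size_t pos_;
};

// Collects every code point of s; handy for experimenting.
inline std::u32string code_points(const std::string& s) {
    std::u32string out;
    for (utf8_const_iterator it(s, 0), end(s, s.size()); it != end; ++it)
        out.push_back(*it);
    return out;
}

// Self-check: the decomposed form yields two code points, not one.
inline void check_iteration_example() {
    assert(code_points("U\xCC\x88").size() == 2);
    assert(code_points("\xC3\x9C").size() == 1);
}
```

Note that this iterates code points, not user-perceived characters: the decomposed Ü still comes out as two values.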

So then what about *iter = thechar? What type or types can thechar be?

char32_t, char16_t, wchar_t, char, unsigned char, int, int32_t, a
utf8_string with only one "character" to be copied in, or a utf8_string
from which we'll just take the first char?

I'd probably use char32_t in both those cases.
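
One consequence of assigning a char32_t through an iterator is that the new code point may need a different number of bytes than the old one, so the underlying std::string has to grow or shrink in the middle. The sketch below shows that effect with plain functions; encode_utf8, assign_code_point and with_code_point are hypothetical names, and the byte offset pos is assumed to point at the start of a valid sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encodes one code point as UTF-8 (1 to 4 bytes).
// Precondition: cp is a Unicode scalar value (not a surrogate, <= U+10FFFF).
inline std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Sequence length implied by a lead byte (input assumed valid UTF-8).
inline std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

// Replaces the code point that starts at byte offset pos with cp.
// Returns the byte length of the new encoding. Bytes after pos may shift.
inline std::size_t assign_code_point(std::string& s, std::size_t pos, char32_t cp) {
    const std::size_t old_len = sequence_length(static_cast<unsigned char>(s[pos]));
    const std::string enc = encode_utf8(cp);
    s.replace(pos, old_len, enc);
    return enc.size();
}

// Convenience wrapper for experimenting: returns a modified copy.
inline std::string with_code_point(std::string s, std::size_t pos, char32_t cp) {
    assign_code_point(s, pos, cp);
    return s;
}
```

For example, with_code_point("aUb", 1, 0xDC) turns a three-byte string into a four-byte one. Any other iterators or byte offsets into the same string would be invalidated by such an assignment, which argues for a proxy object or for making the iterator read-only.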

Food for thought. I agree I'd like to see it be derived from
std::string so you can pass it to things that expect a std::string and
don't care so much about encoding.

Patrick

