Boost :


Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-20 04:00:34

In reply to: Robert Ramey: "Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]"

On 01/19/2011 12:56 PM, Robert Ramey wrote:
> ... elision by patrick ...
> std::string - a sequence of bytes
> utf8_string - a sequence of "code points" implemented in terms of
> std::string.
With the ability to specify a conversion facet to convert from your
local encoding to utf-8. The string would still validate the utf-8
received from the conversion facet.

What do you do about things that can validly be represented either by
one character or by a base character followed by one or more combining
characters? For example, Ü can be represented by U+00DC, a capital U
with diaeresis, or by the two code points U+0055 U+0308, a U and
a combining diaeresis. Ü <=- That one is done with two code points
(a U plus the combining mark), and the previous one is just one
character. The Unicode standard treats these as canonically equivalent.
Will our utf8_string class always choose one representation over the
other? Certainly to make choices like this you'd need the Unicode
Character Database.
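To make the problem concrete, here is a minimal sketch of the two spellings as raw UTF-8 bytes. The function names are made up for illustration and are not part of any proposed utf8_string. A plain std::string comparison sees two different byte sequences, so equivalence would have to come from a normalization step, which is not shown here.

```cpp
#include <string>

// UTF-8 bytes for U+00DC, the precomposed capital U with diaeresis.
inline std::string precomposed_u_diaeresis() { return "\xC3\x9C"; }

// UTF-8 bytes for U+0055 followed by U+0308 (combining diaeresis).
inline std::string decomposed_u_diaeresis() { return "U\xCC\x88"; }

// Byte-wise equality: this is all std::string::operator== provides.
// It knows nothing about canonical equivalence.
inline bool bytes_equal(const std::string& a, const std::string& b) {
    return a == b;
}
```

The two strings are 2 and 3 bytes long respectively, and bytes_equal reports them as different even though a reader sees the same character.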

So, if you're iterating the utf8_string with an iterator iter, what type
does *iter return? It could _consume_ a lot of bytes.

Is it a char32_t with the character in it, or is it another utf8_string
with only one character in it? I'd say char32_t because that can hold
anything in UCS.
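If *iter did yield a char32_t, dereferencing would amount to something like the sketch below. decode_utf8 is a hypothetical free function, not an existing Boost or standard API, and it assumes the underlying storage is a std::string of UTF-8 bytes. It consumes between one and four bytes per code point and rejects malformed input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode the code point that starts at byte offset
// pos and advance pos past it. Throws on malformed UTF-8.
inline char32_t decode_utf8(const std::string& s, std::size_t& pos) {
    if (pos >= s.size()) throw std::out_of_range("decode_utf8: past end");
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else throw std::invalid_argument("decode_utf8: bad lead byte");

    if (pos + len > s.size())
        throw std::invalid_argument("decode_utf8: truncated sequence");

    // Each continuation byte has the form 10xxxxxx and adds six bits.
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80)
            throw std::invalid_argument("decode_utf8: bad continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("decode_utf8: invalid code point");

    pos += len;
    return cp;
}
```

Note that this yields code points, not user-perceived characters: the decomposed U with diaeresis comes back as two separate values, U+0055 and then U+0308.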

So then what about *iter = thechar? What type or types can thechar be?

char32_t, char16_t, wchar_t, char, unsigned char, int, int32_t, a
utf8_string with only one "character" to be copied in, or a utf8_string
of which we'll just take the first char?

I'd probably use char32_t in both those cases.
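One wrinkle with assigning a char32_t through the iterator is that the new code point may need a different number of bytes than the one it replaces, so the assignment can resize the underlying string. A minimal sketch, assuming std::string storage; encode_utf8 and assign_code_point are hypothetical names invented for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one code point as 1 to 4 UTF-8 bytes.
inline std::string encode_utf8(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("encode_utf8: invalid code point");
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical stand-in for "*iter = thechar": replace the old_len bytes
// at byte offset pos with the encoding of cp. The string may grow or
// shrink, which would invalidate byte offsets held by other iterators.
inline void assign_code_point(std::string& s, std::size_t pos,
                              std::size_t old_len, char32_t cp) {
    s.replace(pos, old_len, encode_utf8(cp));
}
```

For example, replacing the one-byte "U" in "aUb" with U+00DC turns a three-byte string into a four-byte one.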

Food for thought. I agree I'd like to see it be derived from
std::string so you can pass it to things that expect a std::string and
don't care so much about encoding.

Patrick

