Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-22 14:53:27


On Sat, Jan 22, 2011 at 12:36 AM, Patrick Horgan <phorgan1_at_[hidden]> wrote:
> On 01/21/2011 01:54 AM, Matus Chochlik wrote:
>>
>> ... elision by patrick...
>> Why not boost::string (explicitly stating in the docs that it is
>> UTF-8-based) ?
>> the name u8string suggests to me that it is meant for some special case
>> of character encoding and the (encoding agnostic/native) std::string
>> is still the way
>> to go.
>
> I think that's the truth.  std::string has some performance guarantees that
> a utf-8 based string wouldn't be able to keep.  std::string can do things,
> and people do things with std::string that a utf-8 based string can't do.

If that were really the case, then the problems you describe would already be
happening on all the platforms that use UTF-8 as the default encoding for every locale.
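
On POSIX systems this is easy to check; here is a minimal, POSIX-only sketch
(assuming the usual locale environment variables are set) that prints the
codeset of the user's default locale, which on most current Linux
distributions reports UTF-8:

  #include <clocale>
  #include <cstdio>
  #include <langinfo.h>   // POSIX

  int main()
  {
      std::setlocale(LC_ALL, "");                 // adopt the environment's default locale
      std::printf("%s\n", nl_langinfo(CODESET));  // typically prints "UTF-8"
  }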

>  If you set LC_COLLATE to en_US.utf8 or the equivalent (I hate the way
> locale names are not as standardized as you might like), then most of the
> standard algorithms will be locale aware and operations on your string will
> be muchly aware of the string encoding.  By switching locales, you can then
> operate on strings with other encodings.  utf-8_string isn't intended to
> operate like that.  It's specialized.
>>
>> IMO we should send the message that UTF-8 is
>> "normal"/"(semi-)standard"/"de-facto-standard"
>> and the other encodings like the native_t (or even ansi_t,
>> ibm_cp_xyz_t, string16_t,
>> string32_t, ...) are the special cases and they should be treated as such.
>
> Why would people want to lose so much of the functionality of std::string?

What functionality would they lose, exactly? Again, on many platforms the
default encoding for all (or nearly all) locales already is UTF-8, so if you
get a string from the OS API and store it in a std::string, it is UTF-8
encoded. I do an equal share of programming on the Windows and Linux
platforms, and I have yet to run into the problems you describe on Linux,
where the default encoding has been UTF-8 for some time now. Actually, today
I encounter more problems on Windows, where I can't set the locale to use
UTF-8 and consequently have to transcode data from socket connections or
files manually.
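
To make that concrete, here is a sketch of the kind of manual transcoding
step I mean, using the Win32 API to turn UTF-8 bytes (e.g. read from a socket
or a file) into the UTF-16 strings the wide API expects; the helper name is
only for illustration:

  #include <windows.h>
  #include <string>

  // Illustrative helper, not an existing API: UTF-8 bytes -> UTF-16.
  std::wstring utf8_to_wide(const std::string& s)
  {
      if (s.empty()) return std::wstring();
      int n = ::MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), 0, 0);
      std::wstring w(n, L'\0');
      ::MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &w[0], n);
      return w;
  }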

If you are talking about indexed random access to "logical characters", for
example on Windows with some specific encodings, then that is only
platform-specific, unportable functionality. What I propose is to extend the
interface so that it allows you to handle the raw byte sequences that are now
used to represent strings of logical characters in a platform-independent
way, by using the Unicode standard.
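
As a rough sketch of what I mean (assuming the bytes are valid UTF-8; the
function name is only illustrative), such an interface could expose the
logical characters by decoding the raw byte sequence on the fly:

  #include <cstddef>
  #include <string>
  #include <vector>

  // Walk the raw UTF-8 bytes of a std::string and yield Unicode
  // code points ("logical characters") in a platform-independent way.
  std::vector<char32_t> code_points(const std::string& s)
  {
      std::vector<char32_t> out;
      for (std::size_t i = 0; i < s.size(); )
      {
          unsigned char b = (unsigned char)s[i];
          std::size_t len;
          char32_t cp;
          if (b < 0x80)      { cp = b;        len = 1; }   // ASCII
          else if (b < 0xE0) { cp = b & 0x1F; len = 2; }   // 2-byte sequence
          else if (b < 0xF0) { cp = b & 0x0F; len = 3; }   // 3-byte sequence
          else               { cp = b & 0x07; len = 4; }   // 4-byte sequence
          for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
              cp = (cp << 6) | ((unsigned char)s[i + k] & 0x3F);
          out.push_back(cp);
          i += len;
      }
      return out;
  }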

>  The only advantage of a utf8_string would be automatic and continual
> verification that it's a valid utf-8 encoded string that otherwise acts as
> much as possible like a std::string.  For that you would give up a lot of
> other functionality.

Again, what exactly would you give up? The gain is not only what you
describe, but also that, for example, when a portable application writes text
into a file and the file is sent to a different machine with a different
platform, the string can be read on that other machine without explicit
transcoding (which otherwise means picking a library/tool that can do the
transcoding and using it explicitly everywhere you potentially handle data
that may come from a different platform).
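
For instance (a minimal sketch; the file name and sample text are arbitrary),
writing the UTF-8 bytes out directly is all that is needed, and any other
platform can read the very same bytes back:

  #include <fstream>
  #include <string>

  int main()
  {
      // UTF-8 bytes for "café", exactly as they might arrive from the OS API.
      std::string text = "caf\xC3\xA9";

      {
          std::ofstream out("note.txt", std::ios::binary);
          out << text;                  // write the raw UTF-8 bytes, no transcoding
      }

      std::ifstream in("note.txt", std::ios::binary);
      std::string back;
      std::getline(in, back);
      // 'back' holds the same UTF-8 byte sequence on Windows, Linux, etc.
  }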

BR,

Matus

