Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-23 12:42:55


On 01/22/2011 11:53 AM, Matus Chochlik wrote:
> On Sat, Jan 22, 2011 at 12:36 AM, Patrick Horgan<phorgan1_at_[hidden]> wrote:
>> On 01/21/2011 01:54 AM, Matus Chochlik wrote:
>>> ... elision by patrick...
>>> Why not boost::string (explicitly stating in the docs that it is
>>> UTF-8-based) ?
>>> the name u8string suggests to me that it is meant for some special case
>>> of character encoding and the (encoding agnostic/native) std::string
>>> is still the way
>>> to go.
>> I think that's the truth. std::string has some performance guarantees that
>> a utf-8 based string wouldn't be able to keep. std::string can do things,
>> and people do things with std::string that a utf-8 based string can't do.
> If this was really the case then what you describe would be already happening
> on all the platforms that use the UTF-8 encoding by default for any locale.
No. They're using std::string. It works just fine for this, as it does
for other things. Its performance guarantees are stated in terms of its
templated character type, not in terms of the encoding of the contents.
std::string lets you walk through a JIS string and decode it. A utf-8
string would hurl chunks since, of course, it wouldn't be encoded as
utf-8. I could go on and on, but perhaps if you'd refresh yourself on
the interface of std::string and think about what a validating utf-8
string would imply for it, you'd see. I'm really in favor of a
utf-8 string, I just wouldn't call it string because that would be a
lie. It wouldn't be a general string, but a special case of string.
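
To make the "character type, not encoding" point concrete, here is a
minimal sketch. The helper name count_code_points is made up for
illustration and it assumes the input is already well-formed utf-8.
size() and operator[] work on char elements in constant time, while
anything phrased in code points needs a linear walk:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes `s` is already well-formed UTF-8;
// it performs no validation at all.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // every byte that is not a continuation starts a code point
        }
    }
    return n;
}
```

For the utf-8 bytes of "naïve", size() reports 6 while the walk above
reports 5, and s[2] is the lead byte of the two-byte sequence, not a
whole character.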

> ... elision by me.
> What functionality would they lose exactly ? Again, on many platforms
> the default encoding for all (or nearly all) locales already is UTF-8 so
> if you get a string from the OS API and store it into a std::string then
> it is UTF-8 encoded. I do an equal share of programming on Windows
> and Linux platforms and I have yet to run into these problems you
> describe on Linux where for some time now the default encoding is UTF-8.
> Actually today I encounter more problems on Windows, where I can't
> set the locale to use UTF-8 and consequently I have to transcode data
> from socket connections or files manually.

I never said that having utf-8 encoded characters in a std::string
would cause you any kind of problem. I don't think I even said
anything that would let you infer that. You're completely off base
here. I was talking about a string specialized for utf-8 encoded
characters. You're chasing a red herring. You're stalking a strawman
target. I agree with you entirely. std::string does a great job of
holding utf-8 encoded characters, as well as many other things.

> If you are talking about being able to have indexed random-access
> to "logical characters" for example on Windows with some special
> encodings, then this is only a platform-specific and unportable functionality.
> What I propose is to extend the interface so that it would allow you
> to handle the "raw-byte-sequences" that are now used to represent strings
> of logical characters in a platform independent way by using the Unicode
> standard.

That's nice. I vote for that idea. Just don't call it std::string,
because it won't be, and you won't be able to do everything with it that
std::string does today.

>> The only advantage of a utf8_string would be automatic and continual
>> verification that it's a valid utf-8 encoded string that otherwise acts as
>> much as possible like a std::string. For that you would give up a lot of
>> other functionality.
> Again what exactly would you give up? The gain is not only what you describe,
> but also that, for example when writing text into a file in a portable
> application, sending the file to a different machine with a different
> platform you can read the string on that other machine without explicit
> transcoding (which means picking a library/tool that can do the transcoding
> and use it explicitly everywhere you potentially handle data that may come
> from different platforms).

That's an advantage of a utf-8 encoded file. You don't need a special
string type to write to that. Before writing to the file you can hold
the data in memory in a std::string or a C string, or a chunk of mmap'd
memory today, and if they contain data encoded in utf-8 you have the
same advantage.

There's a great advantage to a utf-8_string, in that as long as it
always validates, you never have to check the data again for
correctness. None of the routines with interfaces written in terms of
it would have to do any validating for utf-8 encoding correctness, and
they could worry solely about their own missions. I _would_ like to be
able to choose between two types of behavior, because sometimes I would
want it to throw on an invalid sequence, but at other times I'd like it
to substitute an indicator of an invalid character so I don't throw out
the baby with the bath water. Both of these types of reactions are
talked about in the utf-8 spec.
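
The two behaviors could look roughly like the sketch below. None of
these names exist in Boost or the standard library: OnInvalid,
utf8_sequence_length and checked_utf8 are invented for illustration.
The substitute case is also simplified, since it emits one U+FFFD per
offending byte and then resynchronizes, which is only one of several
possible replacement policies:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical policy switch: throw, or substitute U+FFFD.
enum class OnInvalid { Throw, Replace };

// Returns the length (1..4) of the well-formed UTF-8 sequence starting at
// s[i], or 0 if the bytes there are not well-formed. Requires i < s.size().
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    unsigned long cp = 0;
    if (b0 < 0x80) return 1;                                   // ASCII
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return 0;              // stray continuation or invalid lead byte
    if (i + len > s.size()) return 0;                          // truncated
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return 0;       // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len]) return 0;                 // overlong encoding
    if (cp > 0x10FFFF) return 0;                    // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     // surrogate code point
    return len;
}

// Returns a copy of `in` that is guaranteed to be well-formed UTF-8,
// either by throwing on the first bad byte or by substituting U+FFFD.
std::string checked_utf8(const std::string& in, OnInvalid policy) {
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const std::size_t len = utf8_sequence_length(in, i);
        if (len != 0) {
            out.append(in, i, len);
            i += len;
        } else if (policy == OnInvalid::Throw) {
            throw std::invalid_argument(
                "invalid UTF-8 at byte " + std::to_string(i));
        } else {
            out += "\xEF\xBF\xBD";  // U+FFFD REPLACEMENT CHARACTER
            ++i;                    // skip one byte, then resynchronize
        }
    }
    return out;
}
```

A utf-8_string type could run a check like this once at construction
and on every mutation, which is what would let routines written
against it skip validation entirely.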

A string specialized to only hold utf-8 encoded data wouldn't be any
good to someone not using utf-8 encoding. Even if they were using 7-bit
ascii for a network application, like ftp, for example, they'd have to
pay the penalty for validating that the 7-bit ascii was valid utf-8. If
they're using it as a container to carry around encrypted strings, well
that wouldn't be possible at all.
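
As a small illustration of both cases, here is a sketch with two
made-up helpers, make_blob and is_seven_bit_ascii. The first shows
std::string carrying arbitrary bytes, such as ciphertext, untouched.
The second is the kind of linear scan that even pure 7-bit ascii data
would have to pay for before a validating string could accept it:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: wraps raw bytes (for example ciphertext) in a
// std::string. Embedded NULs and bytes >= 0x80 are stored as-is.
std::string make_blob(const unsigned char* data, std::size_t n) {
    return std::string(reinterpret_cast<const char*>(data), n);
}

// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
// This is an O(n) pass over the data, however trivially valid it is.
bool is_seven_bit_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return false;
    }
    return true;
}
```

A blob built from the bytes 00 FF C0 80 keeps all four bytes in a
std::string, yet it contains sequences that no utf-8 validator should
accept.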

If a system call returned a name valid in that operating system that I
would later pass to another system call, if it wasn't utf-8 what could I
do? Break? Or corrupt the string?

utf-8 encoding is useful, but it doesn't account for the majority of
the text used in the world. Many applications in the world are not
internationalized at all and never will be. They are written for one
specific place on earth and that's all they want to do. In twenty years
you'll see just as many applications using EUC as today. Shoot, COBOL
hasn't gone away.

I like to use utf-8 as my interface to the world, for, just as you
said, a portable file, or a web page. I don't use it internally in my
code unless I'm just carrying something from one external interface to
another. I would LOVE to have a validating utf-8_string. It would be
really useful in a web app.

Vayo con Diós,

Patrick

