Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-19 09:08:06

On Wed, Jan 19, 2011 at 2:39 PM, Chad Nelson
<chad.thecomfychair_at_[hidden]> wrote:
> On Wed, 19 Jan 2011 11:33:02 +0100
> Matus Chochlik <chochlik_at_[hidden]> wrote:
>
>>
>> *Scenario A:*
>
> Sounds like a little slice of heaven to me. Though you'll still have
> the pesky problem of having to verify that the UTF-8 code is valid all
> the time. More on that below.

I am a believer ;) and when people realize that UTF-8 is the way to
go, the pesky problems will vanish. Believe me today with ANSI

Today I have to check/detect the encoding of input files created
by users on different Windows machines and do the conversions.
And checking if data is valid UTF-8 is IMO an easier task.
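
To illustrate why I consider the check an easy one, here is a minimal
sketch of a well-formedness test, assuming the rules of RFC 3629 (no
overlong forms, no surrogates, nothing above U+10FFFF). The name
is_valid_utf8 is made up for this example; it is not part of Boost or
of the proposed utf*_t classes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;     // total number of bytes in this sequence
        unsigned long cp;    // code point decoded so far
        unsigned long min;   // smallest code point allowed for this length

        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;   // stray continuation byte or invalid lead byte

        if (n - i < len) return false;               // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode range
        i += len;
    }
    return true;
}
```

A lone Windows-1252 byte such as 0xE9 fails this test, while the UTF-8
form 0xC3 0xA9 passes. There is no comparable test for telling one
single-byte code page from another: almost any byte sequence is
acceptable in each of them, so detection has to rely on guessing.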

Most people here use Windows-1252, which is not so different
from ASCII, so even if something gets garbled it can be rescued.
I can't imagine what it is like in countries that have to deal with
Semitic languages, Chinese/Japanese/Korean ideograms, etc.

>
>> *Scenario B:*
>>
>
> How is that different from what we've got today, except that the utf*_t
> classes will make converting to and from different string types, and
> validating the UTF code, a little easier and more automatic?

Exactly, and I think that we agree that the current status is far from
ideal. The automatic conversions would (probably) be OK but
introducing yet another string class is not.

>
>> Also half of the cpu time assigned to running that application will
>> be wasted on useless string transcoding. And half of the memory will
>> be occupied with useless transcoding-related code and data.
>
> I think that's a bit of an exaggeration. :-) As more libraries move to

Yes, sorry I could not resist :)

> the assumption that std::string == UTF-8, the need (and code) for
> transcoding will silently vanish. Eventually, utf8_t will just be a
> statement by the programmer that the data contained within is
> guaranteed to be valid UTF-8, enforced by the class -- something that
> would require at minimum an extra call if using std::string, one that
> could be forgotten and open up the program to exploits.

Yes, but why not enforce it "organizationally" with the power
and influence Boost has? Again, I know that it would break a lot
of stuff, but are all those people who now use std::string really ready
to change all their code to use utf8_t instead? Which will involve
more work? I'm convinced that it will be the latter (switching to
utf8_t), but I can be wrong.

And many people already *do* use std::string for UTF-8 and are
doing the "right" (sorry :)) thing. By introducing utf8_t we are
"punishing" them, because we want them to change their code for the
sake of people who still dwell on ANSI. IMO we should do the opposite.

>> [...] - Once we overcome the troubled period of transition everything
>> will be just great. No headaches related to file encoding detection
>> and transcoding.
>
> It's the getting-there part that I'm concerned about.

Me too, but again, many other people have already pointed out
that a large portion of the code is completely encoding-agnostic,
so there would be no impact if we stayed with std::string. There
would be if we added utf8_t.
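
As a small illustration of what "encoding-agnostic" means here, a
sketch of a splitter that works on raw bytes. The function split is a
made-up example, not taken from any library. It assumes the delimiter
is an ASCII character; in UTF-8, bytes below 0x80 never occur inside a
multi-byte sequence, so the same code handles ASCII, Windows-1252 and
UTF-8 input without knowing which one it got.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `s` at every occurrence of `delim`.
// It only compares bytes and never interprets the encoding.
std::vector<std::string> split(const std::string& s, char delim)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = s.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));   // last (or only) field
            return parts;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;                        // skip the delimiter
    }
}
```

Code like this keeps compiling and behaving the same if std::string is
simply declared to hold UTF-8. It would need to be touched only if its
interface had to change to a new string type.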

>
>> Think about what will happen after we accept IPV6 and drop IPV4. The
>> process will be painful but after it is done, there will be no more
>> NAT, and co. and the whole network infrastructure will be simplified.
>
> That's a problem I've been watching carefully for many years now, and I
> don't see that happening. ISPs will switch to IPv6 (because they have
> to), and make it possible for their customers to stay on IPv4, so their
> customers *will* stay on IPv4 because it's cheaper. And if they stay
> with IPv4, there won't be any impetus for consumer electronics
> companies to make their equipment IPv6-compatible because consumers
> won't care about it. Without consumer demand, it won't get done for
> years, maybe a decade or more.
>
> That's what I see happening with std::string and UTF-8 as well.

Yes, people (me included) are resistant to big changes, even for the better.
But I've learned that I should always consider the long-term impact.

>> - Creating another string class, which, let us face it, not everybody
>> will accept even with the Boost influence unless it becomes standard.
>
> That's the beauty of it -- not everybody *needs* to accept it. Just the
> people who write code that isn't encoding-agnostic. Boost.FileSystem
> might provide a utf16_t overload for Windows, for instance, so that it
> can automatically convert strings in other UTF types. But I see no
> reason it would lose the existing interface.

So you suggest that in the STL there would be, for example,
a third "ufstream" besides the existing fstream and wfstream.
I think that we actually should be reducing the interface,
not expanding it (yes I hear it ... "breaking changes!" :)).

>
>> - We will abandon std::string and be stuck with utf8_t which I
>> *personally* already dislike :)
>
> Any technical reason why, other than what you've already written?

Besides the ugly name and that it is a new class? No :)

> I hate to point this out, but people are *already* using other
> programming languages. :-) C++ isn't new or sexy, and has some
> pain-points (though many of the most egregious ones will be solved with
> C++0x). Unicode handling is one of them, and in my opinion, the utf*_t
> types will only ease that.

And the solution is long overdue. And creating utf8_t is just putting
the problem off, not really solving it.

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk