Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-19 09:08:06
On Wed, Jan 19, 2011 at 2:39 PM, Chad Nelson
> On Wed, 19 Jan 2011 11:33:02 +0100
> Matus Chochlik <chochlik_at_[hidden]> wrote:
>> *Scenario A:*
> Sounds like a little slice of heaven to me. Though you'll still have
> the pesky problem of having to verify that the UTF-8 code is valid all
> the time. More on that below.
I am a believer ;) and when people realize that UTF-8 is the way to
go, the pesky problems will vanish. Believe me, today with ANSI
I have to check/detect the encoding of input files created
by users on different Windows machines and do the conversions.
And checking if data is valid UTF-8 is IMO an easier task.
Most people here use Windows-1252, which is not so different
from ASCII, so even if something gets garbled it can be rescued.
I can't imagine what it is like in countries that have to deal with
Semitic languages, Chinese/Japanese/Korean ideograms, etc.
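
To illustrate why I think the validity check is the easier task, here is
a minimal sketch of such a check on a plain std::string. The name
is_valid_utf8 is a hypothetical helper made up for this mail, not an
existing Boost or standard library function, and it only assumes the
usual UTF-8 well-formedness rules (no overlong forms, no surrogates,
nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Boost or std API): returns true if `s` is
// well-formed UTF-8. Rejects truncated sequences, stray continuation
// bytes, overlong forms, UTF-16 surrogates and code points > U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after `lead`
        unsigned long cp;      // code point being decoded
        unsigned long min_cp;  // smallest code point allowed at this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;     // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += extra + 1;
    }
    return true;
}
```

With this, a Windows-1252 "café" (the bytes "caf\xE9") is rejected while
the UTF-8 form ("caf\xC3\xA9") is accepted, so non-ASCII legacy input
tends to be caught without guessing which code page it came from.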
>> *Scenario B:*
> How is that different from what we've got today, except that the utf*_t
> classes will make converting to and from different string types, and
> validating the UTF code, a little easier and more automatic?
Exactly, and I think that we agree that the current status is far from
ideal. The automatic conversions would (probably) be OK but
introducing yet another string class is not.
>> Also half of the cpu time assigned to running that application will
>> be wasted on useless string transcoding. And half of the memory will
>> be occupied with useless transcoding-related code and data.
> I think that's a bit of an exaggeration. :-) As more libraries move to
Yes, sorry I could not resist :)
> the assumption that std::string == UTF-8, the need (and code) for
> transcoding will silently vanish. Eventually, utf8_t will just be a
> statement by the programmer that the data contained within is
> guaranteed to be valid UTF-8, enforced by the class -- something that
> would require at minimum an extra call if using std::string, one that
> could be forgotten and open up the program to exploits.
Yes, but why not enforce it "organizationally" with the power
and influence Boost has? Again, I know that it would break a lot
of stuff, but are all those people who now use std::string really ready
to change all their code to use utf8_t instead? Which will involve
more work? I'm convinced that it will be the latter (switching
everything to utf8_t), but I can be wrong.
And many people already *do* use std::string for UTF-8 and are
doing the "right" (sorry :)) thing. By introducing utf8_t we are "punishing"
them because we want them, for the sake of people who still dwell
on ANSI, to change their code. IMO we should do the opposite.
>> [...] - Once we overcome the troubled period of transition everything
>> will be just great. No headaches related to file encoding detection
>> and transcoding.
> It's the getting-there part that I'm concerned about.
Me too, but again, many other people have already pointed out
that a large portion of the code is completely encoding-agnostic,
so there would be no impact if we stayed with std::string. There
would be if we added utf8_t.
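
As a small example of what I mean by encoding-agnostic code, consider
splitting a string on an ASCII delimiter. The split function below is a
hypothetical helper written for this mail, not taken from any library.
It never looks at the encoding, yet it is safe on UTF-8 input because
bytes below 0x80 never occur inside a multi-byte UTF-8 sequence.

```cpp
#include <string>
#include <vector>

// Hypothetical helper for illustration: splits `s` on the byte `delim`.
// It assumes `delim` is a 7-bit ASCII character; under that assumption
// it works unchanged for UTF-8 and for single-byte encodings such as
// Windows-1252, because it only compares and copies bytes.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = s.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));  // last (possibly empty) field
            return parts;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;                       // skip the delimiter itself
    }
}
```

Code like this takes and returns std::string today and would keep working
if std::string were declared to hold UTF-8; it would only need touching
if its interface had to change to a new string type.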
>> Think about what will happen after we accept IPV6 and drop IPV4. The
>> process will be painful but after it is done, there will be no more
>> NAT, and co. and the whole network infrastructure will be simplified.
> That's a problem I've been watching carefully for many years now, and I
> don't see that happening. ISPs will switch to IPv6 (because they have
> to), and make it possible for their customers to stay on IPv4, so their
> customers *will* stay on IPv4 because it's cheaper. And if they stay
> with IPv4, there won't be any impetus for consumer electronics
> companies to make their equipment IPv6-compatible because consumers
> won't care about it. Without consumer demand, it won't get done for
> years, maybe a decade or more.
> That's what I see happening with std::string and UTF-8 as well.
Yes, people (me included) are resistant to big changes, even for the better.
But I've learned that I should always consider the long-term impact.
>> - Creating another string class, which, let us face it, not everybody
>> will accept even with the Boost influence unless it becomes standard.
> That's the beauty of it -- not everybody *needs* to accept it. Just the
> people who write code that isn't encoding-agnostic. Boost.FileSystem
> might provide a utf16_t overload for Windows, for instance, so that it
> can automatically convert strings in other UTF types. But I see no
> reason it would lose the existing interface.
So you suggest that, for example, in the STL there would be,
besides the existing fstream and wfstream, also
a third "ufstream"? I think that we should actually be reducing
the interface, not expanding it (yes, I hear it ... "breaking changes!" :)).
>> - We will abandon std::string and be stuck with utf8_t which I
>> *personally* already dislike :)
> Any technical reason why, other than what you've already written?
Besides the ugly name and that it is a new class? No :)
> I hate to point this out, but people are *already* using other
> programming languages. :-) C++ isn't new or sexy, and has some
> pain-points (though many of the most egregious ones will be solved with
> C++0x). Unicode handling is one of them, and in my opinion, the utf*_t
> types will only ease that.
And the solution is long overdue. And creating utf8_t is just putting
the problem off, not really solving it.