Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-19 14:50:25

On Wed, 19 Jan 2011 15:08:06 +0100
Matus Chochlik <chochlik_at_[hidden]> wrote:

>> How is that different from what we've got today, except that the
>> utf*_t classes will make converting to and from different string
>> types, and validating the UTF code, a little easier and more
>> automatic?
> Exactly, and I think that we agree that the current status is far from
> ideal. The automatic conversions would (probably) be OK but
> introducing yet another string class is not.

Do you see another way to provide those conversions, and automatic
verification of proper UTF coding? (Automatic verification is a very
good thing; without it, someone will skip the check or forget it, and
open up their programs to exploitation.)
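
To make "automatic verification" concrete, here is a rough sketch of a
string wrapper that refuses malformed UTF-8 at construction time. It is
only an illustration: the member names, the helper is_valid_utf8, and
the choice to throw std::invalid_argument are assumptions made for this
example, not the interface of the proposed utf*_t classes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if s is well-formed UTF-8. Rejects stray or
// missing continuation bytes, overlong forms, surrogates, and code
// points above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }           // single-byte (ASCII)
        std::size_t len = 0;
        unsigned long cp = 0;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                          // invalid lead byte
        if (n - i < len) return false;              // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;    // overlong
        if (len == 3 && cp < 0x800) return false;   // overlong
        if (len == 4 && cp < 0x10000) return false; // overlong
        if (cp > 0x10FFFF) return false;            // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += len;
    }
    return true;
}

// Illustrative stand-in for a utf8_t-style class: holding one means the
// bytes inside were checked when the object was built.
class utf8_t {
public:
    explicit utf8_t(const std::string& bytes) : data_(bytes) {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("malformed UTF-8");
    }
    const std::string& str() const { return data_; }
private:
    std::string data_;
};
```

The point of the design is that the check cannot be forgotten: a
function that takes a utf8_t parameter never sees unverified bytes,
whereas one that takes std::string has to remember to validate.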

>> the assumption that std::string == UTF-8, the need (and code) for
>> transcoding will silently vanish. Eventually, utf8_t will just be a
>> statement by the programmer that the data contained within is
>> guaranteed to be valid UTF-8, enforced by the class -- something that
>> would require at minimum an extra call if using std::string, one that
>> could be forgotten and open up the program to exploits.
> Yes but why do not enforce it "organizationally" with the power
> and influence Boost has. Again I know that it would break a lot
> of stuff but really are all those people that now use std::string
> ready to change all their code to use utf8_t instead ? Which will
> involve more work ? I'm convinced that it will be the latter, but I
> can be wrong.

If Boost comes out with a version that breaks existing programs,
companies just won't upgrade to it. I can keep one of the companies
that mine works with upgrading, because the group that I work with is
the only one there using C++ and they listen to me, but most companies
have a lot more invested in the existing system. Believe me, any
breaking changes have to be eased in over many versions -- the "boiling
a frog" approach. :-)

> And many people already *do* use std::string for UTF-8 and are doing
> the "right" (sorry :)) thing, by introducing utf8_t we are
> "punishing" them because we want them, for the sake of people which
> still dwell on ANSI, to change their code. IMO we should do the
> opposite.

If they're already using UTF-8 strings, then we provide something like
BOOST_ALL_STD_STRINGS_ARE_UTF8 that they can define. The utf*_t classes
configure themselves to accept std::strings as UTF-8-encoded, and any
changes are completely transparent to those people. No punishment
involved.

For everyone else, we introduce the utf*_t API alongside the
std::string one, for those classes and functions that are not
encoding-agnostic. The std::string one can be deprecated in future
versions if the library author desires. Again, no punishment involved.
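
As a rough sketch of how those two cases might look in code: the macro
name BOOST_ALL_STD_STRINGS_ARE_UTF8 is the one suggested above, but
everything else here (the SKETCH_UTF8_CTOR helper macro, the char_count
and starts_with_bom functions, and the idea of toggling an explicit
constructor) is a hypothetical illustration, not a worked-out design.

```cpp
#include <cstddef>
#include <string>

// Define this (e.g. -DBOOST_ALL_STD_STRINGS_ARE_UTF8) to declare that
// every std::string in the program already holds UTF-8.
#ifdef BOOST_ALL_STD_STRINGS_ARE_UTF8
#  define SKETCH_UTF8_CTOR            /* implicit: std::string converts silently */
#else
#  define SKETCH_UTF8_CTOR explicit   /* caller must write utf8_t(...) */
#endif

class utf8_t {
public:
    // A real class would validate the bytes here; omitted in this sketch.
    SKETCH_UTF8_CTOR utf8_t(const std::string& bytes) : data_(bytes) {}
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// New encoding-aware overload: counts code points by skipping UTF-8
// continuation bytes (those of the form 10xxxxxx).
inline std::size_t char_count(const utf8_t& text) {
    std::size_t count = 0;
    const std::string& s = text.str();
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++count;
    return count;
}

// Existing overload, left untouched: one byte per character, as before.
// A library author could deprecate it in a later release.
inline std::size_t char_count(const std::string& text) {
    return text.size();
}

// Hypothetical function offered only in utf8_t form.
inline bool starts_with_bom(const utf8_t& text) {
    return text.str().compare(0, 3, "\xEF\xBB\xBF") == 0;
}
```

Without the macro, existing calls such as char_count(some_std_string)
keep their old behaviour, and callers opt in by writing
utf8_t(some_std_string). With the macro defined, a std::string can be
passed straight to starts_with_bom, with no source changes on the
caller's side.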

>>> [...] - Once we overcome the troubled period of transition
>>> everything will be just great. No headaches related to file
>>> encoding detection and transcoding.
>> It's the getting-there part that I'm concerned about.
> Me too, but again many other people already pointed out
> that a large portion of the code is completely encoding agnostic
> so there would be no impact if we stayed with std::string. There
> would be, if we add utf8_t.

Those portions of the code that are encoding-agnostic can continue
using std::string, and nothing changes. It's only the functions that
need to know the encoding that would change, and that change can be

>>> Think about what will happen after we accept IPV6 and drop IPV4. The
>>> process will be painful but after it is done, there will be no more
>>> NAT, and co. and the whole network infrastructure will be
>>> simplified.
>> That's a problem I've been watching carefully for many years now,
>> and I don't see that happening. [...]
> Yes, people (me included) are resistant to big changes even for the
> better. But I've learned that I should always consider the long-term
> impact.

As have I. :-) I think the design I'm proposing is low-impact enough
that people will adopt it. Slowly, but they will.

>> That's the beauty of it -- not everybody *needs* to accept it. Just
>> the people who write code that isn't encoding-agnostic.
>> Boost.FileSystem might provide a utf16_t overload for Windows, for
>> instance, so that it can automatically convert strings in other UTF
>> types. But I see no reason it would lose the existing interface.
> So you suggest that for example in the STL there would be (for
> example) besides the existing fstream and wfstream also a third
> "ufstream". I think that we actually should be reducing the interface
> not expanding it (yes I hear it ... "breaking changes!" :)).

I don't expect that the utf*_t classes will make it into the standard.
They definitely won't make it into the now-misnamed C++0x standard, and
it'll likely be another ten years before another one is hashed out --
by then, the UTF-8 conversion should be complete, so there will be no
need for it, except possibly to confirm that a string isn't malformed.

>>> - We will abandon std::string and be stuck with utf8_t which I
>>> *personally* already dislike :)
>> Any technical reason why, other than what you've already written?
> Besides the ugly name and that is a new class ? No :)

If you can think of a more-acceptable-but-still-descriptive name for
it, I'm all ears. :-)

>> I hate to point this out, but people are *already* using other
>> programming languages. :-) C++ isn't new or sexy, and has some
>> pain-points (though many of the most egregious ones will be solved
>> with C++0x). Unicode handling is one of them, and in my opinion, the
>> utf*_t types will only ease that.
> And the solution is long overdue. And creating utf8_t is just putting
> the problem away, not solving it really.

I see it as merely easing the transition.

Chad Nelson
Oak Circle Software, Inc.
