Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-19 08:39:28
On Wed, 19 Jan 2011 11:33:02 +0100
Matus Chochlik <chochlik_at_[hidden]> wrote:
> The string-encoding-related discussion boils down for me to the
> following: What will the string handling in C++ look like in the
> (maybe not immediate) future.
> *Scenario A:*
> We will pick a widely-accepted char-based encoding [...] and use that
> with std::strings which will become the one and only text string
> 'container' class.
> All the wstrings, wxString, Qstrings, utf8strings, etc. will be
> abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out
> with the help of convenience classes like ansi_str_t and ucs2_t that
> will be made obsolete and finally dropped (after the transition).
Sounds like a little slice of heaven to me. Though you'll still have
the pesky problem of having to verify that the UTF-8 code is valid all
the time. More on that below.
> *Scenario B:*
> We will add yet another string class named utf8_t to the already
> crowded set named above. [...] Now an application using libraries
> [a..z] will become the developer's nightmare. What string should he
> use for the class members, constructor parameters, what to do when
> the conversions do not work so seamlessly?
How is that different from what we've got today, except that the utf*_t
classes will make converting to and from different string types, and
validating the UTF code, a little easier and more automatic?
> Also half of the cpu time assigned to running that application will
> be wasted on useless string transcoding. And half of the memory will
> be occupied with useless transcoding-related code and data.
I think that's a bit of an exaggeration. :-) As more libraries move to
the assumption that std::string == UTF-8, the need (and code) for
transcoding will silently vanish. Eventually, utf8_t will just be a
statement by the programmer that the data contained within is
guaranteed to be valid UTF-8, enforced by the class -- something that
would require at minimum an extra call if using std::string, one that
could be forgotten and open up the program to exploits.
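
To illustrate the "enforced by the class" point, here is a minimal sketch of what such a wrapper might look like. This is not the proposed utf8_t interface; the helper is_valid_utf8, the str() accessor, and the choice to throw std::invalid_argument are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: true if `s` is well-formed UTF-8. Rejects overlong
// forms, surrogates, truncated sequences, and code points above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                      // stray continuation or bad lead byte
        if (n - i < len) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // past the Unicode range
        i += len;
    }
    return true;
}

// Sketch of a string type whose invariant is "contents are valid UTF-8".
class utf8_t {
public:
    utf8_t() = default;
    explicit utf8_t(std::string bytes) : data_(std::move(bytes)) {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("utf8_t: input is not valid UTF-8");
    }
    const std::string& str() const noexcept { return data_; }
private:
    std::string data_;
};

// Usage: valid input is accepted, an overlong encoding of '/' is refused.
inline void utf8_t_demo() {
    utf8_t ok("caf\xC3\xA9");
    assert(ok.str().size() == 5);
    bool rejected = false;
    try { utf8_t bad("\xC0\xAF"); }
    catch (const std::invalid_argument&) { rejected = true; }
    assert(rejected);
}
```

With a plain std::string, the caller has to remember to call something like is_valid_utf8 at every trust boundary; with the wrapper, a function that takes a utf8_t can't receive unvalidated bytes.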
> *Scenario C:*
> This is basically the status quo; a mix of the above. A sad and
> unsatisfactory state of things.
> *Consequences of A:*
> [...] - Once we overcome the troubled period of transition everything
> will be just great. No headaches related to file encoding detection
> and transcoding.
It's the getting-there part that I'm concerned about.
> Think about what will happen after we accept IPV6 and drop IPV4. The
> process will be painful but after it is done, there will be no more
> NAT, and co. and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I
don't see that happening. ISPs will switch to IPv6 (because they have
to), and make it possible for their customers to stay on IPv4, so their
customers *will* stay on IPv4 because it's cheaper. And if they stay
with IPv4, there won't be any impetus for consumer electronics
companies to make their equipment IPv6-compatible because consumers
won't care about it. Without consumer demand, it won't get done for
years, maybe a decade or more.
That's what I see happening with std::string and UTF-8 as well.
> *Consequences of B:*
> - No fixing of existing interface which IMO means no or very slow
> moving on to a single encoding.
Which, as stated above, I believe will happen anyway.
> - Creating another string class, which, let us face it, not everybody
> will accept even with the Boost influence unless it becomes standard.
That's the beauty of it -- not everybody *needs* to accept it. Just the
people who write code that isn't encoding-agnostic. Boost.FileSystem
might provide a utf16_t overload for Windows, for instance, so that it
can automatically convert strings in other UTF types. But I see no
reason it would lose the existing interface.
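
As a rough sketch of that overload idea: the function native_call below is a made-up stand-in for a library entry point that wants UTF-16 on Windows (it is not Boost.Filesystem's actual interface), and utf8_to_utf16 is a hypothetical helper. std::u16string and std::string stand in for the utf16_t and utf8_t types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-16, throwing on malformed input.
inline std::u16string utf8_to_utf16(const std::string& in) {
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len;
        std::uint32_t cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range code point");
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);              // one UTF-16 unit
        } else {
            cp -= 0x10000;                                 // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
        i += len;
    }
    return out;
}

// Stand-in for a native UTF-16 entry point; it just reports how many
// UTF-16 code units it was handed.
inline std::size_t native_call(const std::u16string& path) {
    return path.size();
}

// Added overload: accepts UTF-8 and converts automatically. The UTF-16
// overload above stays available, so existing callers are unaffected.
inline std::size_t native_call(const std::string& utf8_path) {
    return native_call(utf8_to_utf16(utf8_path));
}

inline void overload_demo() {
    assert(native_call(std::string("\xE2\x82\xAC")) == 1);      // U+20AC
    assert(native_call(std::string("\xF0\x9F\x98\x80")) == 2);  // U+1F600
    assert(native_call(std::u16string(u"abc")) == 3);
}
```

Code that is encoding-agnostic keeps calling whichever overload it already used; only callers that care about the encoding need to know the second one exists.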
> - We will abandon std::string and be stuck with utf8_t which I
> *personally* already dislike :)
Any technical reason why, other than what you've already written?
> - People will probably start to use other programming languages
> (although this may be FUD)
I hate to point this out, but people are *already* using other
programming languages. :-) C++ isn't new or sexy, and has some
pain-points (though many of the most egregious ones will be solved with
C++0x). Unicode handling is one of them, and in my opinion, the utf*_t
types will only ease that.
> *Note on the encoding to be used*
> The best candidate for the widely-accepted and extensible encoding
> vaguely mentioned above is IMO UTF-8. [...]
Apparently a growing number of people agree, as do I.
> - It is extensible, so once we have done the painful transition we
> will not have to do it again. Currently UTF-8 uses 1-4 (or 1-6) byte
> sequences to encode code points, but the scheme is transparently
> extensible to 1-N bytes (unlike UCS-X and I'm not sure about
> UTF-16/32). [...]
UTF-16 can't be extended any further than its current definition, not
without a major reinterpretation. UTF-32 could in principle go up to
0xFFFFFFFF code points (and UTF-8, in its original six-byte form, up to
0x7FFFFFFF), but the standards bodies involved have agreed that they'll
never be extended past the current UTF-16 limit of U+10FFFF. Of course,
that's subject to change if the circumstances change, but nobody
foresees such a change right now.
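
To make the limits concrete, here is a minimal sketch of a UTF-8 encoder capped at U+10FFFF, which is the highest code point a UTF-16 surrogate pair can reach. The name encode_utf8 is made up for the example and doesn't come from any existing library.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// A surrogate pair carries 10 + 10 payload bits on top of 0x10000, which
// is where the U+10FFFF ceiling comes from.
static_assert(0x10000 + ((0x3FFu << 10) | 0x3FFu) == 0x10FFFF,
              "largest code point reachable by a UTF-16 surrogate pair");

// Encodes one code point as 1-4 bytes of UTF-8. Surrogates and anything
// above U+10FFFF are refused, matching the current agreed limit.
inline std::string encode_utf8(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::out_of_range("not a Unicode scalar value");
    std::string out;
    if (cp < 0x80) {                       // 7 bits  -> 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 11 bits -> 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 16 bits -> 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 21 bits -> 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Usage: one sample code point from each length class.
inline void encode_demo() {
    assert(encode_utf8(0x41).size() == 1);      // 'A'
    assert(encode_utf8(0xE9).size() == 2);      // U+00E9
    assert(encode_utf8(0x20AC).size() == 3);    // U+20AC
    assert(encode_utf8(0x10FFFF).size() == 4);  // highest code point
}
```

The lead-byte pattern is what makes the scheme extensible on paper: a fifth and sixth length class would follow the same shape, but they are deliberately left out here because nothing above U+10FFFF is assigned.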
-- 
Chad Nelson
Oak Circle Software, Inc.