Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-20 03:45:45


On Wed, Jan 19, 2011 at 8:28 PM, Dave Abrahams <dave_at_[hidden]> wrote:
>>
>> OK, I see. But, is there any chance that the standard itself would
>> be updated so that it first would recommend to use UTF-8 with C++
>> strings.
>
> Well, never say "never," but... never.  Such recommendations are not
> part of the standard's mission.  It doesn't do things like that.

My view of what the standardization committee is willing to do may be naive,
but generally I don't see why this could not be done. Other major
languages (Java, C#, etc.) picked a single "standard" encoding, and
in those languages you treat text in other encodings as a special case.

If C++ recommended the use of UTF-8, this would probably prompt
the OS and compiler vendors to follow, or at least to fix their implementations
of the standard library and the OS APIs to accept UTF-8 by default
(if we agree that this is a good idea).

>
>> After some period of time all other encodings would be deprecated
>
> By whom?

By the same committee that made the recommendation in the first place.

>
>> I really see all the obstacles that prevent us from just switching
>> to UTF-8, but adding a new string class will not help for the same
>> reasons adding wstring did not help.
>
> I don't see the parallel at all.  wstring is just another container of
> bytes, for all practical purposes.  It doesn't imply any particular
> encoding, and does nothing to segregate the encoded from the raw.

Maybe wstring is not officially UTF-16 or UTF-32 or UCS, but
on most platforms it is at least treated as "the Unicode string",
however vague that term is. What I am afraid of is
that, just as the use of wchar_t and wstring spawned the dual
interface used by the WinAPI and followed by many others (including
myself in the past), introducing a third (semi-)standard string class
will spawn a "ternary" interface (but I may be wrong or mixing up the
order of the events mentioned above).

>
>> As I already said elsewhere I think that this is a problem that has
>> to be solved "organizationally".
>
> Perhaps.  The type system is one of our organizational tools, and
> Boost has an impact insofar as it produces components that people use,
> so if we aren't able to produce some flagship library components that
> help with the solution, we have little traction.

I believe in strong typing, but... OK, for the sake of argument, where
do we imagine utf8_t (or whatever its name will be) will be used, and
what is our long-term plan for std::string?

If I design a library or an application, should I use utf8_t everywhere,
as the type of class member variables and of the parameters of functions
and constructors? Or should I stick to std::string (or perhaps wstring)
for maximum compatibility with the rest of the world?
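
To make the question concrete, here is a minimal sketch of what such a
type might look like. Everything in it is an assumption made for
illustration: the class layout, the throwing constructor and the helper
is_valid_utf8 are hypothetical, not an existing Boost or standard
component. The idea is only that "raw" bytes enter through one checked
door, and every interface that takes utf8_t can then rely on the encoding.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: true if `s` is well-formed UTF-8. Rejects stray
// continuation bytes, truncated sequences, overlong forms, surrogates
// and values above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;              // invalid lead byte
        if (i + len > n) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        // Smallest code point that legitimately needs `len` bytes.
        static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;               // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}

// Hypothetical strong type: holds bytes already checked to be UTF-8.
class utf8_t {
public:
    // The only way in from "raw" bytes; throws if they are not UTF-8.
    explicit utf8_t(std::string raw) : data_(std::move(raw)) {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("not well-formed UTF-8");
    }
    // Explicit way back out, for interfaces that still take std::string.
    const std::string& bytes() const { return data_; }
private:
    std::string data_;
};
```

With a type like this, a function declared as void f(const utf8_t&)
cannot be called with a plain std::string by accident, which is the
segregation being asked for. The cost is the one described above: every
boundary with std::string-based code needs an explicit conversion.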

>
>> >> > *Scenario E:* We add another string class and everyone adopts it
>> >>
>> I meant that for example on POSIX OSes the POSIX C-API
>> did not have to be changed or extended by a new set of functions
>> doing the same things, but using a new character type, when they
>> switched from the old encodings to UTF-8.
>
> ...and people still have the problem that they lose track of what's
> "raw" and what's encoded as utf-8.

Yes, but in the end they will get used to it. There are many dangerous
things in C++ (for example dereferencing a null or dangling pointer,
doing C pointer arithmetic in the presence of inheritance, etc.) that you
should not do, and mixing UTF-8 with other encodings would be one
of them. It is a breaking change, but it would not be the first one in
C++'s history.

>
>> To compare two strings you still can use strcmp and not utf8strcmp,
>> to collate strings you use strcoll and not utf8strcoll, etc.
>
> Yeah... but surely POSIX's strcmp only tells you whether the two
> strings have the same sequence of code points, not whether they have
> the same characters, right?  And if you inadvertently compare a "raw"
> string with an equivalent utf-8-encoded string, what happens?

Undefined behavior, your application segfaults, aborts, silently fails...
(What happens if you dereference a dangling pointer?)
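
For reference, a small sketch of the comparison being discussed. The
names e_nfc, e_nfd, e_latin1 and same_bytes are made up for this example.
As far as the C library is concerned the strcmp call itself is well
defined, since it just compares bytes; the failure shows up at the
application level, where three spellings of the same character all
compare as different.

```cpp
#include <cstring>

// "é" as precomposed U+00E9, encoded in UTF-8: C3 A9
const char* const e_nfc = "\xC3\xA9";
// "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT, UTF-8: 65 CC 81
const char* const e_nfd = "e\xCC\x81";
// "é" in ISO-8859-1 (a "raw" legacy encoding): E9
const char* const e_latin1 = "\xE9";

// strcmp-based equality: true only if the two byte sequences are identical.
inline bool same_bytes(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}
```

Here same_bytes(e_nfc, e_nfd) and same_bytes(e_nfc, e_latin1) are both
false, although all three strings are meant to be the same character.
Getting the first pair to match needs Unicode normalization, and the
second needs knowledge of the source encoding; neither is something
strcmp can provide.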

BR,

Matus

