Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-01-19 14:28:09


At Wed, 19 Jan 2011 20:03:59 +0100,
Matus Chochlik wrote:
>
> On Wed, Jan 19, 2011 at 7:39 PM, Dave Abrahams <dave_at_[hidden]> wrote:
> >
> > Our influence, if we introduce new library components, is very great,
> > because they're on a de-facto fast track to standardization, and an
> > improved string library is exactly the sort of thing that would be
> > adopted upstream.  If we simply agree to a programming convention,
> > that will have some impact, but much less.
>
> OK, I see. But is there any chance that the standard itself would
> be updated so that it first recommends using UTF-8 with C++
> strings?

Well, never say "never," but... never. Such recommendations are not
part of the standard's mission. It doesn't do things like that.

> After some period of time all other encodings would be deprecated

By whom?

> and using them would cause undefined behavior. Could Boost be the
> driving force here?

This doesn't seem like a very plausible scenario to me, based on my
experience. Of course, others may disagree.

> I really see all the obstacles that prevent us from just switching
> to UTF-8, but adding a new string class will not help for the same
> reasons adding wstring did not help.

I don't see the parallel at all. wstring is just another container of
bytes, for all practical purposes. It doesn't imply any particular
encoding, and does nothing to segregate the encoded from the raw.

> As I already said elsewhere I think that this is a problem that has
> to be solved "organizationally".

Perhaps. The type system is one of our organizational tools, and
Boost has an impact insofar as it produces components that people use,
so if we aren't able to produce some flagship library components that
help with the solution, we have little traction.
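
As an illustration of what "using the type system" to segregate encoded
text from raw bytes might look like, here is a minimal C++17 sketch. The
names utf8_string, from_raw, and count_code_points are hypothetical and are
not part of Boost or the standard library, and the validator is
deliberately simplified.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical wrapper type: holds bytes that passed a UTF-8 structure check.
// A plain std::string stays "raw"; only from_raw() can produce a utf8_string.
class utf8_string {
public:
    // Returns an empty optional when the bytes are not well-formed.
    static std::optional<utf8_string> from_raw(std::string bytes) {
        if (!is_valid_utf8(bytes)) return std::nullopt;
        return utf8_string(std::move(bytes));
    }

    const std::string& bytes() const { return bytes_; }

private:
    explicit utf8_string(std::string b) : bytes_(std::move(b)) {}

    // Simplified check (assumption: good enough for a sketch). It verifies
    // lead/continuation byte structure but does not reject every overlong
    // form or encoded surrogate.
    static bool is_valid_utf8(const std::string& s) {
        std::size_t i = 0;
        const std::size_t n = s.size();
        while (i < n) {
            const unsigned char c = static_cast<unsigned char>(s[i]);
            std::size_t len = 0;
            if (c < 0x80) len = 1;
            else if (c >= 0xC2 && c <= 0xDF) len = 2;
            else if (c >= 0xE0 && c <= 0xEF) len = 3;
            else if (c >= 0xF0 && c <= 0xF4) len = 4;
            else return false;              // stray continuation or invalid lead byte
            if (i + len > n) return false;  // truncated sequence
            for (std::size_t k = 1; k < len; ++k) {
                const unsigned char cc = static_cast<unsigned char>(s[i + k]);
                if ((cc & 0xC0) != 0x80) return false;
            }
            i += len;
        }
        return true;
    }

    std::string bytes_;
};

// Accepts only checked text, so passing a raw std::string is a compile error.
inline std::size_t count_code_points(const utf8_string& s) {
    std::size_t count = 0;
    for (unsigned char c : s.bytes()) {
        if ((c & 0xC0) != 0x80) ++count;  // count every non-continuation byte
    }
    return count;
}
```

With a type like this, a function signature states whether it expects
checked UTF-8 or arbitrary bytes, and the conversion between the two is a
single visible call that can fail.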

> >> > *Scenario E:* We add another string class and everyone adopts it
> >>
> >> Ok I admit that this is possible. But let me ask: How did the C world
> >> make the transition without abandoning char?
> >
> > The transition from what to what?
>
> I meant that for example on POSIX OSes the POSIX C-API
> did not have to be changed or extended by a new set of functions
> doing the same things, but using a new character type, when they
> switched from the old encodings to UTF-8.

...and people still have the problem that they lose track of what's
"raw" and what's encoded as utf-8.

> To compare two strings you can still use strcmp and not utf8strcmp,
> to collate strings you use strcoll and not utf8strcoll, etc.

Yeah... but surely POSIX's strcmp only tells you whether the two
strings have the same sequence of code points, not whether they have
the same characters, right? And if you inadvertently compare a "raw"
string with an equivalent utf-8-encoded string, what happens?
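
The two questions above can be made concrete with a small C++ sketch. It
assumes ISO-8859-1 (Latin-1) as the "raw" encoding, which the thread does
not specify, and same_bytes is a hypothetical helper name. strcmp compares
bytes, so it treats two spellings of the same character as different and
cannot tell that a Latin-1 string and a UTF-8 string mean the same text.

```cpp
#include <cstring>

// "é" as the single precomposed code point U+00E9, encoded in UTF-8.
const char precomposed[] = "\xC3\xA9";

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT, encoded in UTF-8.
const char decomposed[] = "e\xCC\x81";

// "é" in ISO-8859-1 (assumed here to be the "raw" encoding).
const char latin1[] = "\xE9";

// Hypothetical helper: true when strcmp sees identical byte sequences.
inline bool same_bytes(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}
```

Here same_bytes(precomposed, decomposed) and same_bytes(precomposed, latin1)
are both false, with no diagnostic in either case. Treating the first pair
as equal requires Unicode normalization; treating the second pair as equal
requires knowing each string's encoding, which a bare char* does not carry.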

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
