Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-01-19 23:13:45


At Thu, 20 Jan 2011 00:07:18 +0200,
Peter Dimov wrote:
>
> Dave Abrahams wrote:
> > At Wed, 19 Jan 2011 23:02:02 +0200,
> > Peter Dimov wrote:
> > > My answer is different. T is std::string, and:
> > > - on POSIX OSes, this string is taken directly from the OS and given
> > >   directly to the OS, without any conversion;
> > > - on Windows, this string is UTF-8 and is converted to UTF-16 before
> > >   being given to the OS, and converted from UTF-16 after being received
> > >   from it. This conversion should tolerate broken UTF-16 because the OS
> > >   does so as well.
>
> ...
>
> > I prefer to have semantic constraints/invariants like "this is UTF-8
> > encoded" represented in the type system and enforced by public library
> > interfaces. I'm arguing for a future like that.
>
> But the semantics I outlined above only have this constraint under
> Windows.

Sorry, I don't understand what you're saying here.

But let me say a little more about my point; maybe that will help. If
I get a std::string from "somewhere", I don't know what encoding it's
in, if any. The abstraction presented by std::string is essentially
"sequence of individually addressable and mutable chars that by
convention represents text in some unspecified way." It has lots of
interface that is aimed at manipulating the raw sequence of chars, and
none that helps with an interpretation of those chars.
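
To make that concrete, here is a rough sketch of how the raw-char interface
behaves on UTF-8 data. The sample text and the helper names
(count_code_points, sample, corrupt) are made up for illustration, and the
counting helper assumes its input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8 input.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

// The word "naive" spelled with U+00EF, which UTF-8 encodes as 0xC3 0xAF.
inline std::string sample() {
    return std::string("na" "\xC3\xAF" "ve");
}

// Overwrites a single char, exactly as std::string's interface permits.
// Only the lead byte 0xC3 is replaced, so the continuation byte 0xAF is
// left stranded and the result is no longer valid UTF-8.
inline std::string corrupt(std::string s) {
    s[2] = 'i';
    return s;
}
```

Here size() reports bytes rather than characters, and nothing in the
interface stops the single-char assignment from breaking the encoding.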

IIUC, you're talking about changing the abstraction presented by
std::string to "sequence of individually addressable and mutable chars
that by convention represents text encoded as UTF-8."

I would prefer to be handling something that presents the abstraction
"character string." I'm not sure exactly what that looks like, but
I'm pretty sure the "individually addressable and mutable chars" part
should go. I'd like to see an interface that prevents corrupting the
underlying data such that it no longer represents a valid sequence of
characters (or at least makes it highly unlikely that such corruption
could happen accidentally). Furthermore, there are lots of string-y
things I'd want to do that aren't provided—or aren't provided well—by
std::string, e.g. if (s1.starts_with(s2)) {...}
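
A very rough sketch of the kind of interface I mean follows. The class name
utf8_text and all of its members are hypothetical, not an existing Boost or
standard type, and the validity check is deliberately simplified: it checks
only lead/continuation byte structure and does not reject overlong forms or
surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical text type: the "valid UTF-8" invariant is established in the
// constructor and no member hands out mutable access to individual chars.
class utf8_text {
public:
    // Throws std::invalid_argument if the bytes fail the structural check.
    explicit utf8_text(std::string bytes) : bytes_(std::move(bytes)) {
        if (!is_valid(bytes_)) throw std::invalid_argument("not valid UTF-8");
    }

    // Read-only view of the encoded bytes.
    const std::string& bytes() const { return bytes_; }

    // Concatenating two valid sequences yields a valid sequence.
    utf8_text& append(const utf8_text& other) {
        bytes_ += other.bytes_;
        return *this;
    }

    // Prefix test at the code point level. A byte-wise comparison is enough
    // because both operands are valid UTF-8. It does not account for
    // normalization or grapheme clusters.
    bool starts_with(const utf8_text& prefix) const {
        return bytes_.compare(0, prefix.bytes_.size(), prefix.bytes_) == 0;
    }

private:
    // Simplified structural check: each lead byte must be followed by the
    // right number of continuation bytes (10xxxxxx).
    static bool is_valid(const std::string& s) {
        std::size_t i = 0;
        while (i < s.size()) {
            unsigned char c = static_cast<unsigned char>(s[i]);
            std::size_t len = c < 0x80 ? 1
                            : (c & 0xE0) == 0xC0 ? 2
                            : (c & 0xF0) == 0xE0 ? 3
                            : (c & 0xF8) == 0xF0 ? 4 : 0;
            if (len == 0 || i + len > s.size()) return false;
            for (std::size_t k = 1; k < len; ++k)
                if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                    return false;
            i += len;
        }
        return true;
    }

    std::string bytes_;
};
```

With something like this, an invalid byte sequence is rejected once at the
boundary, and code that receives a utf8_text can rely on the encoding
without re-checking it.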

Does this make more sense?

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
