Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-01-21 12:28:12


At Thu, 20 Jan 2011 06:43:48 +0200,
Peter Dimov wrote:
>
> Dave Abrahams wrote:
> > IIUC, you're talking about changing the abstraction presented by
> > std::string to "sequence of individually addressable and mutable chars
> > that by convention represents text encoded as utf-8."
>
> Something like that. string is just char[] with value semantics. It
> doesn't necessarily hold a valid UTF-8 sequence.

Right.

> > I would prefer to be handling something that presents the abstraction
> > "character string." I'm not sure exactly what that looks like, but
> > I'm pretty sure the "individually addressable and mutable chars" part
> > should go. I'd like to see an interface that prevents corrupting the
> > underlying data such that it no longer represents a valid sequence of
> > characters (or at least makes it highly unlikely that such corruption
> > could happen accidentally). Furthermore, there are lots of string-y
> > things I'd want to do that aren't provided—or aren't provided well—by
> > std::string, e.g. if (s1.starts_with(s2)) {...}
> >
> > Does this make more sense?
>
> It makes sense in the abstract. But there is no way to protect against
> corruption without also setting an invariant that the sequence is not
> corrupted (represents valid UTF-8), and I don't usually need such a
> string in the interfaces we're discussing, although it can certainly
> be useful on its own. The interfaces that talk to the OS need to be
> able to carry arbitrary char sequences (in the POSIX case).

Yup. Then they should be handling raw_string, right?

> Even an interface that displays the string, one that by necessity
> must interpret it as UTF-8, should preferably handle invalid UTF-8
> and display some placeholders instead of the invalid subsequence -
> it's better for the user to see parts of the string than nothing at
> all.

Yep. Then I guess that should be handling raw_string, too.

> It's even worse to abort the whole operation with an invalid_utf8
> exception.

Yowp.

So you want a "resilient utf-8 string": something that can represent
any sequence of chars and, when interpretation is necessary, will
interpret them as utf-8, using some kind of best-effort error
recovery to avoid hard errors.

Then you can have an is_valid_utf_8() routine that is used to check
for validity when/if you need it.
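
To make the "resilient" idea concrete, here is a minimal sketch, not a
proposal for a real interface. Apart from is_valid_utf_8(), every name
in it (the sketch namespace, sequence_length, to_displayable) is
hypothetical. It also assumes one particular recovery policy: each byte
that does not begin a well-formed sequence is replaced by U+FFFD for
display, while the stored chars are left untouched. Other policies are
equally defensible.

```cpp
#include <cstddef>
#include <string>

namespace sketch {   // hypothetical namespace, for illustration only

// Length (1-4) of the well-formed UTF-8 sequence starting at p,
// or 0 if the bytes at p do not form one.
inline std::size_t sequence_length(unsigned char const* p, std::size_t avail)
{
    if (avail == 0) return 0;
    unsigned char const b0 = p[0];
    if (b0 < 0x80) return 1;                          // ASCII

    std::size_t len;
    unsigned long cp;
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
    else return 0;        // stray continuation byte, 0xC0/0xC1, or > 0xF4

    if (avail < len) return 0;                        // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((p[k] & 0xC0) != 0x80) return 0;          // not a continuation byte
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    if (len == 3 && cp < 0x800) return 0;             // overlong encoding
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;       // surrogate code point
    return len;
}

// Validity check to call when/if it is needed; nothing enforces it.
inline bool is_valid_utf_8(std::string const& s)
{
    unsigned char const* p = reinterpret_cast<unsigned char const*>(s.data());
    for (std::size_t i = 0, n = s.size(); i < n; ) {
        std::size_t const len = sequence_length(p + i, n - i);
        if (len == 0) return false;
        i += len;
    }
    return true;
}

// Best-effort interpretation for display: well-formed sequences are copied
// through, every other byte becomes U+FFFD (encoded as EF BF BD).
inline std::string to_displayable(std::string const& raw)
{
    static char const replacement[] = "\xEF\xBF\xBD";
    unsigned char const* p = reinterpret_cast<unsigned char const*>(raw.data());
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0, n = raw.size(); i < n; ) {
        std::size_t const len = sequence_length(p + i, n - i);
        if (len == 0) { out += replacement; ++i; }
        else          { out.append(raw, i, len); i += len; }
    }
    return out;
}

} // namespace sketch
```

With something like this, an interface that talks to the OS keeps passing
the raw chars along, and only the display path calls to_displayable().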

I can understand the argument that there's not much to be gained from
the type system here.

I still like the idea of using something with a real string interface:

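  #include <string>

  namespace boost {

     struct text
     {
         // Wraps an existing std::string (copies it into storage).
         explicit text(std::string);
         // Read-only access to the underlying std::string.
         operator std::string const&() const { return storage; }

         // ...
         bool startswith(text const& s) const;
         bool endswith(text const& s) const;
         text trim() const;
         // ...

     private:
         std::string storage;
     };
  }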

but I do wonder whether it's worth writing (or paying for the
copy in)

    x.startswith(text(some_std_string))

and in general whether the cost of copying std::strings into
text::storage is too high.
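
One way the copying cost might be reduced, sketched under assumptions: the
constructor moves from its by-value argument (this needs C++0x rvalue
references), and each query gets an overload taking std::string const&, so
no text has to be built just to ask a question. The sketch namespace, the
free functions starts_with/ends_with and the extra overloads are all
hypothetical additions, not part of the interface above.

```cpp
#include <string>
#include <utility>

namespace sketch {   // hypothetical namespace, for illustration only

// Free functions that work directly on std::string: no wrapper, no copy.
inline bool starts_with(std::string const& s, std::string const& prefix)
{
    return s.size() >= prefix.size()
        && s.compare(0, prefix.size(), prefix) == 0;
}

inline bool ends_with(std::string const& s, std::string const& suffix)
{
    return s.size() >= suffix.size()
        && s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

struct text
{
    // By-value parameter plus move: an rvalue argument is moved into
    // storage, an lvalue argument still costs exactly one copy.
    explicit text(std::string s) : storage(std::move(s)) {}
    operator std::string const&() const { return storage; }

    bool startswith(text const& s) const { return starts_with(storage, s.storage); }
    bool endswith(text const& s) const   { return ends_with(storage, s.storage); }

    // Assumed extra overloads: compare against a std::string in place.
    bool startswith(std::string const& s) const { return starts_with(storage, s); }
    bool endswith(std::string const& s) const   { return ends_with(storage, s); }

private:
    std::string storage;
};

} // namespace sketch
```

With these overloads, x.startswith(some_std_string) compiles and copies
nothing. The price is a wider interface, and it does not help code that
needs a text object it can keep.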

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
