Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-01-21 12:28:12
At Thu, 20 Jan 2011 06:43:48 +0200,
Peter Dimov wrote:
>
> Dave Abrahams wrote:
> > IIUC, you're talking about changing the abstraction presented by
> > std::string to "sequence of individually addressable and mutable chars
> > that by convention represents text encoded as utf-8."
>
> Something like that. string is just char[] with value semantics. It
> doesn't necessarily hold a valid UTF-8 sequence.
Right.
> > I would prefer to be handling something that presents the abstraction
> > "character string." I'm not sure exactly what that looks like, but
> > I'm pretty sure the "individually addressable and mutable chars" part
> > should go. I'd like to see an interface that prevents corrupting the
> > underlying data such that it no longer represents a valid sequence of
> > characters (or at least makes it highly unlikely that such corruption
> > could happen accidentally). Furthermore, there are lots of string-y
> > things I'd want to do that aren't provided - or aren't provided well - by
> > std::string, e.g. if (s1.starts_with(s2)) {...}
> >
> > Does this make more sense?
>
> It makes sense in the abstract. But there is no way to protect against
> corruption without also setting an invariant that the sequence is not
> corrupted (represents valid UTF-8), and I don't usually need such a
> string in the interfaces we're discussing, although it can certainly
> be useful on its own. The interfaces that talk to the OS need to be
> able to carry arbitrary char sequences (in the POSIX case).
Yup. Then they should be handling raw_string, right?
> Even an interface that displays the string, one that by necessity
> must interpret it as UTF-8, should preferably handle invalid UTF-8
> and display some placeholders instead of the invalid subsequence -
> it's better for the user to see parts of the string than nothing at
> all.
Yep. Then I guess that should be handling raw_string, too.
> It's even worse to abort the whole operation with an invalid_utf8
> exception.
Yowp.
So you want a "resilient utf-8 string": something that can represent
any sequence of chars and, when interpretation is necessary, will
interpret them as utf-8, using some kind of best-effort error
recovery to avoid hard errors.
Then you can have an is_valid_utf_8() routine that is used to check
for validity when/if you need it.
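
To make the "resilient" idea concrete, here is a minimal sketch, not a proposed Boost interface. Only the name is_valid_utf_8() comes from the text above; utf8_sequence_length() and to_displayable_utf_8() are hypothetical helpers invented for illustration. It assumes the well-formedness rules of the Unicode Standard (no overlong forms, no surrogates, nothing above U+10FFFF) and, as one possible recovery policy, replaces each offending byte with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence that
// starts at s[i], or 0 if the bytes at that position do not form one.
inline std::size_t utf8_sequence_length(std::string const& s, std::size_t i)
{
    assert(i < s.size());
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };

    unsigned char const b0 = byte(i);
    if (b0 < 0x80) return 1;                      // plain ASCII

    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;           // allowed range of 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;                // rejects overlong forms
        if (b0 == 0xED) hi = 0x9F;                // rejects surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;                // rejects overlong forms
        if (b0 == 0xF4) hi = 0x8F;                // rejects > U+10FFFF
    } else {
        return 0;                                 // stray continuation or bad lead
    }

    if (i + len > s.size()) return 0;             // truncated sequence
    if (byte(i + 1) < lo || byte(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k)
        if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return 0;
    return len;
}

// Validity check, to be called only when/if the caller needs it.
inline bool is_valid_utf_8(std::string const& s)
{
    for (std::size_t i = 0; i < s.size(); ) {
        std::size_t const n = utf8_sequence_length(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}

// Best-effort recovery (an assumed policy): copy well-formed sequences
// unchanged and emit U+FFFD for each byte that cannot be interpreted.
inline std::string to_displayable_utf_8(std::string const& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ) {
        std::size_t const n = utf8_sequence_length(s, i);
        if (n == 0) { out += "\xEF\xBF\xBD"; ++i; }
        else        { out.append(s, i, n); i += n; }
    }
    return out;
}
```

With this split, the string itself carries arbitrary chars (as the POSIX-facing interfaces need), validation is opt-in, and a display routine never has to throw.
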
I can understand the argument that there's not much to be gained from
the type system here.
I still like the idea of using something with a real string interface:
  namespace boost {
    struct text
    {
        explicit text(std::string);
        operator std::string const&() const { return storage; }
        ...
        bool startswith(text const& s) const;
        bool endswith(text const& s) const;
        text trim() const;
        ...
     private:
        std::string storage;
    };
  }
but I do wonder whether it's worth writing (or paying for the
copy in)
    x.startswith(text(some_std_string))
and in general whether the cost of copying std::strings into
text::storage is too high.
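
For reference, the declared members of the struct above could be filled in roughly as follows. This is a sketch under stated assumptions, not the interface being proposed: it lives in a hypothetical namespace sketch rather than boost, it compares raw bytes, and trim() strips only ASCII whitespace. The members elided with "..." in the original are left out.

```cpp
#include <cstddef>
#include <string>
#include <utility>

namespace sketch {   // hypothetical namespace, for illustration only

struct text
{
    // Takes the string by value: an lvalue argument is copied here,
    // which is exactly the cost questioned above.
    explicit text(std::string s) : storage(std::move(s)) {}

    operator std::string const&() const { return storage; }

    // True if *this begins with s (byte-wise comparison).
    bool startswith(text const& s) const
    {
        return storage.compare(0, s.storage.size(), s.storage) == 0;
    }

    // True if *this ends with s (byte-wise comparison).
    bool endswith(text const& s) const
    {
        std::size_t const n = s.storage.size();
        return n <= storage.size()
            && storage.compare(storage.size() - n, n, s.storage) == 0;
    }

    // Returns a copy with leading and trailing ASCII whitespace removed.
    text trim() const
    {
        char const* const ws = " \t\n\r\f\v";
        std::size_t const first = storage.find_first_not_of(ws);
        if (first == std::string::npos) return text(std::string());
        std::size_t const last = storage.find_last_not_of(ws);
        return text(storage.substr(first, last - first + 1));
    }

 private:
    std::string storage;
};

} // namespace sketch
```

A call such as t.startswith(sketch::text(some_std_string)) constructs a temporary text and copies some_std_string into its storage just to compare a prefix, so the copy is paid on every such call unless the argument is already a text.
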
-- Dave Abrahams BoostPro Computing http://www.boostpro.com