Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-01-18 20:43:36


At Tue, 18 Jan 2011 14:50:51 -0600,
Christian Holmquist wrote:
>
> >
> > There are two ways this could go AFAICS:
> >
> > 1. We just use std::string for UTF-8 and eventually the whole world
> > will catch up
> >
> This would be nice.
>
>
> > 2. We establish some other type for UTF-8 and *it* becomes the lingua
> > franca
> >
> If Boost abandons std::string in interfaces that expect UTF-8,
> does that mean I as a user need to sprinkle
> boost::to_utf_8(my_std_string,...) // in whatever form to_utf8 may be
> all over my/our (quite gigantic) code base?

Only if you're going to adopt new, currently nonexistent boost
interfaces that operate on the new utf-8 type, or if you decide to do
a wholesale adoption of that type in place of std::string. The latter
sounds like quite a huge investment for your codebase, so it's
probably not a good idea in the short-term but it might be a good
long-term move.
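
To make the cost concrete, here is a minimal sketch of what a separate type could look like. Both utf8_string and byte_length are hypothetical names made up for illustration; nothing like them exists in Boost today. The point is only that existing code taking std::string is untouched, and a conversion has to be written out solely where a new-style interface is called.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper type; not an existing Boost or standard class.
class utf8_string
{
public:
    // explicit: a plain std::string never converts silently, so the
    // caller has to state that the bytes are meant to be UTF-8.
    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {}

    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// Hypothetical new-style interface that accepts only the tagged type.
std::size_t byte_length(const utf8_string& s) { return s.bytes().size(); }

// Passing a std::string directly to byte_length() does not compile,
// because there is no implicit conversion.
static_assert(!std::is_convertible<std::string, utf8_string>::value,
              "std::string must not convert implicitly");
```

A call site would read byte_length(utf8_string(my_std_string)). This sketch does no validation at all; whether the constructor should check its input is a separate design question.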

> Without doing so, I assume it will cause compilation errors, but for what gain?
> If some code was broken before, it will remain so after I've injected all
> those to_utf8 calls as well.

I'm not talking about breaking any existing code.

> To solve actual problems I need to track the origin of my std::string's
> content, which requires a traditional bug-hunting session anyway.
> No additional typed interface in the world will help me here IMO.

Help you where? It doesn't sound like you have a problem you want to
solve. If you like the status quo, don't change anything.

> > Aren't things still enough of a mess out there that #2 is just as
> > likely to work well?
>
> "Just as likely to work well" doesn't sound good enough for me, from a
> maintenance point of view.

Huh? If it works just as well as the alternative, it does. If it's a
huge hassle compared to the alternative, it doesn't work. I don't
claim to know the answer to my question, but if the answer turned out
to be "it's just as likely to work well," I don't understand how you
could object.

> I can picture how the changeset looks on the poor
> branch that decides to upgrade to such a version of boost.
> The problem isn't the type, but the content.
>
> There are algorithms in stl that have requirements on their input
> (sorted, usually), why is this different?

There are also types in the STL that guarantee sortedness. If you
write an algorithm that has to do a set intersection and you're
operating on std::vectors, you have to explicitly document that
they're required to be sorted, and the user of your algorithm has to
carefully conform to that requirement without help from the compiler.
If you accept only std::sets, the requirement is implicit and enforced
by the compiler.
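
A minimal sketch of the two styles, with intersect_sorted and intersect as illustrative names of my own choosing. In the first, the precondition lives in a comment and, at best, a debug-only assert. In the second, the parameter type carries it.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Precondition stated only in documentation: both inputs must be sorted
// in ascending order. The compiler cannot check this for the caller.
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));  // checked in debug builds only
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Precondition carried by the type: a std::set always iterates in
// sorted order, so there is nothing for the caller to get wrong.
std::set<int> intersect(const std::set<int>& a, const std::set<int>& b)
{
    std::set<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.end()));
    return out;
}
```

Handing an unsorted vector to the first function compiles cleanly and, in a release build, quietly produces a wrong answer. The second function cannot be called with unsorted data at all.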

> I'm sure there would be no support for introducing a
> sorted_value_input_iterator that I can pass to std::set_xxx
> functions. (?).

No, it wouldn't.

> What would be helpful if doable, is to build boost with
> BOOST_TRACK_INVALID_UTF_8, also for release builds.
> This would cause an exception or a call to user-defined function if boost
> code stumbles upon bad strings.

In my experience with Python, which uses exactly that strategy, it
works badly. The problem is that so many common strings are just
ASCII, and thus are not changed by encoding/decoding in UTF-8, so it's
very easy to overlook a problem until very late in the game, and when
it *is* detected, that often happens very far away from the code that
should have done the encoding/decoding in the first place.
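
A small sketch of why a runtime check catches so little. looks_like_utf8 is a hypothetical, deliberately simplified validator (it checks lead and continuation bytes only, and does not reject surrogates or every overlong form); it stands in for whatever check such a build option might perform.

```cpp
#include <cstddef>
#include <string>

// Simplified structural check: each lead byte must be followed by the
// right number of continuation bytes (10xxxxxx).
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80)                    extra = 0;  // plain ASCII
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;  // 2-byte sequence
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;  // 3-byte sequence
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;       // stray continuation or invalid lead byte

        if (s.size() - i - 1 < extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;   // not a continuation
        }
        i += extra + 1;
    }
    return true;
}
```

With this check, "hello" passes no matter which ASCII-compatible encoding the producing code had in mind, so a missing conversion goes unnoticed. Only when something like the Latin-1 bytes "caf\xE9" shows up does the check fail, and by then the string may be a long way from where it was created.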

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
