Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-18 20:50:53

On Tue, 18 Jan 2011 14:50:51 -0600
Christian Holmquist <c.holmquist_at_[hidden]> wrote:

>> There are two ways this could go AFAICS: [...]
>> 2. We establish some other type for UTF-8 and *it* becomes the lingua
>> franca
> If Boost abandons std::string in interfaces that expect UTF-8, does
> that mean I as a user need to sprinkle
> boost::to_utf_8(my_std_string,...) // in whatever form to_utf8 may be
> all over my/our (quite gigantic) code base?

Only for functions that need to know the encoding of a string. As
Artyom has rightly pointed out, most functions operate perfectly well
by treating strings as opaque blocks of data, or as individual bytes.
It's only things like Boost.Regex or some of the string-manipulation
functions that might want to act a bit differently in the face of
multi-byte characters. Or, of course, newly-written functions in user
code, outside of the Boost library.
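
To illustrate the distinction, here is a minimal sketch. The function
names truncate_bytes and truncate_utf8 are hypothetical, not part of
Boost, and the second one assumes its input is well-formed UTF-8.
Copying or concatenating never needs to know the encoding; cutting a
string at a byte offset does, because the cut can land in the middle
of a multi-byte character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encoding-agnostic: treats the string as an opaque run of bytes.
// May cut a multi-byte UTF-8 character in half.
std::string truncate_bytes(const std::string& s, std::size_t max_bytes) {
    return s.substr(0, max_bytes);
}

// Encoding-aware (hypothetical helper): never returns more than
// max_bytes, and backs up so the cut falls on a code point boundary.
// Assumes s is well-formed UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // Continuation bytes have the bit pattern 10xxxxxx; a cut in
    // front of one would split a character, so step back past them.
    while (end > 0 &&
           (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}

void example_truncate() {
    const std::string cafe = "caf\xC3\xA9";  // "café", 5 bytes
    // Opaque handling is fine for concatenation.
    assert((cafe + cafe).size() == 10);
    // A byte-wise cut leaves a dangling lead byte ...
    assert(truncate_bytes(cafe, 4) == std::string("caf\xC3"));
    // ... while the encoding-aware cut drops the whole character.
    assert(truncate_utf8(cafe, 4) == "caf");
}
```

Only the second function would need a UTF-8-specific parameter type;
the first can keep taking std::string.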

> Not doing so, I assume, will cause compilation errors, but for what
> gain? If some code was broken before, it will remain so after I've
> injected all those to_utf8 calls as well.
> To solve actual problems I need to track the origin of my
> std::string's content, which requires a traditional bug-hunting
> session anyway. No additional typed interface in the world will help
> me here IMO.

Maybe. But having a function whose parameters or return type is
explicitly utf8_t will tell you (and the compiler) exactly what kind of
string it's expecting, right in the code, whereas something that takes
or returns an std::string doesn't. If you have to look up that
information in the documentation, you're a lot more likely to miss it.
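
A rough sketch of what I mean follows. The utf8_t class here is a
hypothetical stand-in written for this example, not an existing Boost
type, and count_code_points is likewise made up and assumes
well-formed input. The point is only that the explicit constructor
makes the encoding part of the signature.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper type; it only tags the bytes as UTF-8.
class utf8_t {
public:
    // explicit: a plain std::string does not convert silently.
    explicit utf8_t(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& raw() const { return bytes_; }
private:
    std::string bytes_;
};

// The signature itself says which encoding is expected.
// Counts code points by skipping continuation bytes (10xxxxxx);
// assumes the input is well-formed UTF-8.
std::size_t count_code_points(const utf8_t& text) {
    std::size_t n = 0;
    for (unsigned char c : text.raw()) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// The compiler enforces the distinction: no implicit conversion.
static_assert(!std::is_convertible<std::string, utf8_t>::value,
              "std::string must be converted to utf8_t explicitly");

void example_typed() {
    std::string s = "h\xC3\xA9" "llo";  // "héllo", 6 bytes
    // count_code_points(s);            // would not compile
    assert(count_code_points(utf8_t(s)) == 5);
}
```

The explicit conversion at the call site is the "sprinkling" you're
worried about, but it only shows up where a function has declared that
it cares about the encoding.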

> [...] What would be helpful, if doable, is to build boost with
> BOOST_TRACK_INVALID_UTF_8, also for release builds.
> This would cause an exception or a call to a user-defined function if
> boost code stumbles upon bad strings.

Interesting idea, but it pushes the problem entirely to runtime. Having
utf*_t types lets the compiler do at least some of the work for you.
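
The two approaches can be combined. The sketch below is hypothetical
(checked_utf8 and is_structurally_valid_utf8 are names invented for
this example) and the check is deliberately simplified: it verifies
lead and continuation byte structure only and does not reject overlong
encodings or surrogate code points. The runtime check happens once,
at the point of conversion, and the type carries that fact through the
rest of the program.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Simplified structural check: each lead byte must be followed by
// the right number of continuation bytes. Not a full validator.
bool is_structurally_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80)                extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 110xxxxx
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 11110xxx
        else return false;  // stray continuation or invalid lead
        if (extra > s.size() - i - 1) return false;  // truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char t =
                static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical type: construction is the single runtime checkpoint.
class checked_utf8 {
public:
    explicit checked_utf8(const std::string& bytes) : bytes_(bytes) {
        if (!is_structurally_valid_utf8(bytes_))
            throw std::invalid_argument("not valid UTF-8");
    }
    const std::string& raw() const { return bytes_; }
private:
    std::string bytes_;
};

void example_checked() {
    assert(checked_utf8("caf\xC3\xA9").raw().size() == 5);
    bool threw = false;
    try {
        checked_utf8 bad(std::string("\xC3"));  // lone lead byte
    } catch (const std::invalid_argument&) {
        threw = true;
    }
    assert(threw);
}
```

Any function taking a checked_utf8 can then skip the check, since the
compiler guarantees the value went through the constructor.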

Chad Nelson
Oak Circle Software, Inc.
