Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-21 12:32:42


On Fri, 21 Jan 2011 01:35:15 -0800
Patrick Horgan <phorgan1_at_[hidden]> wrote:

>On 01/20/2011 12:52 PM, Mostafa wrote:
>
>> On second thought, is there really a need to access the underlying
>> data of utf8_t? I argue that having a view of the underlying data
>> via iterators accomplishes just as much(*), and is more in line with
>> the STL tradition of containers and iterators, not to mention the
>> better encapsulation it affords the interface. Do clients really
>> need to know, and potentially develop a dependency on, the fact that
>> utf8_t (for now?) is really just a wrapper for std::string?
>
> What type would be returned by operator* on the iterator for a
> utf8_string? [...]

Which iterator? ;-) As I'd envisioned it, there would be three: an
element iterator using char, a code-point iterator using char32_t, and
a true character iterator using a custom class. The custom class might
be ugly and hard to work with, but would be guaranteed to do the right
thing.
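
To make the difference between the first two views concrete, here is a rough sketch. The helper names are made up for illustration and are not part of any proposed interface, and the input is assumed to be well-formed UTF-8 (nothing is validated).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only.
// Decodes the code point starting at s[i] and advances i past it.
// Assumes s holds well-formed UTF-8; no validation is performed.
inline char32_t next_code_point(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                        // single byte (ASCII)
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);                // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())                  // fold in continuation bytes
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Code-point view: one char32_t per encoded code point.
// (The element view is simply the chars of the std::string itself.)
inline std::vector<char32_t> code_points(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); )
        out.push_back(next_code_point(s, i));
    return out;
}
```

For the string "e\xCC\x81" (an 'e' followed by U+0301, the combining acute accent) the element view has three chars and the code-point view has two entries, yet a reader sees one character. Only the third kind of iterator would report that, and it needs Unicode segmentation data that this sketch does not attempt.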

> There's a lot of other issues. Assuming it has the same interface as
> std::string how would you do max_size()? How about the comparison
> operators? [...]

max_size would have to operate on char elements, as there's no other
accurate answer. Comparison operators would either operate on
code-points or, through Boost.Locale, characters.
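
A sketch of the code-point option, assuming both operands are well-formed UTF-8. The function name is hypothetical. It relies on a property of UTF-8: comparing the bytes as unsigned values gives the same order as comparing the decoded code points, so no decoding is needed and the comparison stays linear in the number of chars.

```cpp
#include <algorithm>
#include <string>

// Hypothetical free function, for illustration only.
// Orders two well-formed UTF-8 strings by code point by comparing
// their bytes as unsigned values.
inline bool code_point_less(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
        });
}
```

This is code-point order only. It is not collation: it knows nothing about locale rules or about canonically equivalent sequences, which is where a character-level comparison through Boost.Locale would come in.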

> What would the equivalent be for utf8_string? For the above, the rhs
> is in effect converted to basic_string for the comparison. For a
> utf8_string, what if the rhs doesn't convert to utf-8? Should there
> be some conversion facet able to be specified for the rhs?

The more people discuss it, the more I think automatic conversions from
std::string to the UTF types are the wrong way to go about it. It would
be convenient, and would do the right thing in 90% of cases -- but it
would do absolutely the *wrong* thing in the other 10%, where the
std::string does *not* contain the encoding that the UTF constructor
assumes. And most developers wouldn't think about that until they ran
into it the hard way, after their programs were in widespread use.
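
One way to sketch the alternative is an explicit constructor, so that the caller has to state the UTF-8 assumption in the code. Everything here is hypothetical: the class is a minimal stand-in, not the utf8_t or utf8_string under discussion, and the check is only structural (lead and continuation byte patterns). It does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical structural check: every lead byte is followed by the
// right number of continuation bytes. Not a full UTF-8 validator.
inline bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = (c < 0x80)        ? 1
                        : (c >> 5) == 0x06  ? 2
                        : (c >> 4) == 0x0E  ? 3
                        : (c >> 3) == 0x1E  ? 4 : 0;
        if (len == 0 || i + len > n) return false;       // bad lead byte or truncated
        for (std::size_t k = 1; k < len; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}

// Hypothetical minimal wrapper, for illustration only.
class utf8_string {
public:
    // explicit: a std::string never converts silently.
    explicit utf8_string(const std::string& bytes) : data_(bytes) {
        if (!looks_like_utf8(bytes))
            throw std::invalid_argument("input is not structurally valid UTF-8");
    }
    const std::string& raw() const { return data_; }
private:
    std::string data_;
};
```

With this, passing a std::string where a utf8_string is expected fails to compile, and a Latin-1 string such as "caf\xE9" is rejected when the wrapper is constructed. A check like this cannot catch every mismatch, since text in another encoding can happen to form valid UTF-8 byte patterns, which is why the conversion should be something the developer writes on purpose.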

> std::string's comparison operators are supposed to take linear time.
> [...]

Obviously the hypothetical boost::string would have some slight
differences from std::string. It would have to.

-- 
Chad Nelson
Oak Circle Software, Inc.



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk