Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-21 04:35:15


On 01/20/2011 12:52 PM, Mostafa wrote:
> ... elision by patrick ...
>
> On second thought, is there really a need to access the underlying
> data of utf8_t? I argue that having a view of the underlying data via
> iterators accomplishes just as much(*), and is more in line with the
> STL tradition of containers and iterators, not to mention the better
> encapsulation it affords the interface. Do clients really need to
> know, and potentially develop a dependency on, the fact that utf8_t
> (for now?) is really just a wrapper for std::string?
What type would be returned by operator* on the iterator for a
utf8_string? char32_t? What do you do about combining characters?
Return them one at a time and let the application deal with it? That's
what I think. I don't see what else you could do. There are a lot of
other issues. Assuming it has the same interface as std::string, how
would you do max_size()? How about the comparison operators? There's:

template<typename charT, typename traits, typename Allocator>
bool operator<=(const basic_string<charT, traits, Allocator>& lhs,
                const charT* rhs);

What would the equivalent be for utf8_string? For the above, the rhs is
in effect converted to basic_string for the comparison. For a
utf8_string, what if the rhs doesn't convert to UTF-8? Should it be
possible to specify a conversion facet for the rhs? std::string's
comparison operators are supposed to take linear time. These would

capacity() is supposed to return the largest number of characters the
string can hold without reallocation. Would you return that by
considering that the smallest characters would only take one byte?
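
To make the capacity() question concrete, here is a rough sketch of the two
answers a UTF-8 wrapper could give. It assumes the wrapper stores its bytes in
a std::string and that a code point takes at most 4 bytes; the names
CodePointCapacity and code_point_capacity are made up for illustration and are
not part of any proposed interface.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: bounds on how many code points fit in the
// bytes a std::string has already reserved.
struct CodePointCapacity {
    std::size_t at_most;   // if every code point is 1 byte (ASCII)
    std::size_t at_least;  // if every code point is 4 bytes
};

// Hypothetical helper: derive both bounds from the byte capacity.
inline CodePointCapacity code_point_capacity(const std::string& storage) {
    const std::size_t bytes = storage.capacity();
    return CodePointCapacity{bytes, bytes / 4};
}
```

Returning at_most matches the "smallest characters" reading above, but only
at_least is a promise that no reallocation will happen.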

The std::string's operator[] is supposed to work in constant time. This
one couldn't. It would be fun to make it, but it would have to differ
in some ways from the specification of std::string.
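
As an illustration of why indexing by code point cannot be constant time, here
is a minimal sketch that walks a UTF-8 std::string from the front. The
function names are hypothetical, and the decoder only checks sequence lengths
and continuation bytes; it does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Length of the sequence introduced by lead byte b, or 0 if b cannot
// start a sequence (for example a stray continuation byte).
inline std::size_t sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

// Decode the code point that starts at byte offset pos.
inline char32_t decode_at(const std::string& s, std::size_t pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    const std::size_t len = sequence_length(lead);
    if (len == 0 || pos + len > s.size())
        throw std::runtime_error("malformed UTF-8");
    if (len == 1) return lead;
    // Keep only the payload bits of the lead byte.
    char32_t cp = lead & (0xFF >> (len + 1));
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("malformed UTF-8");
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}

// Return the index-th code point. Every earlier sequence has to be
// skipped first, so the cost grows with index.
inline char32_t code_point_at(const std::string& s, std::size_t index) {
    std::size_t pos = 0;
    while (pos < s.size()) {
        const std::size_t len =
            sequence_length(static_cast<unsigned char>(s[pos]));
        if (len == 0) throw std::runtime_error("malformed UTF-8");
        if (index == 0) return decode_at(s, pos);
        --index;
        pos += len;
    }
    throw std::out_of_range("code point index out of range");
}
```

Note that a base letter and a following combining mark come back as two
separate code points, which is the "one at a time" behaviour discussed above.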

How about push_back or insert? What do they take for the argument? A
char32_t encoded as UTF-32? Of course you'd have to insert combining
characters one part at a time.
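
For what it's worth, a push_back that takes a char32_t could look roughly like
the sketch below. It is written as a free function on a plain std::string,
since no utf8_string interface has been settled; append_code_point is a
made-up name.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one UTF-32 code point as UTF-8 and append
// the resulting 1 to 4 bytes to out.
inline void append_code_point(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```

An accented letter built from a base and a combining mark then takes two
calls, for example append_code_point(s, U'e') followed by
append_code_point(s, 0x0301).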

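If you have LC_COLLATE set to en_US.utf8 then std::sort should just
work, as long as you hand it a locale-aware comparison such as a
std::locale object; the default operator< compares bytes and ignores the
locale. (Replace en_ with whatever is used in your locale.)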
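
A small sketch of locale-aware sorting, for reference. std::sort uses
operator< unless it is given a comparison object, and std::locale can serve as
one because its operator() compares two strings with the locale's collate
facet. The helper names are made up, and the fallback to the classic "C"
locale is my own assumption for systems where the named locale is not
installed.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: build the named locale, or fall back to the
// classic "C" locale if the system does not provide it.
inline std::locale collation_locale(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Hypothetical helper: sort using the collation rules of loc.
// std::locale::operator() acts as the "less than" comparison.
inline void sort_by_collation(std::vector<std::string>& words,
                              const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
}
```

Usage would be along the lines of
sort_by_collation(words, collation_locale("en_US.utf8")); the exact locale
name is platform dependent.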

Patrick

