Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Mostafa (mostafa_working_away_at_[hidden])
Date: 2011-01-21 07:26:33

On Fri, 21 Jan 2011 01:35:15 -0800, Patrick Horgan <phorgan1_at_[hidden]> wrote:

> On 01/20/2011 12:52 PM, Mostafa wrote:
>> ... elision by patrick ...
>> On second thought, is there really a need to access the underlying data
>> of utf8_t? I argue that having a view of the underlying data via
>> iterators accomplishes just as much(*), and is more in line with the STL
>> tradition of containers and iterators, not to mention the better
>> encapsulation it affords the interface. Do clients really need to
>> know, and potentially develop a dependency on, the fact that utf8_t
>> (for now?) is really just a wrapper for std::string?
> What type would be returned by operator* on the iterator for a
> utf8_string? char32_t? What do you do about combining characters?
> Return them one at a time and let the application deal with it? That's
> what I think. I don't see what else you could do. There's a lot of
> other issues. Assuming it has the same interface as std::string how
> would you do max_size()? How about the comparison operators? There's:
> template<typename charT, typename traits, typename Allocator>
> bool operator<=(const basic_string<charT, traits, Allocator>& lhs, const
> charT* rhs);
> What would the equivalent be for utf8_string? For the above, the rhs is
> in effect converted to basic_string for the comparison. For a
> utf8_string, what if the rhs doesn't convert to utf-8? Should there be
> some conversion facet able to be specified for the rhs? std::string's
> comparison operators are supposed to take linear time. These would
> capacity() is supposed to return the largest number of characters the
> string can hold without reallocation. Would you return that by
> considering that the smallest characters would only take one byte?
> The std::string's operator[] is supposed to work in constant time. This
> one couldn't. It would be fun to make it, but it would have to differ
> in some ways from the specification of std::string.
> How about push_back or insert? What do they take for the argument? A
> char32_t encoded as utf-32? Of course you'd have to insert combining
> characters one part at a time.
> If you have LC_COLLATE set to en_US.utf8 then std::sort should just
> work. (Replace en_ with whatever is used in your locale.)
> Patrick

Interesting questions, but how do they relate to the sequence of posts you

Nevertheless, let me attempt to address some of them in the context of
utf8_t and what I had posted. I was thinking that utf8_t should just be
considered a container, whose interface only deals with iterators when it
comes to "element" access; and that there should be three types of such
iterators: code unit iterators, code point iterators, and character
iterators. The utf8_t API should not accept or return individual code
unit types (i.e., an octet type), or individual code point types (i.e., a
32-bit type), or, obviously, individual character types, since there is no
C++ type that can represent any Unicode character.
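
To make the first two views concrete, here is a minimal sketch, not a
proposed implementation. Only the name utf8_t comes from this thread;
every member name, the helper utf8_sequence_length, and the choice to
have the code point iterator dereference to char32_t (the type Patrick
asked about) are assumptions made for illustration. It also assumes the
stored bytes are well-formed UTF-8 and does no validation. A character
iterator is left out because finding character boundaries needs Unicode
property data.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Octets in the UTF-8 sequence that starts with lead byte b.
// Assumption: b really is a lead byte of well-formed UTF-8.
inline std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    assert((b & 0xF8) == 0xF0);
    return 4;
}

// Hypothetical container; the interface below is illustrative only.
class utf8_t {
public:
    // Code unit view: the stored octets, exposed only through iterators.
    typedef std::string::const_iterator code_unit_iterator;

    // Code point view: decodes one code point per step. operator* returns
    // by value, so only the input iterator category is claimed.
    class code_point_iterator {
    public:
        typedef std::input_iterator_tag iterator_category;
        typedef char32_t value_type;
        typedef std::ptrdiff_t difference_type;
        typedef const char32_t* pointer;
        typedef char32_t reference;

        explicit code_point_iterator(code_unit_iterator pos) : pos_(pos) {}

        char32_t operator*() const {
            code_unit_iterator p = pos_;
            const unsigned char lead = static_cast<unsigned char>(*p);
            const std::size_t n = utf8_sequence_length(lead);
            // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
            char32_t cp = (n == 1) ? lead : (lead & (0xFF >> (n + 1)));
            for (std::size_t i = 1; i < n; ++i) {
                ++p;  // each continuation byte contributes 6 bits
                cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3F);
            }
            return cp;
        }

        code_point_iterator& operator++() {
            const unsigned char lead = static_cast<unsigned char>(*pos_);
            pos_ += static_cast<std::ptrdiff_t>(utf8_sequence_length(lead));
            return *this;
        }

        bool operator==(const code_point_iterator& o) const { return pos_ == o.pos_; }
        bool operator!=(const code_point_iterator& o) const { return pos_ != o.pos_; }

    private:
        code_unit_iterator pos_;
    };

    explicit utf8_t(std::string bytes) : bytes_(std::move(bytes)) {}

    code_unit_iterator code_units_begin() const { return bytes_.begin(); }
    code_unit_iterator code_units_end() const { return bytes_.end(); }
    code_point_iterator code_points_begin() const { return code_point_iterator(bytes_.begin()); }
    code_point_iterator code_points_end() const { return code_point_iterator(bytes_.end()); }

private:
    std::string bytes_;  // never handed out directly
};
```

With this sketch, a utf8_t holding "a", U+00E9 and U+20AC spans six code
units but only three code points, and clients never see that a
std::string sits underneath.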

Thus, insert() and push_back() would take a range of iterators, etc...
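
As a rough illustration of range-taking modifiers, the sketch below
shows one possible shape. The signatures, the helper code_unit_count,
and the decision to position insert() with a code unit iterator are all
assumptions, not a settled design; validation that the incoming range is
well-formed UTF-8 is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch; only the name utf8_t comes from the thread.
class utf8_t {
public:
    typedef std::string::const_iterator code_unit_iterator;

    utf8_t() {}

    // push_back analogue: append the octets in [first, last).
    template <typename CodeUnitIt>
    void push_back(CodeUnitIt first, CodeUnitIt last) {
        bytes_.append(first, last);
    }

    // insert analogue: splice [first, last) in front of pos, which must
    // be an iterator into this object (or its end iterator).
    template <typename CodeUnitIt>
    void insert(code_unit_iterator pos, CodeUnitIt first, CodeUnitIt last) {
        const std::string::difference_type off = pos - bytes_.cbegin();
        assert(off >= 0 &&
               static_cast<std::size_t>(off) <= bytes_.size());
        bytes_.insert(bytes_.begin() + off, first, last);
    }

    std::size_t code_unit_count() const { return bytes_.size(); }
    code_unit_iterator code_units_begin() const { return bytes_.begin(); }
    code_unit_iterator code_units_end() const { return bytes_.end(); }

private:
    std::string bytes_;
};
```

Because the argument is a range, a base letter and its combining mark
(for example "e" followed by U+0301) can go in with a single call rather
than one part at a time.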

And does operator[] make sense for utf8_t, or should it be more aptly

        iterator_range character(size_t const ordinal_position)

Though, I would argue one wouldn't need any of the latter two methods if
the aforementioned iterators are random access (and I don't see a reason
why they shouldn't be).
