Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Mostafa (mostafa_working_away_at_[hidden])
Date: 2011-01-21 07:26:33


On Fri, 21 Jan 2011 01:35:15 -0800, Patrick Horgan <phorgan1_at_[hidden]>
wrote:

> On 01/20/2011 12:52 PM, Mostafa wrote:
>> ... elision by patrick ...
>>
>> On second thought, is there really a need to access the underlying data
>> of utf8_t? I argue that having a view of the underlying data via
>> iterators accomplishes just as much(*), and is more inline with the stl
>> tradition of containers and iterators, not to mention the better
>> encapsulation it affords the interface. Do clients really need to
>> know, and potentially develop a dependency on, the fact that utf8_t
>> (for now?) is really just a wrapper for std::string?
> What type would be returned by operator* on the iterator for a
> utf8_string? char32_t? What do you do about combining characters?
> Return them one at a time and let the application deal with it? That's
> what I think. I don't see what else you could do. There's a lot of
> other issues. Assuming it has the same interface as std::string how
> would you do max_size()? How about the comparison operators? There's:
>
> template<typename charT, typename traits, typename Allocator>
> bool operator<=(const basic_string<charT, traits, Allocator>& lhs, const
> charT* rhs);
>
> What would the equivalent be for utf8_string? For the above, the rhs is
> in effect converted to basic_string for the comparison. For a
> utf8_string, what if the rhs doesn't convert to utf-8? Should there be
> some conversion facet able to be specified for the rhs? std::string's
> comparison operators are supposed to take linear time. These would
>
> capacity() is supposed to return the largest number of characters the
> string can hold without reallocation. Would you return that by
> considering that the smallest characters would only take one byte?
>
> The std::string's operator[] is supposed to work in constant time. This
> one couldn't. It would be fun to make it, but it would have to differ
> in some ways from the specification of std::string.
>
> How about push_back or insert? What do they take for the argument? A
> char32_t encoded as utf-32? Of course you'd have to insert combining
> characters one part at a time.
>
> If you have LC_COLLATE set to en_US.utf8 then std::sort should just
> work. (Replace en_ with whatever is used in your locale.)
>
> Patrick

Interesting questions, but how do they relate to the sequence of posts you
cited?

Nevertheless, let me try to address some of them in the context of
utf8_t and what I had posted. I was thinking that utf8_t should be
treated purely as a container whose interface deals only in iterators
when it comes to "element" access, and that there should be three kinds
of such iterators: code unit iterators, code point iterators, and
character iterators. The utf8_t API should not accept or return
individual code units (i.e., an octet type) or individual code points
(i.e., a 32-bit type), nor, obviously, individual characters, since
there is no C++ type that can represent every Unicode character.
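
To illustrate the difference between the first two views, here is a
minimal sketch of a code point iterator layered over the octets of a
std::string. The names code_point_iterator and code_points are
hypothetical and not part of any proposed utf8_t interface, and the
sketch assumes the input is already well-formed UTF-8 (it performs no
validation). Character iterators are not shown, since they would also
need the Unicode grapheme cluster rules.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical iterator that presents well-formed UTF-8 octets as code
// points. Tagged as an input iterator because operator* returns by value.
class code_point_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit code_point_iterator(std::string::const_iterator pos) : pos_(pos) {}

    // Decode the code point whose lead octet is at the current position.
    char32_t operator*() const {
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        const int len = length(lead);
        // Keep only the payload bits of the lead octet.
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3Fu);
        return cp;
    }

    // Advance by one whole code point, i.e. by one to four octets.
    code_point_iterator& operator++() {
        pos_ += length(static_cast<unsigned char>(*pos_));
        return *this;
    }
    code_point_iterator operator++(int) {
        code_point_iterator old = *this;
        ++*this;
        return old;
    }

    bool operator==(const code_point_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const code_point_iterator& o) const { return pos_ != o.pos_; }

private:
    // Number of octets in the sequence introduced by this lead octet.
    static int length(unsigned char lead) {
        assert((lead & 0xC0) != 0x80 && "must be positioned on a lead octet");
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0E) return 3;
        return 4;
    }

    std::string::const_iterator pos_;
};

// Convenience helper: collect every code point of a UTF-8 string.
std::vector<char32_t> code_points(const std::string& s) {
    std::vector<char32_t> out;
    for (code_point_iterator it(s.begin()), last(s.end()); it != last; ++it)
        out.push_back(*it);
    return out;
}
```

The code unit view is simply the string's own iterators, so the same
six octets can be walked either as six code units or as three code
points.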

Thus insert() and push_back() would each take an iterator range, and so
on.
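
A rough sketch of what a range-taking push_back() might look like,
assuming the range yields code points and that the class stores UTF-8
octets internally. Everything here (the member names, the choice of
std::string as storage, the code_unit_begin()/code_unit_end() accessors)
is an assumption for illustration, not a settled design.

```cpp
#include <cassert>
#include <string>

// Hypothetical utf8_t: accepts only iterator ranges, never a single
// code unit or code point, and exposes its octets only via iterators.
class utf8_t {
public:
    using code_unit_iterator = std::string::const_iterator;

    // Append every code point in [first, last), encoding each as UTF-8.
    template <class CodePointIt>
    void push_back(CodePointIt first, CodePointIt last) {
        for (; first != last; ++first)
            encode(static_cast<char32_t>(*first));
    }

    code_unit_iterator code_unit_begin() const { return bytes_.begin(); }
    code_unit_iterator code_unit_end() const { return bytes_.end(); }

private:
    void encode(char32_t cp) {
        // Reject values outside Unicode and the surrogate range.
        assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
        if (cp < 0x80) {
            bytes_ += static_cast<char>(cp);
        } else if (cp < 0x800) {
            bytes_ += static_cast<char>(0xC0 | (cp >> 6));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            bytes_ += static_cast<char>(0xE0 | (cp >> 12));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            bytes_ += static_cast<char>(0xF0 | (cp >> 18));
            bytes_ += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    std::string bytes_;  // storage is an implementation detail
};
```

A combining sequence would then be appended as one range of several
code points rather than one part at a time.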

And does operator[] make sense for utf8_t, or should it be more aptly
named:

        iterator_range character(size_t const ordinal_position)

Though I would argue one wouldn't need either of the latter two methods
if the aforementioned iterators were random access (and I don't see a
reason why they shouldn't be).

Mostafa

