Subject: Re: [boost] [UTF String] UTF String library 1.5 ready for perusal
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-02-10 19:52:54

On Fri, 11 Feb 2011 00:07:58 +0100
Mathias Gaunard <mathias.gaunard_at_[hidden]> wrote:

>>> There is no need for any reasoning: look at the code of your code
>>> point iterator. It uses a pointer and indexes, and is therefore not
>>> a generic iterator adaptor.
>> It wasn't meant to be generic. It was meant to be exactly what it is:
>> an iterator specific to the UTF type where it's defined. For that
>> purpose, it's designed exactly as it should be, IMHO.
> It was already stated on this list that the ability to deal with
> arbitrary ranges is more valuable; I am merely stating that your
> iterator could work with any iterator with virtually no change.

Either I missed that statement, or I don't recognize it in this
context, because I'm not sure what you mean.

>> I could make it fully generic, but it wouldn't be nearly as efficient
>> that way. I chose to do the extra work to make it efficient.
> Your code never uses the fact that the iterator is a pointer or that
> memory is stored contiguously.

How would you suggest that it use that information?

I think we have different definitions of generic. As I said in my last
message, I define fully generic as working with any UTF-encoded string
(not just UTF-8). That would be possible, but would almost certainly be
less processor-efficient than having iterators customized to each type.

> I also don't think that for example your utf-8 iterating strategy is
> very fast. Your utf-8 decoding itself seems to have lots of
> repetition and unnecessary tests and memory accesses...
> Not that it is easy to make that kind of thing fast anyway.

There's always room for further optimization, at the cost of more
programmer time and more code.

>> It could be a bidirectional iterator, as it has all of the abilities
>> of one. And it could be a random access iterator, as it has all but
>> one of the requirements for that (and in many cases has all of
>> them). Given that choice, I chose to make it a random access
>> iterator.
> It has constant-time distance, which isn't very useful and adds
> unnecessary overhead to your iterator.

I disagree. I find it extremely useful to have a true random access
iterator for some strings (most UTF-16 strings, and arguably most UTF-8
strings in many use cases), and an emulated one for the rest. For that,
the overhead is necessary.
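
The "true random access for some strings, emulated for the rest" idea
can be sketched roughly as follows. This is only an illustration, not
the library's code: the function name offset_of_code_point and its
parameters are hypothetical, and it assumes valid UTF-8 plus a cached
code-point count. When every code point occupies one byte, indexing is
constant time; otherwise it falls back to a linear scan.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the library under discussion).
// Returns the byte offset of the n-th code point in a valid UTF-8 string.
// `code_points` is the cached length of the string in code points.
inline std::size_t offset_of_code_point(const std::string& utf8,
                                        std::size_t code_points,
                                        std::size_t n)
{
    // Every code point is a single byte: true random access, O(1).
    if (code_points == utf8.size())
        return n;

    // Otherwise emulate random access with a linear scan, O(n).
    std::size_t i = 0;
    while (n > 0 && i < utf8.size()) {
        ++i; // step past the lead byte
        // Skip continuation bytes, which have the form 10xxxxxx.
        while (i < utf8.size() &&
               (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
            ++i;
        --n;
    }
    return i;
}
```

The cached count is what lets the fast path be detected with a single
comparison.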

>>> I'm not a fan of returning a reference in operator* as well.
>> No choice in that, I ran into at least one STL algorithm under GCC
>> that wouldn't compile if it wasn't a reference, even when it was
>> only being read. I don't remember which one, but it was something
>> important and commonly-used enough that breaking it was not an
>> option.
> I believe the standard containers indeed require it to be a
> reference, but I'm not aware of any problems with any implementation.
> Would you mind telling me what libstdc++ algorithm relies on this?

If I recalled which one it was, I would have put it in the original

>> They give me a way to prevent my iterator code from walking off the
>> beginning or end of the underlying string.
>> The only other way to do it would be to store a pointer to the
>> string object in every iterator, or a pair of iterators or pointers
>> to the underlying type, which I considered worse.
> You seek the next "first" character of a code-unit sequence. That
> indeed causes problems when you reach the end (unless you put a 0 at
> the end of your buffer, which isn't such a bad idea).
> UTF-8 and 16, however, don't require this. You can deduce how many
> code units you need to consume from the first code unit. You do that
> in your decoders.

And I didn't want to duplicate that code in the iterators as well, or
separate it out and possibly add the overhead of another function call
to the decoder functions.
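
For reference, deducing the sequence length from the first code unit,
as described in the quoted text, might look something like this. It is
a minimal sketch under stated assumptions: the function names are
hypothetical and are not taken from the library, and the UTF-16 version
assumes its input is already valid.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of the library under discussion.
// Returns the number of code units in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a well-formed sequence.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)                  return 1; // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // 110xxxxx (0xC0, 0xC1 are overlong)
    if (lead >= 0xE0 && lead <= 0xEF) return 3; // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // 11110xxx, up to U+10FFFF
    return 0; // continuation byte or invalid lead byte
}

// Hypothetical helper: a UTF-16 lead (high) surrogate starts a pair,
// anything else is a single unit. Assumes the input is valid UTF-16.
inline std::size_t utf16_sequence_length(std::uint_least16_t lead)
{
    return (lead >= 0xD800 && lead <= 0xDBFF) ? 2 : 1;
}
```

An iterator built on this can step forward without scanning for the
next lead unit, so it never reads past the end of a valid buffer.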

>> As an important side benefit, they also provide an efficient way to
>> calculate the difference in code points for operator-, which I feel
>> is important.
>> And the size in code-points is (supposed to be) stored at all times.
> What's the point of this?
> The size in code points is not a very useful thing in general.

On the contrary, keeping track of the length of the string is *very*
useful. The alternative is to calculate it on the fly, whenever someone
asks for it. If you want that, you can always go back to using C-style
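
To illustrate the cost of the on-the-fly alternative: counting code
points in UTF-8 means touching every byte. A rough sketch, assuming
valid UTF-8 held in a std::string (the function name is hypothetical,
not the library's):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the library under discussion.
// Counts code points in valid UTF-8 by counting every byte that is
// not a continuation byte (10xxxxxx). Linear in the byte length.
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

A stored length turns this O(n) pass into a constant-time lookup, at
the price of keeping the count up to date on every modification.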

>> All the UTF types were very carefully designed so that there's no
>> chance of invalid data in them, barring extraordinary measures to
>> deliberately corrupt it.
> That's a good thing.
> However, you could use this opportunity to make decoding much faster,
> since you don't need to check for correctness anymore.

As I said, there's always room for further optimization.

>> Anything else? :-)
> You would probably encounter problems on platforms where int is 8 or
> 16 bits.

I haven't seen a platform where an int is 16 bits since DOS, which I
stopped coding for in the late nineties. And I've never seen one where
it's eight bits. Do you know of any modern platform -- as in one that
uses Unicode, and could usefully use this library -- where that's the
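
For context on the int-width concern: a code point needs 21 bits
(the maximum is U+10FFFF), so a plain int is only wide enough where it
has at least that many value bits. One way to sidestep the question is
a fixed-minimum-width type. The names below are hypothetical and only
illustrate the idea:

```cpp
#include <climits>
#include <cstdint>

// Hypothetical typedef: a type guaranteed to have at least 32 bits,
// regardless of how wide int is on the platform.
typedef std::uint_least32_t code_point;

// Largest valid Unicode code point.
const code_point max_code_point = 0x10FFFF;

// True if a plain int can represent every code point on this platform.
inline bool int_can_hold_code_points()
{
    return INT_MAX >= 0x10FFFFL;
}
```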

Chad Nelson
Oak Circle Software, Inc.
