Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-07-20 10:07:36

Rogier van Dalen wrote:
> On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
>> The idea of an "iterator that knows where its end is" is something that
>> comes up fairly often; do the Range experts have any comments about it?
>> I think that in this case an iterator that can be incremented and
>> dereferenced in some limited way beyond its end would be sufficient. For
>> example, a std::string normally has a 0 byte beyond its end so that c_str()
>> can work, so it is safe (for some value of safe!) to keep advancing a
>> std::string::iterator until a 0 is seen, without looking for end(). A UTF-8
>> decoding algorithm that processes multi-byte characters by continuing until
>> the top bit is not set would safely terminate in this case.
> (I don't think I'm a Range expert.) I think there are problems with
> this example. Adding '\0' at the end is not mandated by the standard,
> right?

My recollection is that the standard makes it hard to implement a
std::string that does not have a 0 after the last element, or that has
non-contiguous storage. These are assumptions that I would probably be
happy to accept in my own internal-use code, but you are right to say
that they are probably not appropriate for library code. For library
code you would need to wrap the container with something that
guarantees this behaviour in a more solid way. Out of interest, does
anyone know if a std::vector that has been reserve()d guarantees
anything about dereferencing beyond-the-end iterators? It would be
great if the value read could be unspecified yet the access were
certain not to segfault.
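
One way to picture "something that guarantees this behaviour in a more solid way" is a small owning wrapper that always stores its own 0 byte after the data, so that reading *end() is well defined. This is only a sketch: the name terminated_bytes and its interface are hypothetical and not part of any proposed library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical wrapper: copies the bytes and appends one 0 byte that it owns.
// end() points at that terminator, so dereferencing end() is valid here.
class terminated_bytes {
public:
    explicit terminated_bytes(const std::string& s)
        : buf_(s.begin(), s.end()) {
        buf_.push_back(0);  // the sentinel lives inside our own storage
    }

    const unsigned char* begin() const {
        assert(!buf_.empty());  // always true: the terminator is present
        return &buf_[0];
    }
    const unsigned char* end() const { return begin() + size(); }

    // Number of payload bytes, excluding the terminator.
    std::size_t size() const { return buf_.size() - 1; }

private:
    std::vector<unsigned char> buf_;
};
```

The cost is a copy of the data; the benefit is that the guarantee no longer depends on how a particular std::string happens to be implemented.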

> Also, '\0' could also occur in the middle of the sequence.

That doesn't cause a problem for this application.

>> For iterators that don't offer this sort of behaviour you can provide a
>> wrapper that knows where end is and returns a sentinel 0 in that case.
> Wouldn't this end up requiring two if-statements? In general, a
> sentinel which is in the valid range of the value type would be an
> awkward sentinel.

No, the point is that it doesn't need any extra if statements at all.
The code is already looking for a top-bit-clear byte to indicate the
end of the multibyte character, and the 0 byte does that.
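
To make that concrete, here is a minimal sketch of such a decoder, assuming the input is a byte buffer followed by a 0 byte. The function name decode_one is hypothetical, and the code does no validation of the input. The inner loop tests for the continuation-byte pattern 10xxxxxx rather than only the top bit; a 0 byte fails that test like any other non-continuation byte, so the loop needs no comparison against end.

```cpp
#include <cassert>
#include <cstdint>

// Decodes one code point starting at p and advances p past it.
// Assumption: the buffer is followed by a 0 byte, and the caller does not
// call this with p already at that terminator.
inline std::uint32_t decode_one(const unsigned char*& p) {
    assert(p != 0);  // precondition: p points into a terminated buffer

    unsigned char lead = *p++;
    if (lead < 0x80) return lead;  // single-byte (ASCII) character

    // Keep only the payload bits of the lead byte.
    std::uint32_t cp;
    if (lead >= 0xF0)      cp = lead & 0x07;  // 4-byte sequence
    else if (lead >= 0xE0) cp = lead & 0x0F;  // 3-byte sequence
    else                   cp = lead & 0x1F;  // 2-byte sequence (or stray byte)

    // Consume continuation bytes. The terminating 0 stops this loop even if
    // the input is truncated mid-character; there is no test against end.
    while ((*p & 0xC0) == 0x80) {
        cp = (cp << 6) | (*p++ & 0x3F);
    }
    return cp;
}
```

The caller's outer loop still compares against end once per character, as it would anyway; what is saved is a comparison for every continuation byte. With truncated input the returned value is meaningless, but the read never goes past the terminator.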

> However, I can see where you're coming from. Being able to tell from
> an iterator whether it's at the end of its range is often useful. Is
> operator*() the right place to implement this functionality?
> Wouldn't a free function
> is_at_end(Iterator)
> make more sense?

Then you do have the extra if statement.
