Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Eric Niebler (eric_at_[hidden])
Date: 2009-07-20 13:36:31
Phil Endecott wrote:
> Rogier van Dalen wrote:
>> On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
>>> The idea of an "iterator that knows where its end is" is something that
>>> comes up fairly often; do the Range experts have any comments about it?
>>>
>>> I think that in this case an iterator that can be incremented and
>>> dereferenced in some limited way beyond its end would be sufficient.
>>> For example, a std::string normally has a 0 byte beyond its end so
>>> that c_str() can work, so it is safe (for some value of safe!) to keep
>>> advancing a std::string::iterator until a 0 is seen, without looking
>>> for end(). A UTF-8 decoding algorithm that processes multi-byte
>>> characters by continuing until the top bit is not set would safely
>>> terminate in this case.
>>
>> (I don't think I'm a Range expert.) I think there are problems with
>> this example. Adding '\0' at the end is not mandated by the standard,
>> right?
>
> My recollection is that the standard makes it hard to implement a
> std::string that does not have a 0 after the last element, or that has
> non-contiguous storage.
I believe that as of C++03, std::string is required to have contiguous
storage. The null-termination is another thing. The only guarantee is
that the char* returned by c_str() is required to be null terminated. No
such guarantee is made for the sequence traversed by std::string's
iterators. Dereferencing the end iterator is verboten.
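To make the distinction concrete, here is a minimal sketch (the function names are mine, purely for illustration). The first function relies only on the c_str() guarantee; the second relies only on the iterator range and never reads *end().

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: walks the buffer returned by c_str() until the
// terminating 0. This is covered by the c_str() guarantee.
std::size_t length_via_c_str(const std::string& s) {
    const char* p = s.c_str();
    std::size_t n = 0;
    while (p[n] != '\0') ++n;
    return n;
}

// Hypothetical helper: walks [begin, end) and stops by comparing against
// end(). The end iterator is compared, never dereferenced.
std::size_t length_via_iterators(const std::string& s) {
    std::size_t n = 0;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it) ++n;
    return n;
}

// Small self-check showing where the two disagree.
void check_lengths() {
    std::string plain("abc");
    assert(length_via_c_str(plain) == length_via_iterators(plain));

    std::string embedded("a\0b", 3);  // a 0 byte in the middle
    assert(length_via_c_str(embedded) == 1);
    assert(length_via_iterators(embedded) == 3);
}
```

The two agree for "abc" but not for a string with an embedded 0, which is one more reason a 0 makes an awkward general-purpose sentinel.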
> These are assumptions that I would probably be
> happy to accept in my own internal-use code, but you are right to say
> that they are probably not appropriate for library code. For library
> code you would need to wrap the container with something that guarantees
> this behaviour in a more solid way. Out of interest, does anyone know
> if a std::vector that has been reserve()d guarantees anything about
> dereferencing beyond-the-end iterators? It would be great if they were
> allowed to be undefined yet certain not to segfault.
Undefined behavior. Don't do it.
>> Also, '\0' could also occur in the middle of the sequence.
>
> That doesn't cause a problem for this application.
But not in general.
>>> For iterators that don't offer this sort of behaviour you can provide a
>>> wrapper that knows where end is and returns a sentinel 0 in that case.
>>
>> Wouldn't this end up requiring two if-statements? In general, a
>> sentinel which is in the valid range of the value type would be an
>> awkward sentinel.
>
> No, the point is that it doesn't need any extra if statements at all.
> The code is already looking for a top-bit-clear byte to indicate the end
> of the multibyte character, and the 0 byte does that.
>
>> However, I can see where you're coming from. Being able to tell from
>> an iterator whether it's at the end of its range is often useful. Is
>> operator*() the right place to implement this functionality?
>> Wouldn't a free function
>> is_at_end(Iterator)
>> make more sense?
>
> Then you do have the extra if statement.
If you write a generic algorithm that takes a ForwardRange or two
ForwardIterators, you're not allowed to assume that the sequence is
terminated with a sentry. Why? Because that's not part of the
ForwardRange/ForwardIterator concept.
You would be allowed to define a refinement, say a
ContiguousNullTerminatedIterator concept, such that &*it is required to
point to a contiguous block of memory that is terminated with a sentry
at the position specified by the end iterator. You could also define a
trait to detect conformance to the concept. Then you could use this
trait to dispatch to an algorithm optimized for that trait.
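A rough sketch of what that dispatch could look like follows. Everything here is hypothetical: the trait name, the choice to mark const char* as conforming (the caller would still have to promise that *last is a readable 0, as with c_str()), and the code-point counter used as the example algorithm. It assumes C++11 for <type_traits>.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical trait: true only for iterator types whose range is promised
// to have a readable 0 sentry at the end position.
template <class It>
struct is_null_terminated_iterator : std::false_type {};

// Assumption for this sketch: const char* ranges come from c_str() or a
// string literal, so *last is a readable 0.
template <>
struct is_null_terminated_iterator<const char*> : std::true_type {};

inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Generic version: compares against last before every read.
template <class It>
std::size_t count_code_points_impl(It first, It last, std::false_type) {
    std::size_t n = 0;
    while (first != last) {
        ++first;  // skip the lead byte
        while (first != last && is_continuation(*first)) ++first;
        ++n;
    }
    return n;
}

// Sentry version: the inner loop reads *first without comparing to last;
// the 0 at last is not a continuation byte, so the loop stops there.
template <class It>
std::size_t count_code_points_impl(It first, It last, std::true_type) {
    std::size_t n = 0;
    while (first != last) {
        ++first;  // skip the lead byte
        while (is_continuation(*first)) ++first;
        ++n;
    }
    return n;
}

// Dispatches on the trait at compile time.
template <class It>
std::size_t count_code_points(It first, It last) {
    return count_code_points_impl(
        first, last, typename is_null_terminated_iterator<It>::type());
}

// Small self-check: both paths should agree on the same bytes.
void check_dispatch() {
    std::string s = "h\xC3\xA9llo";  // 6 bytes, 5 code points
    assert(count_code_points(s.begin(), s.end()) == 5);
    assert(count_code_points(s.c_str(), s.c_str() + s.size()) == 5);
}
```

The only saving in the sentry version is one comparison per continuation byte, which is why measurements would be needed before committing to a design like this.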
This seems like a lot of work for little gain to me, though. I'd like to
see performance numbers to justify a design like that.
-- 
Eric Niebler
BoostPro Computing
http://www.boostpro.com