Boost logo

Boost :

Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-07-21 07:28:59


Eric Niebler wrote:
> Phil Endecott wrote:
>> Rogier van Dalen wrote:
>>> On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
>>>> The idea of an "iterator that knows where its end is" is something that
>>>> comes up fairly often; do the Range experts have any comments about it?
>>>>
>>>> I think that in this case an iterator that can be incremented and
>>>> dereferenced in some limited way beyond its end would be sufficient.
>>>> ?For
>>>> example, a std::string normally has a 0 byte beyond its end so that
>>>> c_str()
>>>> can work, so it is safe (for some value of safe!) to keep advancing a
>>>> std::string::iterator until a 0 is seen, without looking for end().
>>>> ?A UTF-8
>>>> decoding algorithm that processes multi-byte characters by continuing
>>>> until
>>>> the top bit is not set would safely terminate in this case.
>>>
>>> (I don't think I'm a Range expert.) I think there are problems with
>>> this example. Adding '\0' at the end is not mandated by the standard,
>>> right?
>>
>> My recollection is that the standard makes it hard to implement a
>> std::string that does not have a 0 after the last element, or that has
>> non-contiguous storage.
>
> I believe that as of C++03, std::string is required to have contiguous
> storage. The null-termination is another thing. The only guarantee is
> that the char* returned by c_str() is required to be null terminated. No
> such guarantee is made for the sequence traversed by std::string's
> iterators. Dereferencing the end iterator is verboten.

Here is an outline of a zero-overhead wrapper for std::string that I
hope guarantees that *end() == 0:

struct string_with_zero_beyond_end: std::string {
   typedef const char* const_iterator;
   const_iterator begin() const { return c_str(); }
   const_iterator end() const { return c_str() + length(); }
};

>> These are assumptions that I would probably be
>> happy to accept in my own internal-use code, but you are right to say
>> that they are probably not appropriate for library code. For library
>> code you would need to wrap the container with something that guarantees
>> this behaviour in a more solid way. Out of interest, does anyone know
>> if a std::vector that has been reserve()d guarantees anything about
>> dereferencing beyond-the-end iterators? It would be great if they were
>> allowed to be undefined yet certain not to segfault.
>
> Undefined behavior. Don't do it.

Something similar is possible.

>>> Also, '\0' could also occur in the middle of the sequence.
>>
>> That doesn't cause a problem for this application.
>
> But not in general.

We're not concerned with "in general", we're concerned with this
particular problem of traversing bytes of UTF-8 efficiently with the
possibility of malformed (e.g. malicious) input, without having to
constantly check for end-of-input.

>>>> For iterators that don't offer this sort of behaviour you can provide a
>>>> wrapper that knows where end is and returns a sentinel 0 in that case.
>>>
>>> Wouldn't this end up requiring two if-statements? In general, a
>>> sentinel which is in the valid range of the value type would be an
>>> awkward sentinel.
>>
>> No, the point is that it doesn't need any extra if statements at all.
>> The code is already looking for a top-bit-clear byte to indicate the end
>> of the multibyte character, and the 0 byte does that.
>>
>>> However, I can see where you're coming from. Being able to tell from
>>> an iterator whether it's at the end of its range is often useful. Is
>>> operator*() is the right place to implement this functionality?
>>> Wouldn't a free function
>>> is_at_end(Iterator)
>>> make more sense?
>>
>> Then you do have the extra if statement.
>
> If you write a generic algorithm that takes a ForwardRange or two
> ForwardIterators, you're not allowed to assume that the sequence is
> terminated with a sentry. Why? Because that's not part of the
> ForwardRange/ForwardIterator concept.
>
> You would be allowed to define a refinement, say
> ContiguousNullTerminatedIterator concept such that &*it is required to
> point to a contiguous block of memory that is terminated with a sentry
> at the position specified by the end iterator. You could also define a
> trait to detect conformance to the concept. Then you could use this
> trait to dispatch to an algorithm optimized for that trait.

I had something to detect contiguous storage to do the trick of casting
to an int and checking 4 or 8 top bits simultaneously.

> This seems like a lot of work for little gain to me, though. I'd like to
> see performance numbers to justify a design like that.

Cory Nelson reported that Windows 7 can decode UTF-8 3 times faster
than his code. I don't want Boost to end up with UTF-8 features that
are all lovely and generic and full of concepts but that run 3 times
slower on common types (strings, vector<char>, points) than they could
if they were less generic.

Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk