Boost logo

Boost :

From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2008-03-09 09:53:53


Phil Endecott wrote:
> Sebastian Redl wrote:
>
>> Another conceptual problem in your traits. Take a look at UTF-8's
>> skip_forward_char:
>>
>> template <typename char8_ptr_t>
>> static void skip_forward_char(char8_ptr_t& i) {
>> do {
>> ++i;
>> } while (!char_start_byte(*i)); // Maybe hint this?
>> }
>>
>> And this loop:
>>
>> for(iterator it = cnt.begin(); it != cnt.end(); skip_forward_char(it)) {
>> }
>>
>> This will always invoke undefined behaviour. Consider the case where it
>> is just before end(), i.e. ++it == cnt.end(). Then skip_forward_char()
>> will indeed do ++it, and then do *it, thus dereferencing the
>> past-the-end iterator. Boom.
>>
>
> Yes, absolutely. I'm aware of this and similar problems. But please
> keep reporting them :-)
>
> In this case the problem is slightly less serious if you write
> something more like
>
> skip_forward_char(char8_ptr_t& i) {
> advance(i,char_length(*i));
> }
>
> In this case you don't dereference an invalid iterator if the input is
> valid and complete UTF8. That might be useful in some circumstances.
> But computing char_length is actually harder than the loop with char_start_byte().
>
The loop can also be done like this (I prefer unsigned char as the base
for UTF-8 because of the bit fiddling):

unsigned char lb = *i;
++i;
if((lb & 0x80) == 0) return;
lb <<= 1;
while((lb & 0x80) != 0) {
  ++i;
  lb <<= 1;
}

This might be faster than first determining the character count. Depends
on how you do it, and on the architecture. Profiling needed, also for
testing against an end iterator.
> On the other hand my code does work for zero-terminated data, which is
> useful in the case of std::string::c_str().
But not in the general case.
> I presume that the
> standard doesn't guarantee that dereferencing the byte after the end of
> a string returns 0, even though an implementation that provides c_str()
> in the obvious way would have to do so, right?
>
No such guarantee, but yes, for the "obvious" implementation this would
be correct. An implementation doesn't have to be obvious, though. There
are still some left that aren't, I think.

We can implement UTF-8's and UTF-16's skip_forward by looking at the
current byte. But does that work with all encodings? I think it doesn't
work for shift encodings, unless you're willing to come to a stop on a
shift character. I'm not: there's a rule for some shift encodings that
they *must* end in the initial shift state, which means that there's a
good chance that a shift character is the last thing in the string. This
would mean, however, that if you increment an iterator that points to
the last real character, it must scan past the shift character or it
won't compare equal to the end iterator. Unless you're willing to scan
past the shift in the equality test, another thing I wouldn't do.

Seems to me that shift encodings are a lot more pain than they're worth.
I really have to wonder why anyone would ever have come up with them.

> // But this may terminate either because it reached the end of the
> // input or because it reached the end of the output. So perhaps it
> // needs to return a pair<> of iterators reporting how far it got through
> // each.
>
I think that was the point where I gave up the last time I started
developing a character conversion library. :-)
Here's another thing to think about: it is not possible to construct an
end iterator for any of the insert iterators. The without-end version is
therefore absolutely necessary.
> I've added functions (or maybe
> constants) to the charset_traits indicating the maximum number of units
> per character.
>
Another thing you need: a trait to calculate exactly the number of units
needed to store a codepoint.

std::size_t units_for_char(char_t cp, shift_state s);

Returns e.g. between 1 and 4 for UTF-8. Actually, I think the
shift_state should be passed by reference and be modified, so that you
can run through a sequence of codepoints and calculate precisely the
needed length.

typedef charset_traits<Enc> traits;
std::size_t length = 0;
typename traits::state_type state = state_type();
for(auto cpit = codepoints.begin(); cpit != codepoints.end(); ++cpit) {
  length += traits::units_for_char(*cpit, state);
}
length += traits::units_for_finish(state);

The units_for_finish trait again refers to the requirement of some shift
encodings to finish in the default shift state. units_for_finish
calculates the number of units needed to shift from the given state back
to the initial state.
> BTW I have just written a base64 decoding iterator adaptor. It also
> needs you to pass an iterator referring to the end of the data so that
> it can do the right thing at the end. Anyone interested?
>
I'm sure someone could find a use for such a utility.

Sebastian


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk