From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2008-03-08 14:56:10


Sebastian Redl wrote:
> Phil Endecott wrote:
>> OK, the code is here:
>> http://svn.chezphil.org/libpbe/trunk/include/charset/
>>
>> and there are some very basic docs here:
>> http://svn.chezphil.org/libpbe/trunk/doc/charsets/
>> (Have a look at intro.txt for the feature list.)
>>
> Another conceptual problem in your traits. Take a look at UTF-8's
> skip_forward_char:
>
> template <typename char8_ptr_t>
> static void skip_forward_char(char8_ptr_t& i) {
> do {
> ++i;
> } while (!char_start_byte(*i)); // Maybe hint this?
> }
>
> And this loop:
>
> for(iterator it = cnt.begin(); it != cnt.end(); skip_forward_char(it)) {
> }
>
> This will always invoke undefined behaviour. Consider the case where it
> is just before end(), i.e. ++it == cnt.end(). Then skip_forward_char()
> will indeed do ++it, and then do *it, thus dereferencing the
> past-the-end iterator. Boom.

Yes, absolutely. I'm aware of this and similar problems. But please
keep reporting them :-)

In this case the problem is slightly less serious if you write
something more like

template <typename char8_ptr_t>
static void skip_forward_char(char8_ptr_t& i) {
   std::advance(i, char_length(*i));
}

With this version you never dereference an invalid iterator, provided
the input is valid and complete UTF-8. That might be useful in some
circumstances. But computing char_length is actually harder than the
loop with char_start_byte().
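
To make the comparison concrete, here is a minimal sketch of what a
char_length() for UTF-8 might look like. It is not taken from the libpbe
code; the signature and the helper name skip_forward_char_by_length are
assumptions made for illustration. It looks only at the lead byte, so it
needs up to four mask-and-compare steps where char_start_byte() needs one.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that
// starts with 'lead', judged from the lead byte alone.
// Returns 0 for a continuation byte (10xxxxxx) or for 0xF8..0xFF.
// It does not reject the overlong leads 0xC0/0xC1 or 0xF5..0xF7.
inline std::size_t char_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical name: advance by the length announced in the lead byte.
// Only safe when the input is valid and complete UTF-8.
inline void skip_forward_char_by_length(const char*& p) {
    std::size_t n = char_length(static_cast<unsigned char>(*p));
    assert(n != 0 && "p must point at the first byte of a character");
    p += n;
}
```

With valid input this never reads a byte outside the current character,
which is the property described above.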

On the other hand my code does work for zero-terminated data, which is
useful in the case of std::string::c_str(). I presume that the
standard doesn't guarantee that dereferencing the byte after the end of
a string returns 0, even though an implementation that provides c_str()
in the obvious way would have to do so, right?

I'm not sure what the best solution to that problem is, but I have
thought more about the converse case where you're storing UTF8 using
the output iterator, and you're writing into a fixed-size buffer, e.g. (pseudo-code)

char* iso88591_data;
size_t iso8859_data_length;

// The UTF8 data will take more space than the iso8859_1 data;
// maybe we know that in our case most bytes will be ASCII, so we allow
// a 10% overhead:
size_t utf8_length = iso8859_data_length + iso8859_data_length / 10;
char* utf8_data = new char[utf8_length];

// In the rare case where that's insufficient we'll abort and retry
// with a larger buffer, or do the rest in another chunk or something.

// Iterator to store utf8:
character_output_iterator<utf8> utf8_it(utf8_data);

// First thought is to use a function with the same signature as std::copy:
seq_conv(iso88591_data, iso88591_data+iso8859_data_length, utf8_it);

// But that doesn't allow us to specify the end of the output buffer. So
// we make that an additional parameter:
character_output_iterator<utf8> utf8_end_it(utf8_data+utf8_length);
seq_conv(iso88591_data, iso88591_data+iso8859_data_length, utf8_it, utf8_end_it);

// But this may terminate either because it reached the end of the
// input or because it reached the end of the output. So perhaps it
// needs to return a pair<> of iterators reporting how far it got through
// each.

But I'm also concerned that the inner loops in these conversion
algorithms shouldn't be doing more comparisons than is absolutely
necessary. So I'm currently considering having both versions, with and
without the destination-end iterator. I've added functions (or maybe
constants) to the charset_traits indicating the maximum number of units
per character. The bounded version can then be implemented something
like this: (pseudo-code!)

seq_conv(in_start, in_end, out_start, out_end) {
   size_t out_length = out_end-out_start;
   max_chars = out_length / charset_traits<cset>::max_units_per_char();
   // We can safely copy max_chars from in to out without worrying about out_end:
   (in_next,out_next) = seq_conv(in_start, min(in_end, in_start+max_chars),
                                 out_start);
   // We do need to worry about out_end while copying the others:
   seq_conv(in_next, in_end, out_next, out_end);
}
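
The two-phase idea can be sketched in compilable form for the specific
case of ISO-8859-1 to UTF-8 over raw char pointers. This is only an
illustration under stated assumptions: put_utf8, the constant
max_units_per_char = 2 (enough for Latin-1 input), and the
pair-returning signatures are hypothetical and not the libpbe interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Assumption: a Latin-1 character needs at most 2 bytes in UTF-8.
const std::size_t max_units_per_char = 2;

// Hypothetical helper: store one Latin-1 character as UTF-8 and
// return the position after the last byte written.
inline char* put_utf8(unsigned char c, char* out) {
    if (c < 0x80) {
        *out++ = static_cast<char>(c);
    } else {
        *out++ = static_cast<char>(0xC0 | (c >> 6));
        *out++ = static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}

// Unbounded version: the caller guarantees there is enough room.
inline std::pair<const char*, char*>
seq_conv(const char* in, const char* in_end, char* out) {
    for (; in != in_end; ++in)
        out = put_utf8(static_cast<unsigned char>(*in), out);
    return std::make_pair(in, out);
}

// Bounded version: reports how far it got through input and output.
inline std::pair<const char*, char*>
seq_conv(const char* in, const char* in_end, char* out, char* out_end) {
    assert(in <= in_end && out <= out_end);
    // Phase 1: this many characters fit whatever their encoded size,
    // so the inner loop needs no check against out_end.
    std::size_t max_chars =
        static_cast<std::size_t>(out_end - out) / max_units_per_char;
    std::size_t n = std::min(max_chars,
                             static_cast<std::size_t>(in_end - in));
    std::pair<const char*, char*> r = seq_conv(in, in + n, out);
    in = r.first;
    out = r.second;
    // Phase 2: check the remaining room before each character.
    while (in != in_end) {
        std::size_t need = static_cast<unsigned char>(*in) < 0x80 ? 1 : 2;
        if (static_cast<std::size_t>(out_end - out) < need) break;
        out = put_utf8(static_cast<unsigned char>(*in), out);
        ++in;
    }
    return std::make_pair(in, out);
}
```

If the returned input pointer is not in_end, the output filled up first
and the caller can retry with a larger buffer or continue in another
chunk. Phase 1 could also be repeated while max_chars is non-zero, which
would leave fewer characters for the checked loop.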

> Compare with filter_iterator. skip_forward_char *must* take the end
> iterator, too, and stop when reaching it. This, in turn, makes the
> charset adapter iterator that much more complicated.

Yes. filter_iterator is a good example; I would like to be consistent
with existing practice when it's appropriate to do so. As you can see
I'm progressing quite slowly with this work. This has the advantage
that I have plenty of time to think about what I should do next before
I implement it....
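
For reference, a bounded skip_forward_char in the style suggested above
might look like the following. This is a minimal sketch, not the libpbe
implementation: the two-argument signature and the count_chars helper
are assumptions, and char_start_byte is written out here only so that
the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A byte starts a character unless it is a continuation byte (10xxxxxx).
inline bool char_start_byte(unsigned char b) {
    return (b & 0xC0) != 0x80;
}

// Advance i to the start of the next character, never stepping past
// end and never dereferencing it. Precondition: i != end.
template <typename Iter>
void skip_forward_char(Iter& i, Iter end) {
    assert(i != end);
    do {
        ++i;
    } while (i != end && !char_start_byte(static_cast<unsigned char>(*i)));
}

// Hypothetical usage: count the characters in a UTF-8 string.
inline std::size_t count_chars(const std::string& s) {
    std::size_t n = 0;
    for (std::string::const_iterator it = s.begin(); it != s.end();
         skip_forward_char(it, s.end()))
        ++n;
    return n;
}
```

The extra comparison against end on every byte is the cost of making
the loop safe for truncated or non-terminated input.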

BTW I have just written a base64 decoding iterator adaptor. It also
needs you to pass an iterator referring to the end of the data so that
it can do the right thing at the end. Anyone interested?

Phil.

