From: John Maddock (john_at_[hidden])
Date: 2004-11-29 12:14:40


>I used Ron Garcia's utf8 codecvt facet, which I downloaded from the Yahoo
> files section. I just wrote some tests and made some tweaks as test results
> for various platforms came in. It includes a manual page in standard Boost
> format.
>
> In the same section (http://groups.yahoo.com/group/boost/files/unicode/)
> there is another file (utf8_transform_iterator) which wraps the same
> functionality in a standard iterator.

I know, he sent me the HTML page a while back ;-)

The main differences between the two are:

A) Ron has written an iterator-generator (via boost::iterator_adaptor),
rather than an iterator (via boost::iterator_facade) as I have. I find the
latter easier to use, but that may just be personal preference.

B) I think my iterators have more checks for invalid code sequences than
Ron's (if you don't check everything that you can then some really bad
things can happen, more on this later). Even so there may well be more
checks that can be added.

C) Ron has a single adapter (it makes a UTF-8 sequence look like a UTF-32
one); I have a whole family of them. Here's the synopsis:

1) Read Only, Input Adapters:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

template <class BaseIterator, class U8Type = ::boost::uint8_t>
class u32_to_u8_iterator;

Adapts a sequence of UTF-32 code points to "look like" a sequence of UTF-8.

template <class BaseIterator, class U32Type = ::boost::uint32_t>
class u8_to_u32_iterator;

Adapts a sequence of UTF-8 code units to "look like" a sequence of UTF-32.

template <class BaseIterator, class U16Type = ::boost::uint16_t>
class u32_to_u16_iterator;

Adapts a sequence of UTF-32 code points to "look like" a sequence of UTF-16.

template <class BaseIterator, class U32Type = ::boost::uint32_t>
class u16_to_u32_iterator;

Adapts a sequence of UTF-16 code units to "look like" a sequence of UTF-32.

2) Single pass output iterator adapters:
~~~~~~~~~~~~~~~~~~~~~~~~~~~

template <class BaseIterator>
class utf8_output_iterator;

Accepts UTF-32 code points and forwards them on as UTF-8 code units.

template <class BaseIterator>
class utf16_output_iterator;

Accepts UTF-32 code points and forwards them on as UTF-16 code units.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
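
To give a feel for what the output adapters in the synopsis have to do, here is a minimal sketch of writing one UTF-32 code point through an output iterator as UTF-8. The function name encode_utf8 and its error handling are assumptions made for illustration; this is not the posted implementation, only the standard UTF-8 bit layout.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not from the posted code): writes one code point
// to `out` as 1 to 4 UTF-8 code units and returns the advanced iterator.
template <class OutputIterator>
OutputIterator encode_utf8(std::uint32_t cp, OutputIterator out)
{
    // Reject values that are not Unicode scalar values.
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        throw std::out_of_range("not a valid Unicode code point");

    if (cp < 0x80u) {                       // 0xxxxxxx
        *out++ = static_cast<std::uint8_t>(cp);
    } else if (cp < 0x800u) {               // 110xxxxx 10xxxxxx
        *out++ = static_cast<std::uint8_t>(0xC0u | (cp >> 6));
        *out++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
    } else if (cp < 0x10000u) {             // 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<std::uint8_t>(0xE0u | (cp >> 12));
        *out++ = static_cast<std::uint8_t>(0x80u | ((cp >> 6) & 0x3Fu));
        *out++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
    } else {                                // 11110xxx plus three continuations
        *out++ = static_cast<std::uint8_t>(0xF0u | (cp >> 18));
        *out++ = static_cast<std::uint8_t>(0x80u | ((cp >> 12) & 0x3Fu));
        *out++ = static_cast<std::uint8_t>(0x80u | ((cp >> 6) & 0x3Fu));
        *out++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
    }
    return out;
}
```

Called as encode_utf8(0x20AC, std::back_inserter(bytes)) with a std::vector<std::uint8_t>, this appends the three octets E2 82 AC.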

D) You will note that there are still two adapters missing - to convert
directly between UTF-8 and UTF-16 - but I haven't needed these, and they
could I suppose be composed from the other 4.

E) Ron's code accepts up to 6 octets in a UTF-8 sequence, whereas mine only
accepts 4. I think this is the difference between Unicode and ISO/IEC
10646-1, and may be a feature or a bug depending upon your point of view.
Personally I needed to ensure that only valid UTF-32 sequences were
generated.
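
The 4-octet limit can be sketched as a check on the lead byte plus a range check on the decoded value. Both helper names below are hypothetical and the code is only an illustration of the rule, not an extract from either implementation. It relies on the fact that Unicode code points stop at U+10FFFF, which never needs more than 4 octets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of octets announced by a UTF-8 lead byte,
// or 0 if the byte cannot start a sequence of at most 4 octets.
inline int utf8_length_from_lead(std::uint8_t lead)
{
    if (lead < 0x80u)            return 1;  // 0xxxxxxx
    if ((lead & 0xE0u) == 0xC0u) return 2;  // 110xxxxx
    if ((lead & 0xF0u) == 0xE0u) return 3;  // 1110xxxx
    if ((lead & 0xF8u) == 0xF0u) return 4;  // 11110xxx
    // Continuation byte (10xxxxxx), a 5- or 6-octet lead (0xF8..0xFD),
    // or 0xFE/0xFF, none of which may start a sequence here.
    return 0;
}

// Hypothetical helper: true if `cp` is a value a UTF-32 sequence may hold,
// i.e. at most U+10FFFF and not in the surrogate block.
inline bool is_valid_code_point(std::uint32_t cp)
{
    return cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu);
}
```

The range check is still needed after decoding, because a 4-octet sequence starting with 0xF4 or above can encode a value beyond U+10FFFF.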

F) On error checking, I have come to the conclusion that both Ron and I have
fallen into the same trap:

If the conversion iterator is constructed from a single base-iterator, and
that base-iterator does not point to the start of a valid UTF-8 sequence,
then it becomes possible for the adapter to increment past the end of a
sequence, or decrement past the start of one (note that my code does trap
invalid UTF-8 sequences, provided they are not at the end of a range).

I believe this problem can be solved by constructing a pair of adapters from
a pair of base iterators: the end points can then be checked to ensure that
nothing bad can happen. We wouldn't want a corrupt UTF-8 sequence to crash
your program after all :-)
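
A minimal sketch of the bounded approach, written as a free function rather than a full adapter class to keep it short. The name decode_utf8 and the choice of exceptions are assumptions for illustration. The point is that the end iterator is consulted before every dereference, so a truncated sequence is reported instead of being read past.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical bounded decoder: reads one code point from [it, end) and
// advances `it` past it. `end` is never dereferenced.
template <class Iterator>
std::uint32_t decode_utf8(Iterator& it, Iterator end)
{
    if (it == end)
        throw std::out_of_range("empty range");
    const std::uint32_t lead = static_cast<unsigned char>(*it);
    ++it;
    if (lead < 0x80u)
        return lead;

    int extra;                 // continuation bytes still expected
    std::uint32_t cp, min_cp;  // value so far, smallest non-overlong value
    if ((lead & 0xE0u) == 0xC0u)      { extra = 1; cp = lead & 0x1Fu; min_cp = 0x80u; }
    else if ((lead & 0xF0u) == 0xE0u) { extra = 2; cp = lead & 0x0Fu; min_cp = 0x800u; }
    else if ((lead & 0xF8u) == 0xF0u) { extra = 3; cp = lead & 0x07u; min_cp = 0x10000u; }
    else
        throw std::invalid_argument("invalid UTF-8 lead byte");

    for (; extra > 0; --extra) {
        if (it == end)  // the check a single-iterator adapter cannot make
            throw std::out_of_range("truncated UTF-8 sequence");
        const std::uint32_t b = static_cast<unsigned char>(*it);
        if ((b & 0xC0u) != 0x80u)
            throw std::invalid_argument("expected a continuation byte");
        ++it;
        cp = (cp << 6) | (b & 0x3Fu);
    }
    if (cp < min_cp || cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        throw std::invalid_argument("overlong or out-of-range encoding");
    return cp;
}
```

A real adapter would also need the start of the range to guard decrement in the same way; this sketch only covers forward traversal.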

The alternative is to leave it to the user to call a "check_range" or
similar function, but this still looks error-prone to me. Oh, and the same
problem arises when iterating UTF-16 sequences as well (the sequence must
not start with a low-surrogate, or end with a high-surrogate).
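
For the UTF-16 case, the end-point condition is small enough to sketch directly. The function below is hypothetical (it is not the "check_range" mentioned above, which was never specified) and tests only the two end points, not the interior of the range.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// High (lead) surrogates are 0xD800..0xDBFF, low (trail) are 0xDC00..0xDFFF.
inline bool is_high_surrogate(std::uint16_t u) { return (u & 0xFC00u) == 0xD800u; }
inline bool is_low_surrogate(std::uint16_t u)  { return (u & 0xFC00u) == 0xDC00u; }

// Hypothetical check: true if [first, last) neither starts in the middle
// of a surrogate pair nor ends halfway through one.
template <class BidirectionalIterator>
bool utf16_endpoints_ok(BidirectionalIterator first, BidirectionalIterator last)
{
    if (first == last)
        return true;                           // empty range is trivially fine
    if (is_low_surrogate(*first))
        return false;                          // starts with a low-surrogate
    if (is_high_surrogate(*std::prev(last)))
        return false;                          // ends with a high-surrogate
    return true;
}
```

With both end points known to be sound, an adapter stepping forwards or backwards over a surrogate pair cannot run off either end of the range.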

Regards,

John.

