Boost logo

Boost :

From: Hervé Brönnimann (hervebronnimann_at_[hidden])
Date: 2006-12-06 00:35:11


Nemanja: I don't use UTF8 myself, but I've had to program
is_valid_utf8 and utf8_to_xml10 recently. You're lucky that I just
gave is_valid as an assignment to a student! I haven't done a review
in a long time, not even for formal reviews, and this is only an
informal review.

Given that unicode is a standard that *should* be supported by C++,
really, this would be a prime candidate for inclusion in boost. I
strongly support your efforts. I started writing a few comments, and
it comes almost as a complete review :-) So here goes my mini-review.

Keep in mind I want to be as constructive and positive as possible.
I'm definitely hoping something will come out of this.

DESIGN
About the documentation: I wish there was a short glossary at the
beginning (it's not provided by unicode.org - the glossary there is
wayyyy too big!). Just to explain "code point", "byte sequence",
"surrogate", etc. Some pointers to the unicode.org FAQ (e.g. for
BOM). Also, before the reference, it would be nice to have a
synopsis to refer to just to remember the interface. And as things
will evolve (interface, design, etc.) you'll need to start a section
Rationale at the end (before Conclusions).

About the interface: The name "bom", even though it is inside
namespace utf8, could be spelled out byte_order_mark (to help users
know what to google for if they don't know the context).

Generally, I don't think exceptions are appropriate here. A return
status code would serve equally well and be more efficient. In fact,
for append, it is enough to check that "result == append(cp, result)"
to know you have an invalid code point. A similar design works with
the other functions.

next() and prior() should not return the code point. If the only goal
is to advance the iterator, why impose additional calculations? If
the code point is needed, have other functions called extract_next()
and extract_prior(). Unicode has the advantage that you know where
the sequences begin, so next(it,end) could even work if 'it' points
to the middle of a byte sequence, but it would require to backtrack
first to see the whole sequence if you need to return a value, even
though this value may be discarded by the caller. Finally, what if
end is not an issue (e.g. you know the input is valid and don't want
to test for end of sequence)? You may choose not to support this if
you want to promote safe usage, but for efficiency some users may
require it. It would be better imho to have:
     template <typename octet_iterator>
         std::pair<octet_iterator,bool> next(octet_iterator it);
     template <typename octet_iterator>
         std::pair<octet_iterator,bool> next(octet_iterator it,
octet_iterator end);
             // Return the beginning of the next byte sequence after
the specified 'it'.
             // Note that this next beginning can never be equal to
'it'. If 'end' is specified
             // and is encountered before the next beginning is
found, then return 'std::make_pair(it,true)'.
             // In case of an invalid UTF8 sequence, return
'std::make_pair(it2,false)' where 'it2' is the first
             // position that can be deduced to be invalid starting
at 'it'.
In this way, all outcomes can be decoded (easily and efficiently)
from the return value. Similarly:
     template <typename octet_iterator>
         std::pair<octet_iterator,bool> extract_next(octet_iterator
it, uint32_t *cp);
     template <typename octet_iterator>
         std::pair<octet_iterator,bool> extract_next(octet_iterator
it, octet_iterator end, uint32_t* cp);
             // Return the beginning of the next byte sequence after
the specified 'it', and
             // load the code point in the range between 'it' and the
return value into '*cp' if 'cp' is not null.
             // Note that this next beginning can never be equal to
'it', and that 'it' should point to
             // the beginning of a byte sequence. If 'end' is
specified and is encountered
             // before the next beginning is found, then return
'std::make_pair(it,true)'.
             // In case of an invalid UTF8 sequence, return
'std::make_pair(it2,false)' where 'it2' is the first
             // position that can be deduced to be invalid starting
at 'it'.
This is only a suggestion, I'm not sure it is the best interface. It
nevertheless shows that exceptions can be completely avoided for next
() and prior as well.

Why is the parameter n in advance required to be positive? Likewise,
advance could be reformulated as:
     template <typename octet_iterator, typename distance_type>
          std::pair<distance_type,bool> advance (octet_iterator& it,
distance_type n);
     template <typename octet_iterator, typename distance_type>
          std::pair<distance_type,bool> advance (octet_iterator& it,
distance_type n, octet_iterator end);
              // Advances an iterator 'it' by the specified number of
code points within an UTF-8 sequence.
              // Note that that the return , and that 'it' should
point to the beginning of a byte sequence.
              // If 'end' is specified and is encountered during the
advance, then return 'std::make_pair(i,true)'
              // In case of an invalid UTF8 sequence, return
'std::make_pair(i,false)' and let 'it' point to the first
             // position that can be deduced to be invalid starting
at 'it'. Note that if n>0, ++it will be used,
             // and if n<0, --it will be used and the iterator type
will be required to be bidirectional.

Both the interface and the implementation can be improved by
specifying the requirements on iterators: it appears you want to
support bidirectional iterators, but you do require random access
(when you write 'end - it < length' for instance). I'm pretty sure
it's possible to implement most functions with a ForwardIterator, and
perhaps even an InputIterator. Advance might require bidirectional
if n<0, that can be asserted from the iterator traits. Also the
expected value type of the 'octet_iterator' could be explicited
(needs to be convertible to an integral type, right?)

IMPLEMENTATION:
I have some doubts about the correctness of the implementation, since
is_trail() always matches the range 80..BF which is good only for the
third and fourth bytes, but not the special cases for the second
byte. And the testing is certainly not sufficient to catch all the
nooks and crannies of the definition. Recall that a byte sequence is
valid if it conforms to the table 3.6 in the standard (below, sorry
for the formatting):

Table 3-6. Well-Formed UTF-8 Byte Sequences
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF

Does your code check for exceptional 1st bytes such as C2, E0, ED, F0
or F4? I couldn't see how.

Surely is_code_point_valid can be optimized. Perhaps the compiler
optimizes things a bit, but I'm pretty sure it can be improved.

More efficiency can be gained by tabulating the sequence_length
function (you need a table of 256 entries): I'm taking this from the
unicode.org sample code:
/*
  * Index into the table below with the first byte of a UTF-8
sequence to
  * get the number of trailing bytes that are supposed to follow it.
  * Note that *legal* UTF-8 values can't have 4 or 5-bytes.
  */
static const char sequence_length_for_UTF8[256] = {
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
     2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

Even the validate_next could be made a bit more efficient by
tabulating the first byte: instead of the sequence length, the table
would contain various control bytes, which would be used in a switch
statement. This would eliminate the tests for special cases in the
switch(length) statement. Also by testing if you have enough memory
inside each case, you can put back the shortcut into the switch
statement and not have to pay the extra test 'if(1 == length)'. The
for loops for backtracking the iterator can be totally avoided.
E.g., in case 2 of the switch statement, you can write instead:
     if (!is_trail(*(it))) { return INCOMPLETE_SEQUENCE; }
     cp = ((cp << 6) & 0x7ff) + ((*++it) & 0x3f);

Lastly, the testing should really test *all* the boundary conditions
(that's all the cases in the table 3.6 of the unicode standard)
*and* also test invalid strings. Without a thorough test driver,
it's really hard to have any confidence into code such as the above
statement "((cp << 6) & 0x7ff) + ((*++it) & 0x3f)"....

CONCLUSION
Given that the Unicode standard is so precisely defined, I would
expect boost to provide only the highest-quality implementation. For
sure, any implementation is better than none, but by working together
(you and all interested boost members) I am sure that an excellent
library is feasible. You already provide an excellent design and a
good base. Keep it up, and let me know how I can help besides the
remarks above,

--
Hervé Brönnimann
hervebronnimann_at_[hidden]
On Dec 5, 2006, at 3:56 PM, Nemanja Trifunovic wrote:
> Hello everybody.
>
> This is the second call for the informal review of the UTF8  
> library. It is based on verson 1.02 of UTF8-CPP: http:// 
> utfcpp.sourceforge.net/ and you can find it at http://boost- 
> consulting.com/vault/index.php? 
> PHPSESSID=8j6irqkpv3reg5s1gv0lge6um5&direction=0&order=&directory=Stri 
> ngs%20-%20Text%20Processing
>
> The library is stable and is being used for several months now.
>
> Thanks in advance
>
> Nemanja Trifunovic

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk