|
Boost : |
From: Nemanja Trifunovic (nemanja_trifunovic_at_[hidden])
Date: 2006-12-06 08:23:22
Hervé,
First of all, many thanks for your valuable input. This is exactly the kind of review I was looking for. I'll try to answer your points:
1) Unicode glossary: In the first draft of the documentation (which was originaly an article at Code Project) I planned to include a short introduction to Unicode in general and UTF-8 in particular, but that "short" introduction started growing so fast that I decided to leave it out completely, and instead assume the reader is familiar with Unicode. Of course, this may be changed, but I am not sure Boost is the right place to teach Unicode.
2) bom vs byte_order_mark: I guess I have worked with language applications and libraries for too long :) We call it "bom" even when we speak. Anyway, you may be right, but I would like to hear what others think. For instance, "is_byte_order_mark" looks a little too verbose to me, but I don't insist on the name.
3) Exceptions vs return status code: Actually, that was a tough decision :) In general, the rule of thumb I follow when deciding on this is: "if the errors are exceptional (rare) in the common use case scenarios, throw. Otherwise, return a code". In the kind of applications I work with, ill-formed utf-8 is a rarity, but there may be cases when illegal utf-8 sequences are all over the place. For that scenarios, I provided functions find_invalid and is_valid. The typical way to use them would be to get a utf-8 encoded sequence, check it for validity, and then use functions from unchecked namespace which are very fast.
4) next and prior: In an early version of the library (may still be in SVN at Source Forge) I had a function get() that would decode the following UTF-8 sequence, without progreesing the iterator, but decided to leave it out and keep the interface minimal. With next() the calculation needs to be performed anyway, to determine if the sequence is valid. If the end of the sequence is not an issue, you can use unchecked::next() which does not check for the end of the sequence, but also skips the validity test.
5) advance() requires positive argument. One reason for this is that the performance would differ significantly for positive vs negative values. In fact at one point I was seriously considering to exclude prior() and insist on forward iterations - the only reason I keep it is its ability to find the beginning of a valid UTF-8 sequence given a position "in the middle" of it.
6) Iterator requirements. In the beginning, I was indeed planning to require bidirectional iterators, but than came to the conclusion that it doesn't make much sense - strings are *always* kept in sequences with random access, be it char*, std::string, CString, you name it. Therefore, the requirement for octet_iterator is random access. But you are right - I should explicitelly note this.
7) Correctness and testing: The test I included to the Boost submission contains only the "smoke test". The full test suite I run (including negative tests and boundary conditions) can be found at the Source Forge SVN: http://utfcpp.svn.sourceforge.net/viewvc/utfcpp/v1_0/test_drivers/
8) Optimizations: I did invest some time into it, and somewhat improved the performance from the beta versions, but probably more work can be done in this regard. I tried to optimize the sequence_length by introducing the table you mention, but to my surprise it ran slower for the test case I was measuring, so I reverted it to the present (naive?) version.
Thanks again for your comments, and I hope you will have more as the library progresses.
Nemanja Trifunovic.
----- Original Message ----
From: Hervé Brönnimann <hervebronnimann_at_[hidden]>
To: boost_at_[hidden]
Sent: Wednesday, December 6, 2006 12:35:11 AM
Subject: Re: [boost] UTF8 library - second call for informal review
Nemanja: I don't use UTF8 myself, but I've had to program
is_valid_utf8 and utf8_to_xml10 recently. You're lucky that I just
gave is_valid as an assignment to a student! I haven't done a review
in a long time, not even for formal reviews, and this is only an
informal review.
Given that unicode is a standard that *should* be supported by C++,
really, this would be a prime candidate for inclusion in boost. I
strongly support your efforts. I started writing a few comments, and
it comes almost as a complete review :-) So here goes my mini-review.
Keep in mind I want to be as constructive and positive as possible.
I'm definitely hoping something will come out of this.
DESIGN
About the documentation: I wish there was a short glossary at the
beginning (it's not provided by unicode.org - the glossary there is
wayyyy too big!). Just to explain "code point", "byte sequence",
"surrogate", etc. Some pointers to the unicode.org FAQ (e.g. for
BOM). Also, before the reference, it would be nice to have a
synopsis to refer to just to remember the interface. And as things
will evolve (interface, design, etc.) you'll need to start a section
Rationale at the end (before Conclusions).
About the interface: The name "bom", even though it is inside
namespace utf8, could be spelled out byte_order_mark (to help users
know what to google for if they don't know the context).
Generally, I don't think exceptions are appropriate here. A return
status code would serve equally well and be more efficient. In fact,
for append, it is enough to check that "result == append(cp, result)"
to know you have an invalid code point. A similar design works with
the other functions.
next() and prior() should not return the code point. If the only goal
is to advance the iterator, why impose additional calculations? If
the code point is needed, have other functions called extract_next()
and extract_prior(). Unicode has the advantage that you know where
the sequences begin, so next(it,end) could even work if 'it' points
to the middle of a byte sequence, but it would require to backtrack
first to see the whole sequence if you need to return a value, even
though this value may be discarded by the caller. Finally, what if
end is not an issue (e.g. you know the input is valid and don't want
to test for end of sequence)? You may choose not to support this if
you want to promote safe usage, but for efficiency some users may
require it. It would be better imho to have:
template <typename octet_iterator>
std::pair<octet_iterator,bool> next(octet_iterator it);
template <typename octet_iterator>
std::pair<octet_iterator,bool> next(octet_iterator it,
octet_iterator end);
// Return the beginning of the next byte sequence after
the specified 'it'.
// Note that this next beginning can never be equal to
'it'. If 'end' is specified
// and is encountered before the next beginning is
found, then return 'std::make_pair(it,true)'.
// In case of an invalid UTF8 sequence, return
'std::make_pair(it2,false)' where 'it2' is the first
// position that can be deduced to be invalid starting
at 'it'.
In this way, all outcomes can be decoded (easily and efficiently)
from the return value. Similarly:
template <typename octet_iterator>
std::pair<octet_iterator,bool> extract_next(octet_iterator
it, uint32_t *cp);
template <typename octet_iterator>
std::pair<octet_iterator,bool> extract_next(octet_iterator
it, octet_iterator end, uint32_t* cp);
// Return the beginning of the next byte sequence after
the specified 'it', and
// load the code point in the range between 'it' and the
return value into '*cp' if 'cp' is not null.
// Note that this next beginning can never be equal to
'it', and that 'it' should point to
// the beginning of a byte sequence. If 'end' is
specified and is encountered
// before the next beginning is found, then return
'std::make_pair(it,true)'.
// In case of an invalid UTF8 sequence, return
'std::make_pair(it2,false)' where 'it2' is the first
// position that can be deduced to be invalid starting
at 'it'.
This is only a suggestion, I'm not sure it is the best interface. It
nevertheless shows that exceptions can be completely avoided for next
() and prior as well.
Why is the parameter n in advance required to be positive? Likewise,
advance could be reformulated as:
template <typename octet_iterator, typename distance_type>
std::pair<distance_type,bool> advance (octet_iterator& it,
distance_type n);
template <typename octet_iterator, typename distance_type>
std::pair<distance_type,bool> advance (octet_iterator& it,
distance_type n, octet_iterator end);
// Advances an iterator 'it' by the specified number of
code points within an UTF-8 sequence.
// Note that that the return , and that 'it' should
point to the beginning of a byte sequence.
// If 'end' is specified and is encountered during the
advance, then return 'std::make_pair(i,true)'
// In case of an invalid UTF8 sequence, return
'std::make_pair(i,false)' and let 'it' point to the first
// position that can be deduced to be invalid starting
at 'it'. Note that if n>0, ++it will be used,
// and if n<0, --it will be used and the iterator type
will be required to be bidirectional.
Both the interface and the implementation can be improved by
specifying the requirements on iterators: it appears you want to
support bidirectional iterators, but you do require random access
(when you write 'end - it < length' for instance). I'm pretty sure
it's possible to implement most functions with a ForwardIterator, and
perhaps even an InputIterator. Advance might require bidirectional
if n<0, that can be asserted from the iterator traits. Also the
expected value type of the 'octet_iterator' could be explicited
(needs to be convertible to an integral type, right?)
IMPLEMENTATION:
I have some doubts about the correctness of the implementation, since
is_trail() always matches the range 80..BF which is good only for the
third and fourth bytes, but not the special cases for the second
byte. And the testing is certainly not sufficient to catch all the
nooks and crannies of the definition. Recall that a byte sequence is
valid if it conforms to the table 3.6 in the standard (below, sorry
for the formatting):
Table 3-6. Well-Formed UTF-8 Byte Sequences
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
Does your code check for exceptional 1st bytes such as C2, E0, ED, F0
or F4? I couldn't see how.
Surely is_code_point_valid can be optimized. Perhaps the compiler
optimizes things a bit, but I'm pretty sure it can be improved.
More efficiency can be gained by tabulating the sequence_length
function (you need a table of 256 entries): I'm taking this from the
unicode.org sample code:
/*
* Index into the table below with the first byte of a UTF-8
sequence to
* get the number of trailing bytes that are supposed to follow it.
* Note that *legal* UTF-8 values can't have 4 or 5-bytes.
*/
static const char sequence_length_for_UTF8[256] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
Even the validate_next could be made a bit more efficient by
tabulating the first byte: instead of the sequence length, the table
would contain various control bytes, which would be used in a switch
statement. This would eliminate the tests for special cases in the
switch(length) statement. Also by testing if you have enough memory
inside each case, you can put back the shortcut into the switch
statement and not have to pay the extra test 'if(1 == length)'. The
for loops for backtracking the iterator can be totally avoided.
E.g., in case 2 of the switch statement, you can write instead:
if (!is_trail(*(it))) { return INCOMPLETE_SEQUENCE; }
cp = ((cp << 6) & 0x7ff) + ((*++it) & 0x3f);
Lastly, the testing should really test *all* the boundary conditions
(that's all the cases in the table 3.6 of the unicode standard)
*and* also test invalid strings. Without a thorough test driver,
it's really hard to have any confidence into code such as the above
statement "((cp << 6) & 0x7ff) + ((*++it) & 0x3f)"....
CONCLUSION
Given that the Unicode standard is so precisely defined, I would
expect boost to provide only the highest-quality implementation. For
sure, any implementation is better than none, but by working together
(you and all interested boost members) I am sure that an excellent
library is feasible. You already provide an excellent design and a
good base. Keep it up, and let me know how I can help besides the
remarks above,
--
Hervé Brönnimann
hervebronnimann_at_[hidden]
On Dec 5, 2006, at 3:56 PM, Nemanja Trifunovic wrote:
> Hello everybody.
>
> This is the second call for the informal review of the UTF8
> library. It is based on verson 1.02 of UTF8-CPP: http://
> utfcpp.sourceforge.net/ and you can find it at http://boost-
> consulting.com/vault/index.php?
> PHPSESSID=8j6irqkpv3reg5s1gv0lge6um5&direction=0&order=&directory=Stri
> ngs%20-%20Text%20Processing
>
> The library is stable and is being used for several months now.
>
> Thanks in advance
>
> Nemanja Trifunovic
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
____________________________________________________________________________________
Have a burning question?
Go to www.Answers.yahoo.com and get answers from real people who know.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk