Boost logo

Boost :

From: Nemanja Trifunovic (nemanja_trifunovic_at_[hidden])
Date: 2006-12-06 08:23:22


Hervé, First of all, many thanks for your valuable input. This is exactly the kind of review I was looking for. I'll try to answer your points: 1) Unicode glossary: In the first draft of the documentation (which was originaly an article at Code Project) I planned to include a short introduction to Unicode in general and UTF-8 in particular, but that "short" introduction started growing so fast that I decided to leave it out completely, and instead assume the reader is familiar with Unicode. Of course, this may be changed, but I am not sure Boost is the right place to teach Unicode. 2) bom vs byte_order_mark: I guess I have worked with language applications and libraries for too long :) We call it "bom" even when we speak. Anyway, you may be right, but I would like to hear what others think. For instance, "is_byte_order_mark" looks a little too verbose to me, but I don't insist on the name. 3) Exceptions vs return status code: Actually, that was a tough decision :) In general, the rule of thumb I follow when deciding on this is: "if the errors are exceptional (rare) in the common use case scenarios, throw. Otherwise, return a code". In the kind of applications I work with, ill-formed utf-8 is a rarity, but there may be cases when illegal utf-8 sequences are all over the place. For that scenarios, I provided functions find_invalid and is_valid. The typical way to use them would be to get a utf-8 encoded sequence, check it for validity, and then use functions from unchecked namespace which are very fast. 4) next and prior: In an early version of the library (may still be in SVN at Source Forge) I had a function get() that would decode the following UTF-8 sequence, without progreesing the iterator, but decided to leave it out and keep the interface minimal. With next() the calculation needs to be performed anyway, to determine if the sequence is valid. If the end of the sequence is not an issue, you can use unchecked::next() which does not check for the end of the sequence, but also skips the validity test. 5) advance() requires positive argument. One reason for this is that the performance would differ significantly for positive vs negative values. In fact at one point I was seriously considering to exclude prior() and insist on forward iterations - the only reason I keep it is its ability to find the beginning of a valid UTF-8 sequence given a position "in the middle" of it. 6) Iterator requirements. In the beginning, I was indeed planning to require bidirectional iterators, but than came to the conclusion that it doesn't make much sense - strings are *always* kept in sequences with random access, be it char*, std::string, CString, you name it. Therefore, the requirement for octet_iterator is random access. But you are right - I should explicitelly note this. 7) Correctness and testing: The test I included to the Boost submission contains only the "smoke test". The full test suite I run (including negative tests and boundary conditions) can be found at the Source Forge SVN: http://utfcpp.svn.sourceforge.net/viewvc/utfcpp/v1_0/test_drivers/ 8) Optimizations: I did invest some time into it, and somewhat improved the performance from the beta versions, but probably more work can be done in this regard. I tried to optimize the sequence_length by introducing the table you mention, but to my surprise it ran slower for the test case I was measuring, so I reverted it to the present (naive?) version. Thanks again for your comments, and I hope you will have more as the library progresses. Nemanja Trifunovic. ----- Original Message ---- From: Hervé Brönnimann <hervebronnimann_at_[hidden]> To: boost_at_[hidden] Sent: Wednesday, December 6, 2006 12:35:11 AM Subject: Re: [boost] UTF8 library - second call for informal review Nemanja: I don't use UTF8 myself, but I've had to program is_valid_utf8 and utf8_to_xml10 recently. You're lucky that I just gave is_valid as an assignment to a student! I haven't done a review in a long time, not even for formal reviews, and this is only an informal review. Given that unicode is a standard that *should* be supported by C++, really, this would be a prime candidate for inclusion in boost. I strongly support your efforts. I started writing a few comments, and it comes almost as a complete review :-) So here goes my mini-review. Keep in mind I want to be as constructive and positive as possible. I'm definitely hoping something will come out of this. DESIGN About the documentation: I wish there was a short glossary at the beginning (it's not provided by unicode.org - the glossary there is wayyyy too big!). Just to explain "code point", "byte sequence", "surrogate", etc. Some pointers to the unicode.org FAQ (e.g. for BOM). Also, before the reference, it would be nice to have a synopsis to refer to just to remember the interface. And as things will evolve (interface, design, etc.) you'll need to start a section Rationale at the end (before Conclusions). About the interface: The name "bom", even though it is inside namespace utf8, could be spelled out byte_order_mark (to help users know what to google for if they don't know the context). Generally, I don't think exceptions are appropriate here. A return status code would serve equally well and be more efficient. In fact, for append, it is enough to check that "result == append(cp, result)" to know you have an invalid code point. A similar design works with the other functions. next() and prior() should not return the code point. If the only goal is to advance the iterator, why impose additional calculations? If the code point is needed, have other functions called extract_next() and extract_prior(). Unicode has the advantage that you know where the sequences begin, so next(it,end) could even work if 'it' points to the middle of a byte sequence, but it would require to backtrack first to see the whole sequence if you need to return a value, even though this value may be discarded by the caller. Finally, what if end is not an issue (e.g. you know the input is valid and don't want to test for end of sequence)? You may choose not to support this if you want to promote safe usage, but for efficiency some users may require it. It would be better imho to have: template <typename octet_iterator> std::pair<octet_iterator,bool> next(octet_iterator it); template <typename octet_iterator> std::pair<octet_iterator,bool> next(octet_iterator it, octet_iterator end); // Return the beginning of the next byte sequence after the specified 'it'. // Note that this next beginning can never be equal to 'it'. If 'end' is specified // and is encountered before the next beginning is found, then return 'std::make_pair(it,true)'. // In case of an invalid UTF8 sequence, return 'std::make_pair(it2,false)' where 'it2' is the first // position that can be deduced to be invalid starting at 'it'. In this way, all outcomes can be decoded (easily and efficiently) from the return value. Similarly: template <typename octet_iterator> std::pair<octet_iterator,bool> extract_next(octet_iterator it, uint32_t *cp); template <typename octet_iterator> std::pair<octet_iterator,bool> extract_next(octet_iterator it, octet_iterator end, uint32_t* cp); // Return the beginning of the next byte sequence after the specified 'it', and // load the code point in the range between 'it' and the return value into '*cp' if 'cp' is not null. // Note that this next beginning can never be equal to 'it', and that 'it' should point to // the beginning of a byte sequence. If 'end' is specified and is encountered // before the next beginning is found, then return 'std::make_pair(it,true)'. // In case of an invalid UTF8 sequence, return 'std::make_pair(it2,false)' where 'it2' is the first // position that can be deduced to be invalid starting at 'it'. This is only a suggestion, I'm not sure it is the best interface. It nevertheless shows that exceptions can be completely avoided for next () and prior as well. Why is the parameter n in advance required to be positive? Likewise, advance could be reformulated as: template <typename octet_iterator, typename distance_type> std::pair<distance_type,bool> advance (octet_iterator& it, distance_type n); template <typename octet_iterator, typename distance_type> std::pair<distance_type,bool> advance (octet_iterator& it, distance_type n, octet_iterator end); // Advances an iterator 'it' by the specified number of code points within an UTF-8 sequence. // Note that that the return , and that 'it' should point to the beginning of a byte sequence. // If 'end' is specified and is encountered during the advance, then return 'std::make_pair(i,true)' // In case of an invalid UTF8 sequence, return 'std::make_pair(i,false)' and let 'it' point to the first // position that can be deduced to be invalid starting at 'it'. Note that if n>0, ++it will be used, // and if n<0, --it will be used and the iterator type will be required to be bidirectional. Both the interface and the implementation can be improved by specifying the requirements on iterators: it appears you want to support bidirectional iterators, but you do require random access (when you write 'end - it < length' for instance). I'm pretty sure it's possible to implement most functions with a ForwardIterator, and perhaps even an InputIterator. Advance might require bidirectional if n<0, that can be asserted from the iterator traits. Also the expected value type of the 'octet_iterator' could be explicited (needs to be convertible to an integral type, right?) IMPLEMENTATION: I have some doubts about the correctness of the implementation, since is_trail() always matches the range 80..BF which is good only for the third and fourth bytes, but not the special cases for the second byte. And the testing is certainly not sufficient to catch all the nooks and crannies of the definition. Recall that a byte sequence is valid if it conforms to the table 3.6 in the standard (below, sorry for the formatting): Table 3-6. Well-Formed UTF-8 Byte Sequences Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte U+0000..U+007F 00..7F U+0080..U+07FF C2..DF 80..BF U+0800..U+0FFF E0 A0..BF 80..BF U+1000..U+CFFF E1..EC 80..BF 80..BF U+D000..U+D7FF ED 80..9F 80..BF U+E000..U+FFFF EE..EF 80..BF 80..BF U+10000..U+3FFFF F0 90..BF 80..BF 80..BF U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF U+100000..U+10FFFF F4 80..8F 80..BF 80..BF Does your code check for exceptional 1st bytes such as C2, E0, ED, F0 or F4? I couldn't see how. Surely is_code_point_valid can be optimized. Perhaps the compiler optimizes things a bit, but I'm pretty sure it can be improved. More efficiency can be gained by tabulating the sequence_length function (you need a table of 256 entries): I'm taking this from the unicode.org sample code: /* * Index into the table below with the first byte of a UTF-8 sequence to * get the number of trailing bytes that are supposed to follow it. * Note that *legal* UTF-8 values can't have 4 or 5-bytes. */ static const char sequence_length_for_UTF8[256] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 }; Even the validate_next could be made a bit more efficient by tabulating the first byte: instead of the sequence length, the table would contain various control bytes, which would be used in a switch statement. This would eliminate the tests for special cases in the switch(length) statement. Also by testing if you have enough memory inside each case, you can put back the shortcut into the switch statement and not have to pay the extra test 'if(1 == length)'. The for loops for backtracking the iterator can be totally avoided. E.g., in case 2 of the switch statement, you can write instead: if (!is_trail(*(it))) { return INCOMPLETE_SEQUENCE; } cp = ((cp << 6) & 0x7ff) + ((*++it) & 0x3f); Lastly, the testing should really test *all* the boundary conditions (that's all the cases in the table 3.6 of the unicode standard) *and* also test invalid strings. Without a thorough test driver, it's really hard to have any confidence into code such as the above statement "((cp << 6) & 0x7ff) + ((*++it) & 0x3f)".... CONCLUSION Given that the Unicode standard is so precisely defined, I would expect boost to provide only the highest-quality implementation. For sure, any implementation is better than none, but by working together (you and all interested boost members) I am sure that an excellent library is feasible. You already provide an excellent design and a good base. Keep it up, and let me know how I can help besides the remarks above, -- Hervé Brönnimann hervebronnimann_at_[hidden] On Dec 5, 2006, at 3:56 PM, Nemanja Trifunovic wrote: > Hello everybody. > > This is the second call for the informal review of the UTF8 > library. It is based on verson 1.02 of UTF8-CPP: http:// > utfcpp.sourceforge.net/ and you can find it at http://boost- > consulting.com/vault/index.php? > PHPSESSID=8j6irqkpv3reg5s1gv0lge6um5&direction=0&order=&directory=Stri > ngs%20-%20Text%20Processing > > The library is stable and is being used for several months now. > > Thanks in advance > > Nemanja Trifunovic _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost ____________________________________________________________________________________ Have a burning question? Go to www.Answers.yahoo.com and get answers from real people who know.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk