Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Cory Nelson (phrosty_at_[hidden])
Date: 2009-07-18 10:34:47
On Fri, Jul 17, 2009 at 4:29 PM, Rogier van Dalen<rogiervd_at_[hidden]> wrote:
> On Fri, Jul 17, 2009 at 20:02, Cory Nelson <phrosty_at_[hidden]> wrote:
>> On Thu, Feb 12, 2009 at 6:32 PM, Phil
>> Endecott<spam_from_boost_dev_at_[hidden]> wrote:
>> > Cory Nelson wrote:
>> >> Is there interest in having a Unicode codec library submitted to Boost?
>> > I've had a look at your code. Â I like that you have implemented what I
>> > called an "error policy". Â It is wasteful to continuously check the validity
>> > of input that comes from trusted code, but important to check it when it
>> > comes from an untrusted source. Â However I'm not sure that your actual
>> > conversion is as efficient as it could be. Â I spent quite a while profiling
>> > my UTF8 conversion and came up with this:
>> > Â Â http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
>> > I think that you could largely copy&paste bits of that into the right places
>> > in your algorithm and get a significant speedup.
>> I finally found some time to do some optimizations of my own and have
>> had some good progress using a small lookup table, a switch, and
>> slightly deducing branches. Â See line 318:
>> Despite these efforts, Windows 7 still decodes UTF-8 three times
>> faster (~750MiB/s vs ~240MiB/s on my Core 2. Â I assume they are either
>> using some gigantic look up tables or SSE.
> Dear Cory,
> Though I'm not sure decoding this much UTF8-encoded data is often a
> bottleneck in practice, let me add my thoughts. I wrote a
> utf8_codecvt::do_in that uses lookup tables for a Unicode library that
> Graham and I started long ago (in the vault). I haven't actually
> compared performance of my implementation against yours, so mine may
> well be slower. However, the technique may be of interest.
UTF-8 is the primary bottleneck in XML decoding. That's been my
motivation thus far.
> From the first UTF-8 byte, the values and the valid ranges of the 0 to
> 3 following bytes is known (Table 3-7 in the Unicode standard version
> 5.0). Therefore, you can put these in lookup tables (7 tables with 256
> entries each, with a 32-bit code point value). A master table contains
> 256 entries mapping the first byte to 0 to 3 following tables. You
> find the code point values for consecutive bytes in the appropriate
> tables and OR them. The trick that I suspect gave me speed-up compared
> to the function the Unicode Consortium publishes, is that code point
> values for invalid bytes are 0xffffffff. After OR-ing all values, you
> need only one check to see whether there was an error. This reduces
Sounds interesting, I will try this out and see what happens.
> As I said, I'm not sure it's actually faster than your code, but it
> may be an interesting technique. The debug compilation of
> utf8_data.cpp, which contains the tables and nothing else, is 30kB on
> Linux gcc; hardly "gigantic". However, with a threefold speed increase
> it seems likely that Windows uses bigger tables or something smarter
> than I came up with.
> To answer your original question, "is there any interest in Unicode
> codecs?", yes.
> It now seems to me that a full Unicode library would be hard to get
> accepted into Boost; it would be more feasible to get a UTF library
> submitted, which is more along the lines of your library. (A Unicode
> library could later be based on the same principles.)
> Freestanding transcoding functions and codecvt facets are not the only
> thing I believe a UTF library would need, though. I'd add to the list:
> - iterator adaptors (input and output);
> - range adaptors;
> - a code point string;
> - compile-time encoding (meta-programming);
> - documentation.
I agree, mostly. I'm not sure if a special string (as opposed to
basic_string<utf16_t>) would be worthwhile -- what would you do with
it that didn't require a full Unicode library supporting it?
> (The compile-time encoding thing may sound esoteric. I believe would
> be useful for fast parsers. It's not that hard to do at any rate.) I
> suspect the string is a pain to design to please everyone. Iterator
> adaptors, I found, are a pain to attach error policies to and write
> them correctly. For example, with a policy equivalent to your
> "ReplaceCheckFailures", you need to produce the same code point
> sequence whether you traverse an invalid encoded string forward or
> backward. I've got code for UTF-8 that passes my unit tests, but the
> error checking and the one-by-one decoding makes it much harder to
> I believe that Mathias Gaunard is working on a library at
> <http://blogloufoque.free.fr/unicode/doc/html/>. I don't know how
> complete it is, but from the documentation it looks well thought-out
> so far. I'm looking forward to seeing where that's going!
> Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Cory Nelson http://int64.org
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk