Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2009-07-17 17:29:16


On Fri, Jul 17, 2009 at 20:02, Cory Nelson <phrosty_at_[hidden]> wrote:
>
> On Thu, Feb 12, 2009 at 6:32 PM, Phil
> Endecott<spam_from_boost_dev_at_[hidden]> wrote:
> > Cory Nelson wrote:
> >>
> >> Is there interest in having a Unicode codec library submitted to Boost?
> >>

> > I've had a look at your code.  I like that you have implemented what I
> > called an "error policy".  It is wasteful to continuously check the validity
> > of input that comes from trusted code, but important to check it when it
> > comes from an untrusted source.  However I'm not sure that your actual
> > conversion is as efficient as it could be.  I spent quite a while profiling
> > my UTF8 conversion and came up with this:
> >
> >    http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
> >
> > I think that you could largely copy&paste bits of that into the right places
> > in your algorithm and get a significant speedup.
>
> I finally found some time to do some optimizations of my own and have
> had some good progress using a small lookup table, a switch, and
> slightly reducing branches.  See line 318:
>
> http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
>
> Despite these efforts, Windows 7 still decodes UTF-8 three times
> faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either
> using some gigantic lookup tables or SSE.

Dear Cory,

Though I'm not sure decoding this much UTF-8-encoded data is often a
bottleneck in practice, let me add my thoughts. I wrote a
utf8_codecvt::do_in that uses lookup tables for a Unicode library that
Graham and I started long ago (in the vault). I haven't actually
compared performance of my implementation against yours, so mine may
well be slower. However, the technique may be of interest.

From the first UTF-8 byte, the values and the valid ranges of the 0 to
3 following bytes are known (Table 3-7 in the Unicode standard,
version 5.0). Therefore, you can put these in lookup tables (7 tables
with 256 entries each, holding a 32-bit code point value). A master
table with 256 entries maps the first byte to the 0 to 3 tables for
the following bytes. You look up the code point values for the
consecutive bytes in the appropriate tables and OR them together. The
trick that I suspect gave me a speed-up compared to the function the
Unicode Consortium publishes is that the code point values for invalid
bytes are 0xffffffff. After OR-ing all values, you need only one check
to see whether there was an error. This reduces branches.
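
To make this concrete, here is a rough sketch of the kind of table
layout I mean. This is not the code from the vault: the names are made
up and the tables are built at start-up rather than generated into a
source file. The decode step, though, is the same idea: OR the
per-byte values together and make one check against 0xffffffff.

// Rough sketch only: table-driven UTF-8 decoding where every invalid
// byte maps to 0xffffffff, so a single check after OR-ing detects any
// error. All names here are invented for the example.
#include <cstddef>

typedef unsigned int uint32;                 // assume 32-bit unsigned int

static const uint32 kInvalid = 0xffffffffu;

// A trail-byte table: the byte's contribution at a given shift, with
// the valid range baked in; bytes outside the range map to kInvalid.
struct trail_table { uint32 value[256]; };

static trail_table make_trail(int lo, int hi, int shift) {
    trail_table t;
    for (int b = 0; b < 256; ++b)
        t.value[b] = (b >= lo && b <= hi)
            ? uint32(b & 0x3f) << shift : kInvalid;
    return t;
}

// The seven trail tables: three normal positions plus the four
// restricted ranges from Table 3-7 (after E0, ED, F0 and F4).
static const trail_table trail_0  = make_trail(0x80, 0xbf, 0);
static const trail_table trail_6  = make_trail(0x80, 0xbf, 6);
static const trail_table trail_12 = make_trail(0x80, 0xbf, 12);
static const trail_table trail_e0 = make_trail(0xa0, 0xbf, 6);
static const trail_table trail_ed = make_trail(0x80, 0x9f, 6);
static const trail_table trail_f0 = make_trail(0x90, 0xbf, 12);
static const trail_table trail_f4 = make_trail(0x80, 0x8f, 12);

// Master table: the first byte's own contribution, the number of trail
// bytes, and which trail table applies to each of them.
struct lead_entry {
    uint32 value;
    int length;
    const trail_table* trail[3];
};

static lead_entry master[256];

static bool init_master() {
    for (int b = 0; b < 256; ++b) {          // everything invalid by default
        master[b].value = kInvalid;
        master[b].length = 0;
    }
    for (int b = 0x00; b <= 0x7f; ++b) {     // ASCII
        master[b].value = uint32(b);
    }
    for (int b = 0xc2; b <= 0xdf; ++b) {     // two-byte sequences
        master[b].value = uint32(b & 0x1f) << 6;
        master[b].length = 1;
        master[b].trail[0] = &trail_0;
    }
    for (int b = 0xe0; b <= 0xef; ++b) {     // three-byte sequences
        master[b].value = uint32(b & 0x0f) << 12;
        master[b].length = 2;
        master[b].trail[0] =
            b == 0xe0 ? &trail_e0 : b == 0xed ? &trail_ed : &trail_6;
        master[b].trail[1] = &trail_0;
    }
    for (int b = 0xf0; b <= 0xf4; ++b) {     // four-byte sequences
        master[b].value = uint32(b & 0x07) << 18;
        master[b].length = 3;
        master[b].trail[0] =
            b == 0xf0 ? &trail_f0 : b == 0xf4 ? &trail_f4 : &trail_12;
        master[b].trail[1] = &trail_6;
        master[b].trail[2] = &trail_0;
    }
    return true;
}
static const bool master_ready = init_master();

// Decode one code point starting at p; *length gets the bytes consumed.
// Returns kInvalid for any ill-formed sequence (one check at the
// caller). The sketch assumes the buffer holds the complete sequence;
// real code must also check against the end of the input.
uint32 decode_one(const unsigned char* p, std::size_t* length) {
    const lead_entry& e = master[*p];
    uint32 cp = e.value;
    for (int i = 0; i < e.length; ++i)
        cp |= e.trail[i]->value[p[1 + i]];   // OR in each following byte
    *length = 1 + std::size_t(e.length);
    return cp;
}

A caller would then loop: call decode_one, compare the result against
0xffffffff once, and only branch into error handling when that check
actually fires.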

As I said, I'm not sure it's actually faster than your code, but it
may be an interesting technique. The debug compilation of
utf8_data.cpp, which contains the tables and nothing else, is 30kB on
Linux gcc; hardly "gigantic". However, with a threefold speed increase
it seems likely that Windows uses bigger tables or something smarter
than I came up with.

To answer your original question, "is there any interest in Unicode
codecs?", yes.

It now seems to me that a full Unicode library would be hard to get
accepted into Boost; it would be more feasible to get a UTF library
submitted, which is more along the lines of your library. (A Unicode
library could later be based on the same principles.)

Freestanding transcoding functions and codecvt facets are not the only
things I believe a UTF library would need, though. I'd add to the list:
- iterator adaptors (input and output);
- range adaptors;
- a code point string;
- compile-time encoding (meta-programming; a rough sketch follows this list);
- documentation.
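
To show what I have in mind with the compile-time encoding item (this
is made-up example code, not taken from the library in the vault): the
UTF-8 bytes of a code point can be computed entirely at compile time,
so a parser can compare them as plain integer constants.

// Made-up sketch: compute the UTF-8 encoding of a code point at
// compile time with template meta-programming. Only the two- and
// three-byte forms are shown; a real version would cover all lengths
// and reject surrogates.
#include <cstdio>

template <unsigned CodePoint, bool ThreeBytes = (CodePoint >= 0x800)>
struct utf8_encoded;

// Two-byte form, 110xxxxx 10xxxxxx, for U+0080..U+07FF.
template <unsigned CodePoint>
struct utf8_encoded<CodePoint, false> {
    enum {
        byte0 = 0xc0 | (CodePoint >> 6),
        byte1 = 0x80 | (CodePoint & 0x3f)
    };
};

// Three-byte form, 1110xxxx 10xxxxxx 10xxxxxx, for U+0800..U+FFFF.
template <unsigned CodePoint>
struct utf8_encoded<CodePoint, true> {
    enum {
        byte0 = 0xe0 | (CodePoint >> 12),
        byte1 = 0x80 | ((CodePoint >> 6) & 0x3f),
        byte2 = 0x80 | (CodePoint & 0x3f)
    };
};

int main() {
    // U+00E9 is 0xc3 0xa9 in UTF-8; both bytes are compile-time
    // constants, so a parser could switch on them directly.
    std::printf("%02x %02x\n",
                unsigned(utf8_encoded<0x00e9>::byte0),
                unsigned(utf8_encoded<0x00e9>::byte1));
    return 0;
}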

(The compile-time encoding thing may sound esoteric. I believe it would
be useful for fast parsers. It's not that hard to do at any rate.) I
suspect the string is a pain to design to please everyone. Iterator
adaptors, I found, are a pain to attach error policies to and to get
right. For example, with a policy equivalent to your
"ReplaceCheckFailures", you need to produce the same code point
sequence whether you traverse an invalidly encoded string forward or
backward. I've got code for UTF-8 that passes my unit tests, but the
error checking and the one-by-one decoding make it much harder to
optimise.
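
To show the kind of rule I mean, here is a fresh sketch written for
this mail (not the code that passes my unit tests). It uses a
deliberately simple policy: well-formed sequences decode normally, and
every other byte becomes exactly one U+FFFD. Because a lead byte can
never double as a trail byte, the backward step below lands on exactly
the same boundaries as the forward step, so both directions produce
the same code point sequence.

// Sketch of a "replace on failure" decode step that is consistent in
// both directions. Names invented for this example.
#include <cstddef>

typedef unsigned int uint32;

static const uint32 kReplacement = 0xfffd;

static bool is_trail(unsigned char b) { return (b & 0xc0) == 0x80; }

// Length of the well-formed sequence starting at p (and ending before
// 'end'), or 0 if the bytes there are ill-formed or truncated.
// The ranges follow Table 3-7.
static std::size_t well_formed_length(const unsigned char* p,
                                      const unsigned char* end) {
    std::size_t avail = std::size_t(end - p);
    unsigned char b0 = p[0];
    if (b0 <= 0x7f) return 1;
    if (b0 >= 0xc2 && b0 <= 0xdf)
        return (avail >= 2 && is_trail(p[1])) ? 2 : 0;
    if (b0 >= 0xe0 && b0 <= 0xef) {
        if (avail < 3 || !is_trail(p[2])) return 0;
        unsigned char lo = (b0 == 0xe0) ? 0xa0 : 0x80;
        unsigned char hi = (b0 == 0xed) ? 0x9f : 0xbf;
        return (p[1] >= lo && p[1] <= hi) ? 3 : 0;
    }
    if (b0 >= 0xf0 && b0 <= 0xf4) {
        if (avail < 4 || !is_trail(p[2]) || !is_trail(p[3])) return 0;
        unsigned char lo = (b0 == 0xf0) ? 0x90 : 0x80;
        unsigned char hi = (b0 == 0xf4) ? 0x8f : 0xbf;
        return (p[1] >= lo && p[1] <= hi) ? 4 : 0;
    }
    return 0;
}

static uint32 decode_well_formed(const unsigned char* p, std::size_t n) {
    static const unsigned char first_mask[5] = { 0, 0x7f, 0x1f, 0x0f, 0x07 };
    uint32 cp = p[0] & first_mask[n];
    for (std::size_t i = 1; i < n; ++i)
        cp = (cp << 6) | (p[i] & 0x3f);
    return cp;
}

// Forward step (precondition: p != end): decode at p, then advance it.
uint32 next(const unsigned char*& p, const unsigned char* end) {
    std::size_t n = well_formed_length(p, end);
    if (n == 0) { ++p; return kReplacement; }  // one U+FFFD per bad byte
    uint32 cp = decode_well_formed(p, n);
    p += n;
    return cp;
}

// Backward step (precondition: p != begin): decode the code point that
// ends just before p, then move p back. Because a lead byte is never a
// trail byte, at most one well-formed sequence can end exactly at p,
// and its start is always a position the forward step also stops at.
uint32 prior(const unsigned char*& p, const unsigned char* begin) {
    for (std::size_t n = 1; n <= 4; ++n) {
        if (std::size_t(p - begin) < n) break;
        if (well_formed_length(p - n, p) == n) {
            p -= n;
            return decode_well_formed(p, n);
        }
    }
    --p;
    return kReplacement;
}

A policy that replaces a whole maximal ill-formed subsequence with a
single U+FFFD can be made consistent too, but its backward step is
fiddlier; that is much of what makes these adaptors painful to get
right.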

I believe that Mathias Gaunard is working on a library at
<http://blogloufoque.free.fr/unicode/doc/html/>. I don't know how
complete it is, but from the documentation it looks well thought-out
so far. I'm looking forward to seeing where that's going!

Cheers,
Rogier

