Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Cory Nelson (phrosty_at_[hidden])
Date: 2009-07-17 15:02:53


On Thu, Feb 12, 2009 at 6:32 PM, Phil Endecott
<spam_from_boost_dev_at_[hidden]> wrote:
> Cory Nelson wrote:
>>
>> Is there interest in having a Unicode codec library submitted to Boost?
>>
>> Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008):
>> http://svn.int64.org/viewvc/int64/snips/unicode.hpp
>>
>> Right now it is pretty simple to use:
>>
>> transcode<utf8, utf16le>(forwarditerator, forwarditeratorend,
>> outputiterator, traits [, maximum]);
>> transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [,
>> maximum]);
>>
>> There is also a codecvt facet which supports any-to-any.
>>
>> Supports UTF-8, UTF-16, and UTF-32, in little or big endian.  Has a
>> special wchar_encoding that maps to UTF-16 or UTF-32 depending on your
>> platform.  A traits class controls error handling.
>
> Hi Cory,
>
> Yes, Boost definitely ought to have Unicode conversion.  Yours is not the
> first proposal, and there is actually already one implementation hidden
> inside another Boost library.  I wrote some UTF conversion code a while ago
> and was trying to build a more comprehensive character set library around it
> but realised after a while that that was too mammoth a job.  So I think that
> getting _just_ the UTF conversion into Boost, ASAP, is the right thing to
> do.
>
> I've had a look at your code.  I like that you have implemented what I
> called an "error policy".  It is wasteful to continuously check the validity
> of input that comes from trusted code, but important to check it when it
> comes from an untrusted source.  However I'm not sure that your actual
> conversion is as efficient as it could be.  I spent quite a while profiling
> my UTF8 conversion and came up with this:
>
>    http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
>
> I think that you could largely copy&paste bits of that into the right places
> in your algorithm and get a significant speedup.

I finally found some time to do some optimizations of my own and have
made good progress using a small lookup table, a switch, and a slight
reduction in branches. See line 318:

http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup

Despite these efforts, Windows 7 still decodes UTF-8 about three times
faster (~750 MiB/s vs. ~240 MiB/s on my Core 2). I assume they are using
either some gigantic lookup tables or SSE.
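
To make the "small lookup table plus switch" idea concrete, here is a
minimal sketch of that style of UTF-8 decoder. It is not the code at
line 318 of unicode.hpp; the names decode_utf8, kSeqLen, kLeadMask and
kMin are made up for illustration, and replacing bad input with U+FFFD
is just one assumed error policy among several possible ones.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical table: sequence length indexed by the top five bits of the
// lead byte. 0 marks a byte that cannot start a sequence.
static const std::uint8_t kSeqLen[32] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0xxxxxxx
    0, 0, 0, 0, 0, 0, 0, 0,                          // 10xxxxxx (continuation)
    2, 2, 2, 2,                                      // 110xxxxx
    3, 3,                                            // 1110xxxx
    4,                                               // 11110xxx
    0                                                // 11111xxx (invalid)
};

// Payload mask for the lead byte and smallest legal code point, per length.
static const std::uint8_t kLeadMask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
static const std::uint32_t kMin[5] = {0, 0, 0x80, 0x800, 0x10000};

// Decodes UTF-8 into code points. Assumed error policy: emit U+FFFD for
// each offending byte and resynchronise on the next byte.
std::vector<std::uint32_t> decode_utf8(const std::string& in) {
    std::vector<std::uint32_t> out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    const unsigned char* end = p + in.size();

    while (p != end) {
        const unsigned len = kSeqLen[*p >> 3];  // one table load, no branch chain
        if (len == 0 || static_cast<std::size_t>(end - p) < len) {
            out.push_back(0xFFFD);
            ++p;
            continue;
        }

        std::uint32_t cp = *p & kLeadMask[len];
        const unsigned char* q = p + 1;
        bool ok = true;

        // Each case consumes one continuation byte and falls through.
        switch (len) {
            case 4:
                ok = ok && (*q & 0xC0) == 0x80;
                cp = (cp << 6) | (*q++ & 0x3F);
                // fall through
            case 3:
                ok = ok && (*q & 0xC0) == 0x80;
                cp = (cp << 6) | (*q++ & 0x3F);
                // fall through
            case 2:
                ok = ok && (*q & 0xC0) == 0x80;
                cp = (cp << 6) | (*q++ & 0x3F);
                // fall through
            default:
                break;
        }

        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (!ok || cp < kMin[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);
            ++p;
            continue;
        }

        out.push_back(cp);
        p += len;
    }
    return out;
}
```

The table removes the chain of comparisons on the lead byte, and the
fall-through switch handles all continuation bytes without a loop. A
decoder for trusted input could skip the validity checks entirely,
which is the point of making the error policy a traits parameter.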

-- 
Cory Nelson
http://int64.org
