Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Cory Nelson (phrosty_at_[hidden])
Date: 2009-07-18 10:44:15


On Sat, Jul 18, 2009 at 5:02 AM, Phil Endecott
<spam_from_boost_dev_at_[hidden]> wrote:
> Cory Nelson wrote:
>>
>> I finally found some time to do some optimizations of my own and have
>> had some good progress using a small lookup table, a switch, and
>> slightly reducing branches.  See line 318:
>>
>> http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
>>
>> Despite these efforts, Windows 7 still decodes UTF-8 three times
>> faster (~750MiB/s vs ~240MiB/s on my Core 2).  I assume they are either
>> using some gigantic lookup tables or SSE.
>
> Hi Cory,
>
> What is your test input?

I'm using a large file with a mix of many languages in it: JMDict,
available here: http://www.csse.monash.edu.au/~jwb/jmdict.html

> When the input is largely ASCII, a worthwhile optimisation is to cast groups
> of 4 (or 8) characters to ints and & with 0x80808080; if the answer is zero,
> no further conversion is needed.

It is something I've considered, but it is a bit harder to translate
to working with generic iterators.
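
For readers following along, here is a minimal sketch of the word-at-a-time
ASCII check Phil describes, assuming the input is a contiguous byte buffer
rather than a generic iterator range. The names is_ascii4 and ascii_prefix
are hypothetical and do not come from either poster's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if the 4 bytes at p all have the high bit clear,
// i.e. all four are ASCII and need no further UTF-8 decoding.
inline bool is_ascii4(const unsigned char* p)
{
    std::uint32_t w;
    std::memcpy(&w, p, sizeof w);   // sidesteps alignment and aliasing issues
    return (w & 0x80808080u) == 0;  // byte order does not matter for this mask
}

// Hypothetical helper: length of the leading run of ASCII bytes in [p, p + n).
inline std::size_t ascii_prefix(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;
    while (i + 4 <= n && is_ascii4(p + i))  // fast path: 4 bytes per test
        i += 4;
    while (i < n && p[i] < 0x80)            // tail: one byte at a time
        ++i;
    return i;
}
```

A decoder could copy the first ascii_prefix(p, n) bytes straight to the
output and fall back to the general path from there. The memcpy is what ties
this to contiguous storage, which is the difficulty with generic iterators
mentioned above.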

> In general I'm unsure of the performance issues of lookup tables compared to
> explicit bit-manipulation.  Cache effects may be significant, and a
> benchmark will tend to warm up the cache better than a real application
> might.
>
> I can't see how SSE could be applied to this problem, but it's not something
> I know much about.

I believe there is an SSE algorithm out there, but of course it won't
work with iterators.
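
To make the lookup-table versus bit-manipulation question concrete, here is a
small sketch of one place the choice comes up: working out the sequence
length from a UTF-8 lead byte. Both functions are hypothetical illustrations,
not the code at the URL above, and neither one rejects the lead bytes 0xC0,
0xC1, or 0xF5 through 0xF7, which a full decoder must treat as invalid.

```cpp
#include <cstdint>

// Bit-manipulation version: classify the lead byte by its high bits.
// Returns 0 for continuation bytes and for 0xF8..0xFF.
inline int seq_len_bits(std::uint8_t b)
{
    if (b < 0x80)           return 1;  // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Table version: 32 entries indexed by the top five bits of the lead byte.
inline int seq_len_table(std::uint8_t b)
{
    static const std::uint8_t table[32] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 00000..01111: ASCII
        0, 0, 0, 0, 0, 0, 0, 0,                          // 10000..10111: continuation
        2, 2, 2, 2,                                      // 11000..11011
        3, 3,                                            // 11100..11101
        4,                                               // 11110
        0                                                // 11111: invalid
    };
    return table[b >> 3];
}
```

The table here is only 32 bytes, so it fits in a single cache line on common
hardware. Which version is faster depends on the compiler, the CPU, and the
input mix, so it would need to be measured rather than assumed.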

> I don't have much time to work on this right now, but if the algorithm plus
> test harness and test data were bundled up into something that I can just
> "make", I will try to compare it with my version.

I will try to upload my benchmarking code somewhere today.

-- 
Cory Nelson
http://int64.org
