Subject: Re: [boost] GSoC Unicode library: second preview
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-06-21 15:35:48
Mathias Gaunard wrote:
> Here is the documentation of the current state of the Unicode library
> that I am doing as a google summer of code project:
> http://blogloufoque.free.fr/unicode/doc/html/
Hi Mathias,
I have looked quickly at your UTF8 code at
https://svn.boost.org/trac/boost/browser/sandbox/SOC/2009/unicode/boost/unicode/utf_codecs.hpp
in comparison with mine at
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh . The
encoding is similar, though I have avoided some code duplication (which
is probably worthwhile in an inline function) and used an IF_LIKELY
macro to enable gcc's branch hinting.
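In case it is useful, a branch-hinting macro of that kind could look roughly like the sketch below. This is an assumption about its general shape, not necessarily the definition in libpbe: it uses gcc's __builtin_expect and falls back to a plain if on other compilers. The is_ascii helper is hypothetical and is only there to show the macro in use.

```cpp
// Sketch of a branch-hinting macro; the exact libpbe definition may differ.
#if defined(__GNUC__)
#  define IF_LIKELY(cond)   if (__builtin_expect(!!(cond), 1))
#  define IF_UNLIKELY(cond) if (__builtin_expect(!!(cond), 0))
#else
#  define IF_LIKELY(cond)   if (cond)
#  define IF_UNLIKELY(cond) if (cond)
#endif

// Hypothetical helper, only to exercise the macro: the ASCII case is
// marked as the expected one, so the compiler can lay it out as the
// fall-through path.
inline bool is_ascii(unsigned char b) {
  IF_LIKELY((b & 0x80) == 0) {
    return true;
  }
  return false;
}
```

The hint only affects code layout and static branch prediction, so the result is the same with or without it.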
My decoding implementation is rather different from yours, though. You
explicitly determine the length of the encoded sequence first and then
loop, while I do this:
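static char32_t decode(const_char8_ptr_t& p) {
  // One byte: 0xxxxxxx (ASCII), the most common case.
  char8_t b0 = *(p++);
  IF_LIKELY((b0&0x80)==0) {
    return b0;
  }
  // Two bytes: 110xxxxx 10xxxxxx.
  char8_t b1 = *(p++);
  check((b1&0xc0)==0x80);   // must be a continuation byte
  IF_LIKELY((b0&0xe0)==0xc0) {
    char32_t r = (b1&0x3f) | ((b0&0x1f)<<6);
    check(r>=0x80);         // reject overlong encodings
    return r;
  }
  // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
  char8_t b2 = *(p++);
  check((b2&0xc0)==0x80);
  IF_LIKELY((b0&0xf0)==0xe0) {
    char32_t r = (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
    check(r>=0x800);
    return r;
  }
  // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
  char8_t b3 = *(p++);
  check((b3&0xc0)==0x80);
  IF_LIKELY((b0&0xf8)==0xf0) {
    char32_t r = (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);
    check(r>=0x10000);
    return r;
  }
  // Control reaches here if b0 is not a valid lead byte (0x80-0xbf or
  // 0xf8-0xff); this excerpt does not show how that case is handled.
}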
You may find that this approach is faster.
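For anyone who wants to try it outside libpbe, here is a self-contained sketch of the same structure using only standard types. Several things in it are assumptions rather than library code: the name decode_utf8, the use of std::uint8_t for bytes, and a check() that throws std::runtime_error. It leaves out the branch hints, and it adds a final failure for invalid lead bytes. Like the code above, it does no bounds checking, so the caller has to guarantee that a complete sequence is readable at p.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for check(): throw on malformed input.
inline void check(bool ok) {
  if (!ok) throw std::runtime_error("invalid UTF-8");
}

// Decodes one code point starting at p and advances p past it.
// Assumes the caller guarantees a complete sequence is readable at p.
inline char32_t decode_utf8(const std::uint8_t*& p) {
  std::uint8_t b0 = *p++;
  if ((b0 & 0x80) == 0) {                    // 0xxxxxxx
    return b0;
  }
  std::uint8_t b1 = *p++;
  check((b1 & 0xc0) == 0x80);                // continuation byte
  if ((b0 & 0xe0) == 0xc0) {                 // 110xxxxx 10xxxxxx
    char32_t r = (b1 & 0x3f) | ((b0 & 0x1f) << 6);
    check(r >= 0x80);                        // reject overlong form
    return r;
  }
  std::uint8_t b2 = *p++;
  check((b2 & 0xc0) == 0x80);
  if ((b0 & 0xf0) == 0xe0) {                 // 1110xxxx + 2 continuation
    char32_t r = (b2 & 0x3f) | ((b1 & 0x3f) << 6) | ((b0 & 0x0f) << 12);
    check(r >= 0x800);
    return r;
  }
  std::uint8_t b3 = *p++;
  check((b3 & 0xc0) == 0x80);
  if ((b0 & 0xf8) == 0xf0) {                 // 11110xxx + 3 continuation
    char32_t r = (b3 & 0x3f) | ((b2 & 0x3f) << 6) | ((b1 & 0x3f) << 12)
               | ((b0 & 0x07) << 18);
    check(r >= 0x10000);
    return r;
  }
  check(false);                              // b0 is not a valid lead byte
  return 0;                                  // not reached; check() throws
}
```

Note that, as in the original, the minimum-value checks only catch overlong encodings; surrogates and values above 0x10FFFF would need extra tests if they matter to you.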
Regarding the character database, the size is an issue. Can unwanted
parts be omitted? For example, I would guess that the character names
are not often used except for debugging messages and they are probably
a large part of it.
Regards, Phil.