Subject: Re: [boost] GSoC Unicode library: second preview
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-06-21 15:35:48

Mathias Gaunard wrote:
> Here is the documentation of the current state of the Unicode library
> that I am doing as a google summer of code project:

Hi Mathias,

I have looked quickly at your UTF8 code at
in comparison with mine at . The
encoding is similar, though I have avoided some code duplication (which
is probably worthwhile in an inline function) and used an IF_LIKELY
macro to enable gcc's branch hinting.
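
For readers without the code to hand, here is a rough sketch of the kind
of encoder I mean. It is not the actual code: the IF_LIKELY definition
and the name encode_utf8 are assumptions made for illustration, and the
function assumes a valid code point rather than validating its input.

```cpp
#include <cassert>
#include <cstddef>

// Assumed definition of the branch-hinting macro; the real one is not
// shown in this message. On gcc it expands to __builtin_expect.
#if defined(__GNUC__)
#  define IF_LIKELY(cond) if (__builtin_expect(!!(cond), 1))
#else
#  define IF_LIKELY(cond) if (cond)
#endif

// Hypothetical encoder: writes the UTF-8 form of c to out and returns
// the number of bytes written (1 to 4). out must have room for 4 bytes.
inline std::size_t encode_utf8(char32_t c, unsigned char* out) {
  assert(c <= 0x10FFFF);  // precondition only; surrogates are not checked
  IF_LIKELY(c < 0x80) {
    out[0] = static_cast<unsigned char>(c);
    return 1;
  }
  IF_LIKELY(c < 0x800) {
    out[0] = static_cast<unsigned char>(0xc0 | (c >> 6));
    out[1] = static_cast<unsigned char>(0x80 | (c & 0x3f));
    return 2;
  }
  IF_LIKELY(c < 0x10000) {
    out[0] = static_cast<unsigned char>(0xe0 | (c >> 12));
    out[1] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3f));
    out[2] = static_cast<unsigned char>(0x80 | (c & 0x3f));
    return 3;
  }
  out[0] = static_cast<unsigned char>(0xf0 | (c >> 18));
  out[1] = static_cast<unsigned char>(0x80 | ((c >> 12) & 0x3f));
  out[2] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3f));
  out[3] = static_cast<unsigned char>(0x80 | (c & 0x3f));
  return 4;
}
```

Each hint marks the shorter encoding as the expected case, so the
shortest sequences take the fewest branches.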

My decoding implementation is rather different from yours, though. You
explicitly determine the length of the code first and then loop, while
I do this:

   static char32_t decode(const_char8_ptr_t& p) {
     char8_t b0 = *(p++);
     IF_LIKELY((b0&0x80)==0) {            // 1 byte:  0xxxxxxx
       return b0;
     }
     char8_t b1 = *(p++);
     IF_LIKELY((b0&0xe0)==0xc0) {         // 2 bytes: 110xxxxx
       char32_t r = (b1&0x3f) | ((b0&0x1f)<<6);
       return r;
     }
     char8_t b2 = *(p++);
     IF_LIKELY((b0&0xf0)==0xe0) {         // 3 bytes: 1110xxxx
       char32_t r = (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
       return r;
     }
     char8_t b3 = *(p++);
     IF_LIKELY((b0&0xf8)==0xf0) {         // 4 bytes: 11110xxx
       char32_t r = (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);
       return r;
     }
     // (handling of an invalid lead byte is not shown here)
   }

You may find that that is faster.
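
To make the comparison concrete, here is a self-contained sketch of both
styles side by side. It is only an illustration: byte_t stands in for
char8_t, the IF_LIKELY definition is assumed, decode_length_first is my
guess at the general shape of a length-then-loop decoder rather than a
copy of your code, and both functions assume well-formed input.

```cpp
#include <cassert>

// Assumed definition of the branch-hinting macro (not shown in the post).
#if defined(__GNUC__)
#  define IF_LIKELY(cond) if (__builtin_expect(!!(cond), 1))
#else
#  define IF_LIKELY(cond) if (cond)
#endif

typedef unsigned char byte_t;  // stand-in for the char8_t typedef

// One branch per sequence length, shortest first. Advances p past the
// decoded character. Assumes well-formed UTF-8.
inline char32_t decode_branching(const byte_t*& p) {
  byte_t b0 = *(p++);
  IF_LIKELY((b0 & 0x80) == 0) {
    return b0;
  }
  byte_t b1 = *(p++);
  IF_LIKELY((b0 & 0xe0) == 0xc0) {
    return (b1 & 0x3f) | ((b0 & 0x1f) << 6);
  }
  byte_t b2 = *(p++);
  IF_LIKELY((b0 & 0xf0) == 0xe0) {
    return (b2 & 0x3f) | ((b1 & 0x3f) << 6) | ((b0 & 0x0f) << 12);
  }
  byte_t b3 = *(p++);
  assert((b0 & 0xf8) == 0xf0);  // precondition: valid 4-byte lead byte
  return (b3 & 0x3f) | ((b2 & 0x3f) << 6) | ((b1 & 0x3f) << 12) |
         ((b0 & 0x07) << 18);
}

// Hypothetical length-first variant: work out how many continuation
// bytes follow, then accumulate them in a loop.
inline char32_t decode_length_first(const byte_t*& p) {
  byte_t b0 = *(p++);
  if (b0 < 0x80) return b0;
  int extra = (b0 >= 0xf0) ? 3 : (b0 >= 0xe0) ? 2 : 1;
  char32_t r = b0 & (0x3f >> extra);  // payload bits of the lead byte
  for (int i = 0; i < extra; ++i) {
    r = (r << 6) | (*(p++) & 0x3f);
  }
  return r;
}
```

Both give the same results on valid input. Which one is actually faster
will depend on the compiler and on the mix of sequence lengths in the
data, so it would need measuring.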

Regarding the character database, the size is an issue. Can unwanted
parts be omitted? For example, I would guess that the character names
are not often used except for debugging messages and they are probably
a large part of it.

Regards, Phil.
