Subject: Re: [boost] GSoC Unicode library: second preview
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-06-21 15:35:48

Next message: Robert Kawulak: "Re: [boost] SafeInt code proposal"
Previous message: Omer Katz: "Re: [boost] SafeInt code proposal"
In reply to: Mathias Gaunard: "[boost] GSoC Unicode library: second preview"
Next in thread: Mathias Gaunard: "Re: [boost] GSoC Unicode library: second preview"
Reply: Mathias Gaunard: "Re: [boost] GSoC Unicode library: second preview"

Mathias Gaunard wrote:
> Here is the documentation of the current state of the Unicode library
> that I am doing as a google summer of code project:
> http://blogloufoque.free.fr/unicode/doc/html/

Hi Mathias,

I have looked quickly at your UTF8 code at
https://svn.boost.org/trac/boost/browser/sandbox/SOC/2009/unicode/boost/unicode/utf_codecs.hpp
in comparison with mine at
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh . The
encoding is similar, though I have avoided some code duplication (which
is probably worthwhile in an inline function) and used an IF_LIKELY
macro to enable gcc's branch hinting.
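
For readers who have not seen this kind of macro, here is a minimal sketch of how a branch-hinting IF_LIKELY could be defined and used in a UTF-8 encoder. The macro definition, the name encode_utf8 and the std::string output are assumptions made for illustration; they are not copied from libpbe or from the sandbox code, which may well differ.

```cpp
#include <cassert>
#include <string>

// Assumed definition: on gcc-compatible compilers, tell the optimiser that
// the condition is expected to be true; elsewhere, fall back to a plain if.
#if defined(__GNUC__)
#  define IF_LIKELY(cond) if (__builtin_expect(!!(cond), 1))
#else
#  define IF_LIKELY(cond) if (cond)
#endif

// Append the UTF-8 encoding of code point c to out (hypothetical helper).
// Precondition: c is at most 0x10FFFF; surrogates are not rejected here.
inline void encode_utf8(char32_t c, std::string& out) {
  assert(c <= 0x10FFFF);
  IF_LIKELY(c < 0x80) {                       // 1 byte: 0xxxxxxx
    out += static_cast<char>(c);
    return;
  }
  IF_LIKELY(c < 0x800) {                      // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (c >> 6));
    out += static_cast<char>(0x80 | (c & 0x3F));
    return;
  }
  IF_LIKELY(c < 0x10000) {                    // 3 bytes: 1110xxxx + 2 trailers
    out += static_cast<char>(0xE0 | (c >> 12));
    out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (c & 0x3F));
    return;
  }
  // 4 bytes: 11110xxx + 3 trailers
  out += static_cast<char>(0xF0 | (c >> 18));
  out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
  out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
  out += static_cast<char>(0x80 | (c & 0x3F));
}
```

Each branch is marked as likely given that the earlier, shorter cases have already been ruled out, which favours text dominated by short sequences.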

My decoding implementation is rather different from yours, though. You
explicitly determine the length of the encoded sequence first and then
loop, while I do this:

   static char32_t decode(const_char8_ptr_t& p) {
     char8_t b0 = *(p++);
     IF_LIKELY((b0&0x80)==0) {
       return b0;
     }
     char8_t b1 = *(p++);
     check((b1&0xc0)==0x80);
     IF_LIKELY((b0&0xe0)==0xc0) {
       char32_t r = (b1&0x3f) | ((b0&0x1f)<<6);
       check(r>=0x80);
       return r;
     }
     char8_t b2 = *(p++);
     check((b2&0xc0)==0x80);
     IF_LIKELY((b0&0xf0)==0xe0) {
       char32_t r = (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
       check(r>=0x800);
       return r;
     }
     char8_t b3 = *(p++);
     check((b3&0xc0)==0x80);
     IF_LIKELY((b0&0xf8)==0xf0) {
       char32_t r = (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);
       check(r>=0x10000);
       return r;
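     }
     // No valid lead-byte pattern matched: report the error instead of
     // falling off the end of a non-void function.
     check(false);
     return 0;  // not reached, assuming check() throws on failure
   }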

You may find that this is faster, since short sequences return early
without a separate length computation or loop.
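
To try the decoder in isolation, it needs definitions for the names it relies on. The following is a self-contained sketch under stated assumptions: byte_t stands in for char8_t (which is a keyword in newer C++ standards), check() is assumed to throw on malformed input, IF_LIKELY is assumed to wrap gcc's __builtin_expect, and decode_all is a hypothetical driver added for illustration. None of these definitions are taken from libpbe.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

#if defined(__GNUC__)
#  define IF_LIKELY(cond) if (__builtin_expect(!!(cond), 1))
#else
#  define IF_LIKELY(cond) if (cond)
#endif

typedef unsigned char byte_t;  // stand-in for the char8_t used in the post

// Assumed behaviour: reject malformed input by throwing.
inline void check(bool ok) {
  if (!ok) throw std::runtime_error("invalid UTF-8");
}

// Decode one code point starting at p and advance p past it.
inline char32_t decode(const byte_t*& p) {
  byte_t b0 = *(p++);
  IF_LIKELY((b0 & 0x80) == 0) return b0;             // 1-byte (ASCII)
  byte_t b1 = *(p++);
  check((b1 & 0xc0) == 0x80);                        // must be a trailer
  IF_LIKELY((b0 & 0xe0) == 0xc0) {                   // 2-byte sequence
    char32_t r = (b1 & 0x3f) | ((b0 & 0x1f) << 6);
    check(r >= 0x80);                                // reject overlong form
    return r;
  }
  byte_t b2 = *(p++);
  check((b2 & 0xc0) == 0x80);
  IF_LIKELY((b0 & 0xf0) == 0xe0) {                   // 3-byte sequence
    char32_t r = (b2 & 0x3f) | ((b1 & 0x3f) << 6) | ((b0 & 0x0f) << 12);
    check(r >= 0x800);
    return r;
  }
  byte_t b3 = *(p++);
  check((b3 & 0xc0) == 0x80);
  IF_LIKELY((b0 & 0xf8) == 0xf0) {                   // 4-byte sequence
    char32_t r = (b3 & 0x3f) | ((b2 & 0x3f) << 6) | ((b1 & 0x3f) << 12)
               | ((b0 & 0x07) << 18);
    check(r >= 0x10000);
    return r;
  }
  check(false);                                      // invalid lead byte
  return 0;                                          // not reached
}

// Hypothetical driver: decode a NUL-terminated byte buffer.
inline std::vector<char32_t> decode_all(const byte_t* p) {
  std::vector<char32_t> out;
  while (*p) out.push_back(decode(p));
  return out;
}
```

With a NUL-terminated buffer, a truncated sequence fails the trailer check at the terminator, so the decoder does not read past the end. As in the code above, this sketch rejects overlong forms but does not reject surrogate code points or values above 0x10FFFF.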

Regarding the character database, the size is an issue. Can unwanted
parts be omitted? For example, I would guess that the character names
are not often used except for debugging messages and they are probably
a large part of it.

Regards, Phil.
