Subject: Re: [boost] GSoC Unicode library: second preview
From: Artyom (artyomtnk_at_[hidden])
Date: 2009-06-20 12:11:08


Hello,

> Here is the documentation of the
> current state of the Unicode library that I am doing as a
> google summer of code project:
> http://blogloufoque.free.fr/unicode/doc/html/
[snip]

Where is the source code?

....

Some notes:

> UTF-16 ... This is the recommended encoding for dealing with
> Unicode internally for general purposes

To be honest, it is the most error-prone encoding for working with Unicode:

1. It is a variable-length encoding.
2. Surrogate characters are quite rare, and thus it is very
   hard to find bugs related to them.

It was mostly born as a "mistake" at the beginning of Unicode,
when it was believed that 16 bits were enough for a single code point.
So many software platforms adopted a 16-bit encoding that supported
only the BMP. As a result you can **easily** find a **huge** number of
bugs in code that uses UTF-16. In most cases such bugs
are hard to track down because these code points are rare.

For example, try to edit a file name in Windows that contains a character
outside the BMP: you will see that you need to press "delete" twice. Try
to write such a character in a Qt3 application... that would just not work.
There are many examples of this.
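
To make the "delete twice" problem concrete, here is a minimal sketch,
assuming a C++11 compiler (for char16_t and std::u16string). The function
names are hypothetical and are not taken from the GSoC library. It contrasts
a delete that removes one code unit with one that removes a whole code point:

```cpp
#include <string>

// True if `u` is a high (lead) surrogate, U+D800..U+DBFF.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if `u` is a low (trail) surrogate, U+DC00..U+DFFF.
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Naive delete: removes one code unit, as BMP-only code tends to do.
// On a non-BMP character this leaves a lone high surrogate behind.
inline void naive_delete_last(std::u16string& s) {
    if (!s.empty()) s.pop_back();
}

// Surrogate-aware delete: removes the whole last code point,
// i.e. both code units when the string ends in a surrogate pair.
inline void delete_last_code_point(std::u16string& s) {
    if (s.empty()) return;
    char16_t last = s.back();
    s.pop_back();
    if (is_low_surrogate(last) && !s.empty() && is_high_surrogate(s.back()))
        s.pop_back();
}
```

With s = u"a\U0001D11E" (three code units), naive_delete_last leaves "a"
followed by an unpaired high surrogate, while delete_last_code_point
leaves just "a".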

So I would be wary of recommending this encoding as the internal encoding
just because many platforms use it.

> UTF-32 ... This encoding isn't really recommended

As I mentioned above, this is not quite true; it is a much safer encoding
to work with.

So I would recommend not writing such "suggestions".
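
As an illustration of the fixed-width property, here is a rough sketch
(again assuming C++11; utf16_to_utf32 is a hypothetical helper, not part of
the library under review). After decoding, every element of the result is
exactly one code point, so indexing and size() work per code point:

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 into UTF-32. Unpaired surrogates are replaced by U+FFFD
// (one possible policy, chosen here only to keep the sketch short).
inline std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        bool high = (u >= 0xD800 && u <= 0xDBFF);
        if (high && i + 1 < in.size()
                 && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine the surrogate pair into one code point.
            char32_t lo = in[i + 1];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
            ++i;  // skip the low surrogate that was just consumed
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            out.push_back(0xFFFD);  // lone surrogate
        } else {
            out.push_back(u);       // BMP code point, copied as-is
        }
    }
    return out;
}
```

For the input u"a\U0001D11Eb" the UTF-16 string has four code units, while
the decoded string has three elements, the middle one being 0x1D11E.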

More notes:
-----------

- For boundary checks I'd suggest using an ICU- or Qt4-like API: iterate
  over the string and return the next boundary each time, rather than
  checking whether there is a boundary at a specific character.

- Examples and more description are required.
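
To show what I mean by the iterate-and-return-next-boundary style, here is
a minimal sketch. The class boundary_finder is hypothetical (it is neither
ICU's nor Qt's API, nor the proposed library's), and for brevity it only
finds code point boundaries in UTF-16 text; a real implementation would
apply grapheme, word, or line break rules in next():

```cpp
#include <cstddef>
#include <string>

// Hypothetical iterator-style boundary finder over UTF-16 text.
class boundary_finder {
public:
    explicit boundary_finder(const std::u16string& text)
        : text_(text), pos_(0) {}

    // Advance to the next boundary and return its offset in code units,
    // or std::u16string::npos when the end of the text has been reached.
    std::size_t next() {
        if (pos_ >= text_.size()) return std::u16string::npos;
        char16_t u = text_[pos_++];
        // A high surrogate followed by a low surrogate is one code point,
        // so the boundary falls after the pair, not inside it.
        if (u >= 0xD800 && u <= 0xDBFF && pos_ < text_.size()
            && text_[pos_] >= 0xDC00 && text_[pos_] <= 0xDFFF)
            ++pos_;
        return pos_;
    }

private:
    std::u16string text_;  // copy of the text being scanned
    std::size_t pos_;      // offset of the most recently returned boundary
};
```

The caller just loops until next() returns npos; for u"a\U0001D11Eb" the
boundaries come out as 1, 3, 4. The state is kept in the iterator, so each
step only does the work for the following segment.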

Artyom
