Subject: Re: [boost] GSoC Unicode library: second preview
From: Artyom (artyomtnk_at_[hidden])
Date: 2009-06-20 12:11:08
> Here is the documentation of the
> current state of the Unicode library that I am doing as a
> google summer of code project:
Where is the source code?
> UTF-16 ... This is the recommended encoding for dealing with
> Unicode internally for general purposes
To be honest, it is the most error-prone encoding for working with Unicode:
1. It is a variable-length encoding.
2. Surrogate characters are quite rare, and thus it is very
hard to find bugs related to them.
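To make both points concrete, here is a minimal sketch, assuming a
C++11 compiler with std::u16string. The helper names append_utf16 and
count_code_points are hypothetical and not part of the proposed library.
A BMP code point takes one 16-bit unit, anything above U+FFFF takes a
surrogate pair, so size() stops matching the number of code points:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
inline void append_utf16(std::u16string& out, char32_t cp)
{
    // Precondition: a valid scalar value (not a surrogate, not above U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // BMP: a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical helper: count code points, treating a high surrogate
// followed by a low surrogate as one code point.
inline std::size_t count_code_points(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (high && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i; // skip the low surrogate that completes the pair
        }
        ++n;
    }
    return n;
}
```

Code that only ever sees BMP text never takes the second branch, which
is why this kind of bug can stay hidden for a long time.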
It was mostly born as a "mistake" at the beginning of Unicode,
when it was believed that 16 bits were enough for a single code point.
So many software platforms adopted a 16-bit encoding that supported
only the BMP. As a result you can **easily** find a **huge** number of
bugs in code that uses UTF-16. In most cases such bugs
are hard to track down because these code points are rare.
For example, try to edit a file name in Windows containing a character
that is not in the BMP: you will see that you need to press "delete"
twice. Try to write such a character in a Qt3 application... that would
just not work. There are many examples of this.
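The "delete twice" symptom is what you would expect from code that
removes one 16-bit unit per key press. I have not seen the actual
Windows or Qt3 code, so the following is only a sketch of the general
pattern, assuming C++11 and using hypothetical function names:

```cpp
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Naive version: removes one code unit. For a non-BMP character this
// leaves a lone high surrogate behind, so a second "delete" is needed.
inline void erase_last_unit(std::u16string& s)
{
    if (!s.empty()) s.pop_back();
}

// Surrogate-aware version: removes a whole code point.
inline void erase_last_code_point(std::u16string& s)
{
    if (s.empty()) return;
    const bool low = is_low_surrogate(s.back());
    s.pop_back();
    // If we just removed the second half of a pair, remove the first half too.
    if (low && !s.empty() && is_high_surrogate(s.back())) s.pop_back();
}
```

With a std::u32string the naive pop_back() is already correct at the
code point level, because every element is a whole code point.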
So, I would be wary of recommending this encoding as the internal
encoding just because many platforms use it.
> UTF-32 ... This encoding isn't really recommended
As I mentioned above, this is not quite true: it is a much safer encoding
to work with.
So I would recommend not writing such "suggestions".
- For boundary checks I'd suggest using an ICU- or Qt4-like API: iterate
over the string and return the next boundary each time, rather than
checking whether there is a boundary at a specific character.
- Examples and more description are required.
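To illustrate the iteration style I mean for boundary checks (compare
BreakIterator::next() in ICU or QTextBoundaryFinder::toNextBoundary()
in Qt4), here is a rough sketch. The class boundary_walker and the
function all_boundaries are hypothetical names, C++11 is assumed, and
for brevity it only finds code point boundaries in UTF-16, not
grapheme or word boundaries:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class: walks a UTF-16 string and reports each boundary
// in turn, instead of answering "is there a boundary at position i?".
class boundary_walker
{
public:
    // Returned by next() when there are no more boundaries.
    static constexpr std::size_t done = static_cast<std::size_t>(-1);

    // The string must outlive the walker (only a reference is kept).
    explicit boundary_walker(const std::u16string& text)
        : text_(text), pos_(0) {}

    // Advance to the next boundary and return its offset in code units.
    std::size_t next()
    {
        if (pos_ >= text_.size()) return done;
        const char16_t u = text_[pos_++];
        // A high surrogate followed by a low surrogate is one unit of text.
        if (u >= 0xD800 && u <= 0xDBFF && pos_ < text_.size()
            && text_[pos_] >= 0xDC00 && text_[pos_] <= 0xDFFF) {
            ++pos_;
        }
        return pos_;
    }

private:
    const std::u16string& text_;
    std::size_t pos_;
};

// Usage: collect every boundary after the implicit one at offset 0.
inline std::vector<std::size_t> all_boundaries(const std::u16string& text)
{
    std::vector<std::size_t> out;
    boundary_walker w(text);
    for (std::size_t b = w.next(); b != boundary_walker::done; b = w.next())
        out.push_back(b);
    return out;
}
```

The caller never has to guess which offsets are safe to test, which is
the main advantage over a "check this position" interface.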