Subject: Re: [boost] GSoC Unicode library: second preview
From: Artyom (artyomtnk_at_[hidden])
Date: 2009-06-23 03:56:13


> Of course, the library works with UTF-8 and UTF-32 just as
> well, it makes no difference to the generic algorithms
> (which don't exist yet, but expect substring searching and
> the like), it's up to you to choose what makes the most
> sense to use for your situation (for example, you may choose
> to use UTF-8 because you need to interact a lot with
> programming interfaces expecting that format).
>

Ok, this is really good.

> They should be fairly easy to find.
> Either you're using the algorithm that does the task
> correctly, or you're fiddling with the encoding by hand
> which is likely to be wrong.

They are easy to find in Unicode-aware unit tests, but not in real
programs. I once ran a small test to see which Unicode-aware programs
support characters outside the BMP, i.e. I tested a character that is
encoded as a surrogate pair in UTF-16...

The results were a total disaster:

- Windows standard dialogs: displayed the character correctly, but
  every operation such as deletion treated it as two characters. For
  example, the file name dialog had problems.
- Notepad and the other standard text-area widgets showed the same
  behavior and didn't work correctly.
- Qt3 didn't support surrogate pairs at all and displayed two square
  "glyphs" (in Qt4 most of this was fixed).
- The Opera web browser had similar problems with editing and
  displaying such characters.
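
To illustrate the kind of bug behind these results, here is a minimal
sketch of what goes wrong when editing code treats one UTF-16 code unit
as one character. The function names (encode_utf16, naive_backspace,
backspace, is_well_formed) are hypothetical and made up for this
example; they are not taken from any of the programs above or from the
proposed library, and I am only guessing that the real bugs look
roughly like this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
std::u16string encode_utf16(char32_t cp) {
    // Only valid scalar values: no surrogates, nothing above U+10FFFF.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // BMP character: a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary character: high surrogate + low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Naive backspace: removes the last code unit, which may split a pair.
void naive_backspace(std::u16string& s) {
    if (!s.empty()) s.pop_back();
}

// Surrogate-aware backspace: removes the last whole code point.
void backspace(std::u16string& s) {
    if (s.empty()) return;
    char16_t last = s.back();
    s.pop_back();
    if (is_low_surrogate(last) && !s.empty() && is_high_surrogate(s.back()))
        s.pop_back();
}

// True if the string contains no unpaired surrogates.
bool is_well_formed(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 >= s.size() || !is_low_surrogate(s[i + 1])) return false;
            ++i;  // skip the low surrogate that completes the pair
        } else if (is_low_surrogate(s[i])) {
            return false;
        }
    }
    return true;
}
```

For U+1D11E, encode_utf16 yields the two code units 0xD834 0xDD1E.
Calling naive_backspace once on a string ending in that character
leaves a lone high surrogate behind, so the user has to press the key
twice and the text is ill-formed in between. The surrogate-aware
version removes both units in one step.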

So... there is a huge problem with this encoding: such a simple
QA test shouldn't give such bad results for so many programs.

Also, all the programs that used UTF-8 or UTF-32 internally passed
these tests very well.

So I really **do not** suggest recommending this encoding as the "best"
one for internal use.

Artyom

      

