Subject: Re: [boost] GSoC Unicode library: second preview
From: Scott McMurray (me22.ca+boost_at_[hidden])
Date: 2009-06-20 15:22:55


2009/6/20 Artyom <artyomtnk_at_[hidden]>
>
> > UTF-16 ... This is the recommended encoding for dealing with
> > Unicode internally for general purposes
>
> To be honest, it is most error prone encoding to work with Unicode:
>

Amen.

Really, I don't see why people don't just use UTF-8 all over the
place. Even UTF-32 isn't as convenient as most would like, since you
still have combining code points and other similar complications.

As a programmer, what I really care about is usually some nebulous
concept of "characters", and one character can easily be 3 code points
or 1/3 of a code point.
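
To make the mismatch between "characters", code points and code units
concrete, here is a minimal Python sketch. The sample strings and the
`code_units` helper are illustrative assumptions of mine, not part of
any Boost proposal.

```python
import unicodedata

# One user-perceived character, "é", written as a base letter
# followed by a combining acute accent: two code points.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# A code point outside the Basic Multilingual Plane
# (U+1D11E MUSICAL SYMBOL G CLEF).
clef = "\U0001D11E"


def code_units(text, encoding):
    """Count the code units `text` occupies in `encoding` (hypothetical helper)."""
    width = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // width


# len() on a Python 3 str counts code points, not "characters".
code_point_counts = (len(decomposed), len(composed), len(clef))

# The same single code point needs a different number of code units
# in each encoding form.
utf8_units = code_units(clef, "utf-8")
utf16_units = code_units(clef, "utf-16-le")
utf32_units = code_units(decomposed, "utf-32-le")
```

Even in UTF-32 the decomposed "é" still takes two units, so a fixed-width
encoding alone does not make indexing by "character" safe.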

It feels like the only way to get Unicode string handling right (at
the application level, not the library or rendering levels) is to deal
entirely in strings and regexes.

Suppose I have "difficult" with the "ffi" ligature code point, and I do
a Perl-style split on /i/. I should probably be getting "d", the "ff"
ligature code point, and "cult". I know if I tried to code that by
hand in every application I'd miss all kinds of evil corner cases like
that.
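
A rough Python sketch of that split, assuming compatibility normalization
(NFKC) is an acceptable way to expose the "i" hidden inside the ligature.
The `split_compat` helper is hypothetical, and this is only one possible
approach, not what any particular regex engine does by default.

```python
import re
import unicodedata

# "difficult" spelled with U+FB03 LATIN SMALL LIGATURE FFI.
word = "di\uFB03cult"

# A naive split only sees the standalone "i"; the one inside the
# ligature is invisible to the pattern.
naive_parts = re.split("i", word)


def split_compat(pattern, text):
    """Split after NFKC folding (hypothetical helper, assumption only)."""
    # NFKC replaces the ligature with its compatibility decomposition,
    # the three plain letters "f", "f", "i".
    folded = unicodedata.normalize("NFKC", text)
    return re.split(pattern, folded)


folded_parts = split_compat("i", word)
```

Here `naive_parts` is `["d", "\ufb03cult"]`, while `folded_parts` is
`["d", "ff", "cult"]`. Note that the middle piece is two plain "f" letters,
not the U+FB00 "ff" ligature code point; recomposing a ligature from the
leftovers would be an extra step that this sketch does not attempt.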

