From: Darin Adler (darin_at_[hidden])
Date: 2000-08-23 10:10:30


on 8/23/00 7:39 AM, David Abrahams at abrahams_at_[hidden] wrote:

> Just one question: can someone fill me in on UTF-8? What is it, and what are
> the differences from ASCII? Is it just 7-bit ASCII plus one extra bit?

There's a good FAQ at <http://www.cl.cam.ac.uk/~mgk25/unicode.html>.

Here are two ways to look at UTF-8.

One way to look at it is as a variable-width encoding for the ISO standard
character set ISO 10646 (the ISO standard related to Unicode -- see the FAQ
above for more precise details). It lets you encode any of the UCS codes as
a sequence of 1-6 bytes instead of using fixed-width sequences (as in the
UCS-2 or UCS-4 encoding). The UTF-8 encoding is designed so that the bytes
0x00-0x7F are used only to encode the corresponding ASCII characters and
never appear as any byte of a multi-byte sequence.
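To make the variable-width idea concrete, here is a minimal sketch of an encoder for a single character code. The function name encode_utf8 is made up for illustration and is not part of any Boost library. It covers only the 1-4 byte forms, which are enough for codes up to 0x10FFFF; the 5- and 6-byte forms in the original definition follow the same pattern and are left out here.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one character code as UTF-8 bytes.
// Returns an empty string for values this sketch does not handle
// (surrogates 0xD800-0xDFFF and anything above 0x10FFFF).
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx, identical to ASCII.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out; // surrogates are not valid on their own
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every byte of a multi-byte sequence has its high bit set, which is why the values 0x00-0x7F can only ever mean the ASCII character itself.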

Another way to look at UTF-8 is as a way to handle a larger character set
without changing your program much. If you have a program that already
handles strings of 8-bit bytes, and the program doesn't assume that the
boundaries between bytes are also the boundaries between characters, you may
be able to use the UTF-8 encoding and change very little of the program
instead of revising the program to use wider characters.
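As an illustration of the "change very little" case, here is a sketch of a byte-oriented routine that splits a string on an ASCII delimiter. The name split_on_ascii is hypothetical. The code knows nothing about UTF-8, yet it should work unchanged on UTF-8 text, because a byte in the range 0x00-0x7F never occurs inside a multi-byte sequence. The assumption is that the delimiter itself is an ASCII character.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a string on an ASCII delimiter, working
// purely on bytes. Safe for UTF-8 input as long as delim is in the
// range 0x00-0x7F, since such a byte cannot be part of a multi-byte
// sequence.
std::vector<std::string> split_on_ascii(const std::string& utf8, char delim)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = utf8.find(delim, start);
        if (pos == std::string::npos) {
            // No more delimiters: the rest of the string is the last field.
            fields.push_back(utf8.substr(start));
            break;
        }
        fields.push_back(utf8.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

The same routine would not be safe with an older multi-byte encoding in which ASCII values can appear as trailing bytes.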

There's some confusion about whether a program that uses UTF-8 is "really"
supporting the character set. My understanding is that the choice of
encoding (UCS-2, UCS-4, or UTF-8) neither guarantees nor prevents proper
handling of characters. Some operations are more difficult with
variable-width encoding, while others are more difficult with fixed-width
encoding.
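One operation that gets harder with a variable-width encoding is counting characters, since the byte count no longer gives the answer. A minimal sketch, assuming the input is well-formed UTF-8 (the name count_code_points is made up for illustration, and no validation is done):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the encoded characters in a UTF-8 string
// by counting every byte that is not a continuation byte (10xxxxxx).
// Assumes well-formed input.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n; // an ASCII byte or the first byte of a sequence
        }
    }
    return n;
}
```

With a fixed-width encoding this is a constant-time size lookup; here it takes a pass over the whole string, and finding the Nth character has the same cost.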

    -- Darin

