
From: Karl Nelson (kenelson_at_[hidden])
Date: 2000-08-23 11:40:47

Next message: Jens Maurer: "Re: [boost] End of Review: Regular Expression Library"
Previous message: Darin Adler: "Re: [boost] Re: formatting manipulator for UTF-8"
In reply to: David Abrahams: "Re: [boost] Re: formatting manipulator for UTF-8"
Next in thread: Dietmar Kuehl: "Re: [boost] Re: formatting manipulator for UTF-8"

> Just one question: can someone fill me in on UTF-8? What is it, and what are
> the differences from ASCII? Is it just 7-bit ASCII plus one extra bit?

Darin already answered this, but here is the chart for those who are lazy.

-------------------------------------------------------------------------

Common formats
--------------
UCS-2 - 16 bit (wchar_t) format
UCS-4 - 32 bit (int) format
UTF - UCS Transformation Format (packs 32 bits into 8 bit multilength code)
UTF-1 - (obsolete) Intended for sending through 7 bit mail exchangers
UTF-7 - a 7-bit shift code for mail exchangers (reuses ASCII)
UTF-8 - Transform from UCS-4 to 8 bit code.
UTF-16 - 16 bit code for packing UCS-4 into UCS-2 (using reserved codes)

Overview of UTF-8.
-------------------

The transform is defined as follows.

UCS-4 range (hex.)    UTF-8 octet sequence (binary)
0000 0000-0000 007F   0xxxxxxx
0000 0080-0000 07FF   110xxxxx 10xxxxxx
0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

Each UCS-4 code should be packed into the shortest sequence that can hold it.
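
As a rough illustration of the table and the shortest-form rule, here is a
minimal sketch of an encoder for a single code. The name encode_utf8 is made
up for this example and is not part of any library discussed in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the original post): encode one UCS-4 code
// as UTF-8 following the table above, always choosing the shortest form.
std::string encode_utf8(std::uint32_t ucs4)
{
    assert(ucs4 <= 0x7FFFFFFFu);  // the table only covers 31 bits

    // Number of octets: the first row of the table whose range holds the code.
    const int len = ucs4 < 0x80u      ? 1
                  : ucs4 < 0x800u     ? 2
                  : ucs4 < 0x10000u   ? 3
                  : ucs4 < 0x200000u  ? 4
                  : ucs4 < 0x4000000u ? 5 : 6;

    if (len == 1)
        return std::string(1, static_cast<char>(ucs4));  // plain ASCII

    std::string out(len, '\0');
    // Trailing octets, filled from the end: 10xxxxxx, six payload bits each.
    for (int i = len - 1; i > 0; --i) {
        out[i] = static_cast<char>(0x80u | (ucs4 & 0x3Fu));
        ucs4 >>= 6;
    }
    // Lead octet: len one-bits, a zero bit, then the remaining payload bits.
    const unsigned lead_mark = (0xFF00u >> len) & 0xFFu;
    out[0] = static_cast<char>(lead_mark | ucs4);
    return out;
}
```

For example, code 20AC (hex) falls in the third row, so it comes out as three
octets, while anything below 80 (hex) stays a single ASCII octet.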

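UTF-8 is noteworthy because it never reuses an ASCII code inside a
multi-octet sequence (every octet of such a sequence has its high bit set),
and because it is possible to tell in mid stream whether we are on an
extension byte (those always start with the bits 10).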
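
The mid-stream property can be sketched in a few lines. Both function names
below are invented for this example; the only thing taken from the table is
that extension octets have the form 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers (names are not from the original post).

// True for octets of the form 10xxxxxx, i.e. any octet of a sequence
// other than the first one.
inline bool is_continuation(unsigned char octet)
{
    return (octet & 0xC0u) == 0x80u;
}

// Starting from an arbitrary octet position, back up to the first octet
// of the character that contains it. Assumes pos < s.size().
inline std::size_t sequence_start(const std::string& s, std::size_t pos)
{
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])))
        --pos;
    return pos;
}
```

A reader dropped into the middle of a stream can therefore resynchronize by
skipping at most a handful of octets, without decoding from the beginning.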

RFC
  UTF-8 http://www.faqs.org/rfcs/rfc2279.html
  UTF-7 http://www.faqs.org/rfcs/rfc2152.html
  UTF-16 http://www.faqs.org/rfcs/rfc2781.html

-----

This is why my format was for i18n and was only 8 bits.
A user whose device is UTF-8 would treat it like a
basic_stream<char>, and whenever they had a wchar_t string
they would just encode it.

cout << "hello " << to_utf8(wstring(L"bob"));

The same goes for input.

  string s;
  cin >> s;
  wstring name=from_utf8(s);
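
Neither to_utf8 nor from_utf8 is defined in this message, so the following is
only a guess at what such a pair could look like, following the table above.
It assumes every wchar_t holds one whole UCS code (no UTF-16 surrogate pairs)
and that the input to from_utf8 is well formed; a real version would need to
reject bad sequences.

```cpp
#include <string>

// Sketch only: hypothetical bodies for the to_utf8/from_utf8 names used above.

std::string to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::wstring::size_type k = 0; k < in.size(); ++k) {
        const unsigned long c = static_cast<unsigned long>(in[k]);
        // Shortest sequence length for this code, per the table.
        const int len = c < 0x80ul ? 1 : c < 0x800ul ? 2 : c < 0x10000ul ? 3
                      : c < 0x200000ul ? 4 : c < 0x4000000ul ? 5 : 6;
        if (len == 1) {
            out += static_cast<char>(c);
            continue;
        }
        // Lead octet: len one-bits, a zero bit, then the top payload bits.
        out += static_cast<char>(((0xFF00u >> len) & 0xFFu)
                                 | (c >> (6 * (len - 1))));
        // Trailing octets 10xxxxxx, most significant payload bits first.
        for (int i = len - 2; i >= 0; --i)
            out += static_cast<char>(0x80u | ((c >> (6 * i)) & 0x3Fu));
    }
    return out;
}

std::wstring from_utf8(const std::string& in)
{
    std::wstring out;
    std::string::size_type k = 0;
    while (k < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[k++]);
        // Count the leading one-bits of the first octet: 0 means ASCII,
        // otherwise the count is the total length of the sequence.
        int len = 0;
        while (len < 6 && (lead & (0x80u >> len)))
            ++len;
        // Payload bits of the lead octet, then six bits per trailing octet.
        unsigned long c = lead & (0xFFu >> (len + 1));
        for (int i = 1; i < len && k < in.size(); ++i)
            c = (c << 6) | (static_cast<unsigned char>(in[k++]) & 0x3Fu);
        out += static_cast<wchar_t>(c);
    }
    return out;
}
```

With these, pure ASCII text passes through unchanged in both directions, which
is what lets a UTF-8 device keep using the ordinary char streams.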

Of course, having a wide version of format isn't going to hurt,
as it would still solve the three problems that matter there:

  - prevent broken text strings
  - allow reordering
  - allow non-sticky formatting

Hope that brings you up to speed.

--Karl

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk