From: Karl Nelson (kenelson_at_[hidden])
Date: 2000-08-23 11:40:47

> Just one question: can someone fill me in on UTF-8? What is it, and what are
> the differences from ASCII? Is it just 7-bit ASCII plus one extra bit?

Darin already answered this, but here is the chart for those who are lazy.


Common formats
UCS-2 - 16 bit format (wchar_t on some platforms)
UCS-4 - 32 bit format (fits in a 32 bit int)
UTF - UCS Transformation Format (packs UCS codes into a multi-length code)
UTF-1 - (obsolete) early multi-byte transform, superseded by UTF-8
UTF-7 - a 7-bit shift code for mail exchangers (reuses ASCII)
UTF-8 - Transform from UCS-4 to 8 bit code.
UTF-16 - 16 bit code for packing UCS-4 into 16 bit units (using reserved surrogate codes)

Overview of UTF-8.

The transform is defined as follows.

UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx

Each UCS-4 code should be packed into the shortest sequence that can hold it.
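
To make the table concrete, here is a minimal sketch of an encoder for a
single UCS-4 value. The name encode_utf8 is made up for illustration and is
not part of any proposed library. It follows the six-row table above as
given; note that the later definition of UTF-8 (RFC 3629) stops at 4 octets
and U+10FFFF, which this sketch does not enforce.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one UCS-4 value per the table above.
// Always picks the shortest sequence that can hold the value.
// Returns an empty string for values above 0x7FFFFFFF.
std::string encode_utf8(std::uint32_t c) {
    std::string out;
    if (c < 0x80) {                 // 0xxxxxxx: plain ASCII, one octet
        out += static_cast<char>(c);
        return out;
    }
    int extra;                      // number of 10xxxxxx octets that follow
    unsigned char lead;             // marker bits of the first octet
    if      (c < 0x800)        { extra = 1; lead = 0xC0; }
    else if (c < 0x10000)      { extra = 2; lead = 0xE0; }
    else if (c < 0x200000)     { extra = 3; lead = 0xF0; }
    else if (c < 0x4000000)    { extra = 4; lead = 0xF8; }
    else if (c <= 0x7FFFFFFF)  { extra = 5; lead = 0xFC; }
    else return out;                // not representable
    // First octet: marker bits plus the topmost payload bits.
    out += static_cast<char>(lead | (c >> (6 * extra)));
    // Remaining octets: 6 payload bits each, high bits set to 10.
    for (int i = extra - 1; i >= 0; --i)
        out += static_cast<char>(0x80 | ((c >> (6 * i)) & 0x3F));
    return out;
}
```

For example, the value 0x20AC falls in the third row of the table and comes
out as the three octets E2 82 AC.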

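UTF-8 is noteworthy because it never reuses ASCII codes inside a multi-byte
sequence (every byte of such a sequence has its high bit set), and because
extension bytes always start with the bits 10, so it is possible to tell in
mid stream whether we are in an extension byte.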
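
The mid-stream property can be sketched in a few lines. The helper names
below (is_extension_byte, next_char_start) are hypothetical, chosen only for
this illustration.

```cpp
#include <cstddef>
#include <string>

// True for octets of the form 10xxxxxx, i.e. the second or later
// octet of a multi-byte sequence.
bool is_extension_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: starting at pos, skip forward over extension
// bytes and return the index where the next character begins
// (or s.size() if there is none).
std::size_t next_char_start(const std::string& s, std::size_t pos) {
    while (pos < s.size() &&
           is_extension_byte(static_cast<unsigned char>(s[pos])))
        ++pos;
    return pos;
}
```

A reader dropped at an arbitrary offset can therefore resynchronize by
skipping at most a handful of bytes, without rescanning from the start.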



This is why my format was for i18n and was only 8 bits.
The user whose device is UTF-8 would deal with it like a
basic_stream<char>, but whenever they had a wchar_t string
they would just encode it.

  cout << "hello " << to_utf8(wstring(L"bob"));

The same goes for input.
  string s;
  cin >> s;
  wstring name=from_utf8(s);
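
The to_utf8 and from_utf8 functions used above are not spelled out in this
post. A rough sketch of what they might look like follows; it assumes each
wchar_t holds one whole UCS code (no surrogate pairs) and that the input to
from_utf8 is well-formed, so it does no error checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch only: assumes every wchar_t is a complete UCS code in the
// range 0..0x7FFFFFFF.
std::string to_utf8(const std::wstring& in) {
    static const unsigned char lead[] = {0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t c = static_cast<std::uint32_t>(in[i]);
        if (c < 0x80) { out += static_cast<char>(c); continue; }
        // Number of 10xxxxxx octets needed after the first octet.
        int extra = c < 0x800 ? 1 : c < 0x10000 ? 2
                  : c < 0x200000 ? 3 : c < 0x4000000 ? 4 : 5;
        out += static_cast<char>(lead[extra] | (c >> (6 * extra)));
        for (int k = extra - 1; k >= 0; --k)
            out += static_cast<char>(0x80 | ((c >> (6 * k)) & 0x3F));
    }
    return out;
}

// Sketch only: assumes well-formed UTF-8; malformed input yields
// unspecified characters rather than an error.
std::wstring from_utf8(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        std::uint32_t c;
        int extra;  // extension octets announced by the first octet
        if      (b < 0x80) { c = b;        extra = 0; }
        else if (b < 0xE0) { c = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { c = b & 0x0F; extra = 2; }
        else if (b < 0xF8) { c = b & 0x07; extra = 3; }
        else if (b < 0xFC) { c = b & 0x03; extra = 4; }
        else               { c = b & 0x01; extra = 5; }
        // Append 6 payload bits from each extension octet.
        for (; extra > 0 && i < in.size(); --extra)
            c = (c << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        out += static_cast<wchar_t>(c);
    }
    return out;
}
```

On a platform with a 16 bit wchar_t, codes above 0xFFFF would need
surrogate handling, which is left out here.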

Of course, having a wide version of format isn't going to hurt,
as it would still solve the same 3 problems there:

  - prevent broken text strings
  - allow reordering
  - allow non-sticky formatting

Hope that brings you up to speed.

