From: Karl Nelson (kenelson_at_[hidden])
Date: 2000-08-23 11:40:47
> Just one question: can someone fill me in on UTF-8? What is it, and what are
> the differences from ASCII? Is it just 7-bit ASCII plus one extra bit?
Darin already answered this, but here is the chart for those who are lazy.
-------------------------------------------------------------------------
Common formats
--------------
UCS-2 - 16 bit (wchar_t) format
UCS-4 - 32 bit (int) format
UTF - UCS Transformation Format (packs a 32 bit code into a variable-length sequence of smaller units)
UTF-1 - (obsolete) early multibyte 8 bit code, superseded by UTF-8
UTF-7 - a 7-bit shift code for mail exchangers (reuses ASCII)
UTF-8 - Transform from UCS-4 to 8 bit code.
UTF-16 - 16 bit code for packing UCS-4 into UCS-2 (using reserved codes)
Overview of UTF-8.
-------------------
The transform is defined as follows.
UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
Each UCS-4 code must be packed into the shortest sequence that can hold it.
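
For those who would rather read code than a chart, here is a rough sketch of the packing rule. The name encode_utf8 is made up for illustration and is not part of any proposed library. It assumes the six-row RFC 2279 form of the table and does no checking beyond the range assert.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one UCS-4 code into the shortest UTF-8
// sequence that can hold it, following the six-row chart.
std::string encode_utf8(std::uint32_t c) {
    assert(c <= 0x7FFFFFFFu);  // the chart only covers 31 bits
    std::string out;
    if (c < 0x80) {            // plain ASCII: a single 0xxxxxxx byte
        out += static_cast<char>(c);
        return out;
    }
    // Sequence length: the first row of the chart whose range contains c.
    int n = c < 0x800 ? 2 : c < 0x10000 ? 3 : c < 0x200000 ? 4
          : c < 0x4000000 ? 5 : 6;
    // Lead byte: n one-bits, a zero bit, then the top payload bits.
    unsigned char mark = static_cast<unsigned char>(0xFF << (8 - n));
    out += static_cast<char>(mark | (c >> (6 * (n - 1))));
    // Extension bytes: 10xxxxxx, six payload bits each, high bits first.
    for (int i = n - 2; i >= 0; --i)
        out += static_cast<char>(0x80 | ((c >> (6 * i)) & 0x3F));
    return out;
}
```

Because the length is picked from the first matching row, the result is always the smallest available form.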
UTF-8 is noteworthy because bytes in the ASCII range (00-7F) never
appear inside a multibyte sequence, and it is possible to tell in mid
stream whether we are in an extension byte (those always look like 10xxxxxx).
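
The mid-stream property can be sketched in a few lines. Both function names here are hypothetical, chosen only for this example. The idea is that a reader dropped at an arbitrary byte offset can skip extension bytes until it reaches the start of a character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for an extension byte, i.e. one of the form 10xxxxxx.
bool is_extension_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Starting from an arbitrary byte offset, move forward to the next
// offset that begins a character (or to the end of the buffer).
std::size_t next_char_start(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    while (pos < s.size() &&
           is_extension_byte(static_cast<unsigned char>(s[pos])))
        ++pos;
    return pos;
}
```

Note that ASCII bytes and lead bytes both fail the test, so the scan stops at either one.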
RFC
UTF-8 http://www.faqs.org/rfcs/rfc2279.html
UTF-7 http://www.faqs.org/rfcs/rfc2152.html
UTF-16 http://www.faqs.org/rfcs/rfc2781.html
-----
This is why my format was for i18n and was only 8 bits.
A user whose device is UTF-8 would deal with it like a
basic_stream<char>, but whenever they had a wchar_t string
they would just encode it.
  cout << "hello " << to_utf8(wstring(L"bob"));
The same goes for input.
  string s;
  cin >> s;
  wstring name = from_utf8(s);
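
Neither to_utf8 nor from_utf8 exists anywhere yet, so here is one way they might look, purely as a sketch. It assumes each wchar_t holds one UCS code directly (no UTF-16 surrogate handling), which means codes wider than wchar_t are truncated on platforms where wchar_t is 16 bits. The decoder skips malformed or truncated sequences and does not reject overlong forms. append_utf8 is a made-up helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append the shortest UTF-8 form of one code.
void append_utf8(std::string& out, std::uint32_t c) {
    assert(c <= 0x7FFFFFFFu);  // also catches negative wchar_t values
    if (c < 0x80) { out += static_cast<char>(c); return; }
    int n = c < 0x800 ? 2 : c < 0x10000 ? 3 : c < 0x200000 ? 4
          : c < 0x4000000 ? 5 : 6;
    unsigned char mark = static_cast<unsigned char>(0xFF << (8 - n));
    out += static_cast<char>(mark | (c >> (6 * (n - 1))));
    for (int i = n - 2; i >= 0; --i)
        out += static_cast<char>(0x80 | ((c >> (6 * i)) & 0x3F));
}

// Sketch of to_utf8: encode every wide character in turn.
std::string to_utf8(const std::wstring& w) {
    std::string out;
    for (wchar_t ch : w) append_utf8(out, static_cast<std::uint32_t>(ch));
    return out;
}

// Sketch of from_utf8: decode sequences, dropping anything malformed.
std::wstring from_utf8(const std::string& s) {
    std::wstring out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i++]);
        if (lead < 0x80) { out += static_cast<wchar_t>(lead); continue; }
        // The number of leading one-bits in the lead byte is the length.
        int n = 0;
        while (n < 8 && (lead & (0x80 >> n))) ++n;
        if (n < 2 || n > 6) continue;  // stray extension byte, or FE/FF
        std::uint32_t c = lead & (0x7F >> n);
        int got = 1;
        while (got < n && i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            c = (c << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
            ++got;
        }
        if (got == n) out += static_cast<wchar_t>(c);  // else truncated: drop
    }
    return out;
}
```

With these in place the two snippets above work as written for a UTF-8 device.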
Of course, having a wide version of format isn't going to hurt,
as it would still solve the three problems that need solving there:
- prevent broken text strings
- allow reordering
- allow non-sticky formatting
Hope that brings you up to speed.
--Karl