From: Nils Springob (nils.springob_at_[hidden])
Date: 2006-09-30 14:24:05

The central difference between ANSI strings and UTF-8 strings is that
character access by index is simple for ANSI strings but difficult for UTF-8
encoded strings. std::basic_string can hold UTF-8, UTF-16 and UTF-32 encoded
strings, but it offers no access to the decoded string, i.e. to the Unicode
values of the characters.
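
To illustrate why indexed access is the hard part, here is a minimal sketch of
reading the n-th character out of a UTF-8 encoded std::string. The function
name code_point_at is made up for this example and is not part of my wrapper;
it assumes well-formed UTF-8 and an index that is in range.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the wrapper under discussion): returns the
// index-th code point of a UTF-8 string. Assumes well-formed input and an
// in-range index; no validation is attempted.
char32_t code_point_at(const std::string& utf8, std::size_t index)
{
    std::size_t pos = 0;

    // Skip `index` characters. Each one occupies 1 to 4 bytes, so the byte
    // offset can only be found by walking the string from the start.
    for (std::size_t i = 0; i < index; ++i) {
        unsigned char lead = static_cast<unsigned char>(utf8[pos]);
        pos += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    }

    unsigned char lead = static_cast<unsigned char>(utf8[pos]);
    if (lead < 0x80)
        return lead;  // single-byte (ASCII) character

    // Number of continuation bytes that follow the lead byte.
    std::size_t extra = lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;

    // Payload bits of the lead byte: 5, 4 or 3 bits depending on the length.
    char32_t cp = lead & (0x3F >> extra);

    // Each continuation byte contributes its low 6 bits.
    for (std::size_t k = 1; k <= extra; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(utf8[pos + k]) & 0x3F);

    return cp;
}
```

The skip loop makes this O(n) in the index, whereas operator[] on the raw
std::string is constant time but only returns a single encoded byte.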

> However it isn't basic_string and it means it is isolated from the rest of
> standard library. In perfect world I would expect to read/write utf_strings
> from std::streams in the same way it is provided for std::string i.e. all the
> operations like operator>>, getline and so on should be usable on
> utf_strings.
It is always possible to access the basic_string<> data by calling raw()! The
standard string interface requires character access by index, which can't be
implemented efficiently for utf8 and utf16 encoded strings.
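
As a rough sketch of what I mean: the class below is only a stand-in written
for this mail, not the actual wrapper. Only the name raw() is taken from my
proposal; that it returns a mutable reference, and the names
utf8_string_sketch, length() and read_line(), are assumptions made for the
example.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the wrapper; only raw() is named in this thread.
class utf8_string_sketch {
public:
    // Access to the underlying encoded storage (assumed to be mutable here).
    std::string& raw() { return data_; }
    const std::string& raw() const { return data_; }

    // Number of characters: every byte that is not a continuation byte
    // (bit pattern 10xxxxxx) starts a new code point.
    std::size_t length() const
    {
        std::size_t n = 0;
        for (unsigned char c : data_)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }

private:
    std::string data_;
};

// Reads one line into the wrapper by delegating to std::getline on the
// raw storage, so the existing stream machinery is reused unchanged.
std::istream& read_line(std::istream& in, utf8_string_sketch& out)
{
    return std::getline(in, out.raw());
}
```

With something like this, stream input goes through the encoded
std::string, while character counts and character values come from the
decoded view.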

> So in this area I basically identify with Matt Austern's proposal for the
> C++0x ( ).
I see my approach as an addition to Matt Austern's proposal. While Matt is
handling encoded strings, my approach deals with decoded strings. The encoded
string types are std::string, std::ustring, and std::u32string. These strings
allow access to the raw values of the encoded words. My wrapper allows access to
the strings at a symbolic level. It allows conversion between the different
encodings and also gives access to the Unicode values of the characters as
char32_t values.

8bit word array -> std::string
16bit word array -> std::ustring
32bit word array -> std::u32string

utf8 encoded strings -> utf8_string (based on std::string)
utf16 encoded strings -> utf16_string (based on std::ustring)
utf32 encoded strings -> utf32_string (based on std::u32string)

but the approach also allows:
latin-1 encoded strings -> latin1_string (based on std::string)
windows-1252 encoded strings -> windows_1252_string (based on std::string)
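
A conversion between two of these encodings could look roughly like the
following. This is just a sketch: latin1_to_utf8 is a free function invented
for this mail, not the interface of the wrapper. It relies on the fact that
every latin-1 byte value is identical to its Unicode code point
(U+0000 to U+00FF).

```cpp
#include <string>

// Hypothetical helper: re-encodes a latin-1 string as UTF-8.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string out;
    out.reserve(latin1.size());

    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // ASCII range: the byte is the same in both encodings.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF needs two bytes: 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

The same loop would not be correct for windows-1252, because its bytes in
the range 0x80 to 0x9F do not map to the code points with the same value, so
that encoding needs a lookup table for this range.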

