Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: James Porter (porterj_at_[hidden])
Date: 2008-11-19 17:27:07


Zach Laine wrote:
> I would love to see a Unicode support library added to Boost.
> However, I question the usefulness of another string class, or in this
> case another hierarchy of string classes. Interoperability with
> std::string (and QString, and CString, and a thousand other
> API-specific string classes) is always thorny. I'd much rather see an
> iterators- and algorithms-based approach, along the lines of your
> ct_string::iterator.

It might get equally thorny just trying to get the algorithms to
recognize all the strange varieties of strings out there without writing
iterator facades for the lot of them! It's probably possible, but I'm
not sure I'd want it to be the primary interface for encoding. Most custom
string types (both QString and CString, for instance) are designed to
work with only one encoding (UTF-16 seems popular), so if you had some
reason that you needed to store your strings in UTF-8, or - god forbid -
Shift-JIS, you'd be out of luck.

This is especially important when you're reading in arbitrary data whose
encoding you don't know at compile-time. If someone sends me a message
encoded in Shift-JIS and I want to forward it on, I don't want to have
to decode it into UTF-8 and then re-encode it into Shift-JIS before I
send it; I just want to store it in Shift-JIS.
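
To make the forwarding case concrete, here is a rough sketch of what I
mean by a runtime-tagged string. The names (encoding, tagged_string,
receive, forward) are made up for illustration and are not part of the
proof of concept; the point is only that the bytes and their tag travel
together, so passing a message along never touches the bytes.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical runtime encoding tag; illustrative only.
enum class encoding { utf8, utf16, shift_jis };

// Raw bytes plus a tag recording how they should be interpreted.
struct tagged_string {
    std::string bytes;
    encoding    enc;
};

// Wrap incoming bytes with whatever encoding the sender declared.
inline tagged_string receive(std::string bytes, encoding declared) {
    return tagged_string{std::move(bytes), declared};
}

// Forwarding copies bytes and tag unchanged: no decode, no re-encode.
inline tagged_string forward(const tagged_string& msg) {
    return msg;
}

// Usage: a Shift-JIS message goes out exactly as it came in.
inline void example() {
    tagged_string in  = receive("\x83\x65\x83\x58\x83\x67", encoding::shift_jis);
    tagged_string out = forward(in);
    assert(out.enc == encoding::shift_jis);
    assert(out.bytes == in.bytes);
}
```

A real design would presumably hide the bytes behind an interface that
consults the tag, but even this much avoids the round trip through
UTF-8.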

> Instead of doing this:
>
>> baz.encode(bar,rt::utf8);
>
> I'd rather be able to do something like this:
>
> typedef std::basic_string<some_32bit_char_type> unicode_string;
>
> unicode_string u_string = /*...*/;
> std::string std_string = /*...*/;
>
> typedef boost::recoding_iterator<boost::ucs4, boost::utf8> ucs4_to_utf8_iter;
> std::copy(ucs4_to_utf8_iter(u_string.begin()),
> ucs4_to_utf8_iter(u_string.end()), std::back_inserter(std_string));

std::strings aren't really appropriate for this purpose, at least not
without a lot of changes to their interface, since they're designed for
compile-time-tagged, fixed-width-encoding strings. In your examples, you
have to remember what the source encoding is. This is easy enough if you
know that "all my strings are in UTF-8", but if you start working with
runtime-tagged strings (see my Shift-JIS example above), you'd need to
keep track of every encoding in use.
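
As a small illustration of the fixed-width assumption: std::string
counts code units, not characters, so with UTF-8 data size() and the
number of code points disagree as soon as there is a multi-byte
sequence. The helper below (utf8_code_points, a name I made up) is only
a sketch and assumes well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes well-formed input; performs no validation.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte starts a new code point
        }
    }
    return n;
}
```

For the bytes "na\xC3\xAFve", size() reports 6 while the helper reports
5, and nothing in the std::string itself says which interpretation
applies.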

- Jim

