From: James Porter (porterj_at_[hidden])
Date: 2007-09-26 17:57:55


I think we could use the locale/code conversion functionality available
in the standard I/O streams library to minimize the amount of new code
needed and to make it more, well, standard. In general, I'd expect most
code conversions to be occurring during I/O anyway (exceptions to this
could probably be handled using stringstreams). Appendix D of "The C++
Programming Language" has a fair amount of information on the topic
(online here: http://www.research.att.com/~bs/3rd_loc0.html )

The I/O streams' code conversion (through std::codecvt) can potentially
convert between any two encodings/character sets, assuming code is
written for that particular conversion. std::codecvt takes 3 template
parameters: the internal character type, the external character type,
and a conversion state type (which I'm treating here as the conversion
scheme). We could extend this to effectively take 4 parameters,
replacing the single conversion scheme with a pair: one from the
internal encoding to the character set itself, and one from the
character set to the external encoding. So something like this:

   std::codecvt< utf16,utf8,pair<utf16_to_ucs4,ucs4_to_utf8> >

would convert an internal UTF-16 encoding of a string to an external
UTF-8 encoding.
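
For reference, here is a minimal sketch of how a codecvt facet is driven
today, calling the standard std::codecvt<wchar_t, char, std::mbstate_t>
facet directly rather than through a stream. The helper name to_internal
is made up for illustration, and the sketch assumes the classic "C"
locale, so only plain ASCII input is expected to convert portably.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an existing Boost or standard function):
// converts an externally encoded narrow string to the internal wide
// representation using the codecvt facet of the given locale.
std::wstring to_internal(const std::string& external,
                         const std::locale& loc = std::locale::classic())
{
    using facet_type = std::codecvt<wchar_t, char, std::mbstate_t>;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    // Each external char yields at most one wchar_t, so this is enough room.
    std::wstring internal(external.size(), L'\0');
    if (external.empty())
        return internal;

    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* from_begin = external.data();
    const char* from_end = from_begin + external.size();
    const char* from_next = nullptr;
    wchar_t* to_begin = &internal[0];
    wchar_t* to_next = nullptr;

    const std::codecvt_base::result r =
        cvt.in(state, from_begin, from_end, from_next,
               to_begin, to_begin + internal.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("code conversion failed");

    // A result of ok means the whole input range was consumed.
    assert(from_next == from_end);

    // Trim to the number of wide characters actually written.
    internal.resize(static_cast<std::size_t>(to_next - to_begin));
    return internal;
}
```

The third template argument here is std::mbstate_t, which is the slot the
pair of conversion schemes above would occupy.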

However, an I/O stream can only have one codecvt instance at a time (via
imbuing a locale), so this raises the question of how we should handle
streaming out two Unicode strings with different encodings.
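
To make the limitation concrete, here is a small sketch showing that a
second imbue() replaces the codecvt facet installed by the first. The
facet types to_utf8_facet and to_utf16_facet are empty placeholders
invented for this example (real ones would override do_in/do_out), and a
string stream is used only to avoid touching the file system.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <sstream>

using base_codecvt = std::codecvt<wchar_t, char, std::mbstate_t>;

// Hypothetical placeholder facets. They inherit base_codecvt::id, so
// either one occupies the single codecvt slot of a locale.
struct to_utf8_facet : base_codecvt {};
struct to_utf16_facet : base_codecvt {};

// Returns true if the codecvt facet currently imbued in s has type Facet.
template <class Facet>
bool stream_uses(const std::wios& s)
{
    const base_codecvt& f = std::use_facet<base_codecvt>(s.getloc());
    return dynamic_cast<const Facet*>(&f) != nullptr;
}

// Imbues two locales in turn and reports whether only the second facet
// remains attached to the stream.
bool second_imbue_replaces_first()
{
    std::wostringstream out;

    // The locale takes ownership of the facet allocated with new.
    out.imbue(std::locale(std::locale::classic(), new to_utf8_facet));
    assert(stream_uses<to_utf8_facet>(out));

    out.imbue(std::locale(std::locale::classic(), new to_utf16_facet));
    return stream_uses<to_utf16_facet>(out)
        && !stream_uses<to_utf8_facet>(out);
}
```

So a stream that has to emit two differently encoded strings would need
to re-imbue between them, or the strings would need to be converted
before they reach the stream.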

On a different note, does anyone see a practical use in having (mutable)
strings with variable-width character encodings? I can't think of any
practical use for them that wouldn't be equally well-served with an
array of bytes (like the email MIME-type example).

As for run-time tagging of strings, I doubt it would work very well,
since it would be difficult to extend a run-time tagged string class to
handle new encodings/character sets.

- James

Phil Endecott wrote:
> I would definitely encourage breaking the work up into smaller chunks.
> IMHO "smaller is better" for Boost libraries; there have been a number
> of occasions when I've discovered that a feature I want is hidden as an
> internal component of a Boost library, and I've felt that it should
> have been a stand-alone public entity. So let's think about how this
> work can be split up:
>
> - A charset_trait class. I have started on this. The missing piece is
> a way to look up traits of character sets that are known at run-time;
> input would be appreciated.
>
> - Compile-time and run-time tagged strings. The basics of this are
> straightforward and done.
>
> - Conversions. My approach at present is to use iconv via a functor
> that I wrote a while ago. I believe iconv is widely available;
> however, some implementations may support only a small set of character
> sets. Alternatives would be interesting.
>
> - Variable width iterators, including the issue that you raised above.
>
> - Interaction with locales, internationalisation, and system APIs.
>
> and no doubt more. Thinking about the interfaces between these areas
> and the user would be a good place to start.
>
>
> Regards,
>
> Phil.