From: Reece Dunn (msclrhd_at_[hidden])
Date: 2006-07-06 06:51:17


Sean Parent wrote:
> What we do need - are good standard algorithms which can be applied
> to any string class.

Agreed. This is where Boost.StringAlgorithms comes in.

What I am interested in is *efficient* codepage -> codepage conversion.
For example, I may want to read a file that is stored as 8-bit ASCII and
convert it to UTF8. Likewise, I may want to take UTF8 data and save it as
UTF16, MacRoman or some other encoding.
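
As a concrete instance of such a conversion, here is a minimal sketch that
assumes the 8-bit data is ISO-8859-1, where each byte value equals its code
point. That assumption and the names append_utf8 and latin1_to_utf8 are mine,
purely for illustration. Each byte is widened to a code point and then
encoded as UTF8:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF8 encoding of one code point (assumed to be <= 0x10FFFF).
inline void append_utf8( std::uint32_t cp, std::string & out )
{
   if( cp < 0x80 )
   {
      out += static_cast< char >( cp );
   }
   else if( cp < 0x800 )
   {
      out += static_cast< char >( 0xC0 | ( cp >> 6 ));
      out += static_cast< char >( 0x80 | ( cp & 0x3F ));
   }
   else if( cp < 0x10000 )
   {
      out += static_cast< char >( 0xE0 | ( cp >> 12 ));
      out += static_cast< char >( 0x80 | (( cp >> 6 ) & 0x3F ));
      out += static_cast< char >( 0x80 | ( cp & 0x3F ));
   }
   else
   {
      out += static_cast< char >( 0xF0 | ( cp >> 18 ));
      out += static_cast< char >( 0x80 | (( cp >> 12 ) & 0x3F ));
      out += static_cast< char >( 0x80 | (( cp >> 6 ) & 0x3F ));
      out += static_cast< char >( 0x80 | ( cp & 0x3F ));
   }
}

// Assumption: the input is ISO-8859-1, so byte value == code point.
inline std::string latin1_to_utf8( const std::string & latin1 )
{
   std::string utf8;
   for( std::size_t i = 0; i < latin1.size(); ++i )
   {
      const std::uint32_t cp = static_cast< unsigned char >( latin1[ i ] );
      append_utf8( cp, utf8 );
   }
   return utf8;
}
```

Other 8-bit codepages (MacRoman, for instance) would need a lookup table for
the upper 128 byte values instead of the direct widening used here.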

What you need is an encoding -> UTF32 converter and a UTF32 ->
encoding converter. Ideally, I would like each of these to be as
efficient as possible. They should also be able to accept partial data.
That is, if I am reading in a UTF8 file in blocks, a block boundary can
fall in the middle of a multi-byte character. This can be solved in a
random access stream by seeking back to the start of that character, but
this is not always possible. Consider:

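   // uchar32_t, utf8_to_utf32_cvt and stream_inserter are placeholder names
   std::basic_ostringstream< uchar32_t > utf;
   // streams take a locale via imbue(); not sure on exact code here
   utf.imbue( std::locale( utf.getloc(), new utf8_to_utf32_cvt()));

   std::string utf8 = some_utf8_data();
   std::copy( utf8.begin(), utf8.end(), stream_inserter( utf ));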

The problem with this is that the stringstream's character type is
uchar32_t, but the data being inserted has an *input* character type of char.
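
One way to accept partial data without seeking is to keep the state of the
unfinished character between blocks. The following is only a sketch: the
utf8_decoder class and its members are hypothetical, and it does not reject
overlong forms or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical incremental UTF8 -> UTF32 decoder. The state of a partially
// received character is kept between calls, so no seeking is needed.
class utf8_decoder
{
public:
   utf8_decoder() : cp_( 0 ), pending_( 0 ) {}

   // Decode one block; complete code points are appended to out.
   void feed( const std::string & block, std::vector< std::uint32_t > & out )
   {
      for( std::size_t i = 0; i < block.size(); ++i )
      {
         const unsigned char b = static_cast< unsigned char >( block[ i ] );
         if( pending_ > 0 )
         {
            if(( b & 0xC0 ) == 0x80 ) // continuation byte
            {
               cp_ = ( cp_ << 6 ) | ( b & 0x3F );
               if( --pending_ == 0 ) out.push_back( cp_ );
               continue;
            }
            // Sequence cut short: emit U+FFFD, then treat b as a new lead byte.
            out.push_back( 0xFFFD );
            pending_ = 0;
         }
         if( b < 0x80 )                 { out.push_back( b ); }
         else if(( b & 0xE0 ) == 0xC0 ) { cp_ = b & 0x1F; pending_ = 1; }
         else if(( b & 0xF0 ) == 0xE0 ) { cp_ = b & 0x0F; pending_ = 2; }
         else if(( b & 0xF8 ) == 0xF0 ) { cp_ = b & 0x07; pending_ = 3; }
         else                           { out.push_back( 0xFFFD ); } // invalid lead byte
      }
   }

   // True when the last block ended in the middle of a character.
   bool incomplete() const { return pending_ > 0; }

private:
   std::uint32_t cp_; // bits accumulated so far
   int pending_;      // continuation bytes still expected
};

// Usage sketch: U+20AC (bytes E2 82 AC) split across two blocks.
inline std::vector< std::uint32_t > decode_split_euro()
{
   utf8_decoder dec;
   std::vector< std::uint32_t > out;
   dec.feed( "\xE2\x82", out ); // block ends mid-character
   assert( dec.incomplete());
   dec.feed( "\xAC", out );     // next block completes it
   return out;
}
```

The caller can check incomplete() at end of input to detect a truncated file.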

The conversion mappings are of the form:

   n source characters -> m destination characters

where these may be encoded byte sequences (e.g. UTF8),
surrogate pairs (e.g. UTF16) or combining characters (e.g.
a + umlaut).
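
To illustrate the 1 -> m case, here is a small sketch of a UTF32 -> UTF16
step, in which one code point becomes either one 16-bit unit or a surrogate
pair. The function name append_utf16 is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append the UTF16 encoding of one code point to out.
// Returns the number of 16-bit units written (1 or 2).
inline int append_utf16( std::uint32_t cp, std::vector< std::uint16_t > & out )
{
   // Precondition: a valid code point that is not itself a surrogate.
   assert( cp <= 0x10FFFF && !( cp >= 0xD800 && cp <= 0xDFFF ));
   if( cp < 0x10000 )
   {
      out.push_back( static_cast< std::uint16_t >( cp ));
      return 1;
   }
   cp -= 0x10000; // 20 bits remain, split 10/10 across the pair
   out.push_back( static_cast< std::uint16_t >( 0xD800 + ( cp >> 10 )));
   out.push_back( static_cast< std::uint16_t >( 0xDC00 + ( cp & 0x3FF )));
   return 2;
}
```

Combining characters are a different matter: mapping "a + umlaut" to a single
precomposed character needs normalization data, which this sketch does not
attempt.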

I am not sure how good locales are for this kind of functionality
and also how good C++ streams are for this. However, it would
be nice to have a stream interface (i.e. << and >>).

- Reece

