From: Reece Dunn (msclrhd_at_[hidden])
Date: 2006-07-06 06:51:17
Sean Parent wrote:
> What we do need - are good standard algorithms which can be applied
> to any string class.
Agreed. This is where Boost.StringAlgorithms comes in.
What I am interested in is *efficient* codepage -> codepage conversion.
For example, I may want to read in a file that is stored as 8-bit ASCII and
convert it to UTF8. Likewise, I may want to take UTF8 data and save it as
UTF16, MacRoman or some other encoding.
What you need is an encoding -> UTF32 converter and a UTF32 ->
encoding converter. Ideally, I would like each of these to be as
efficient as possible. They should also be able to accept partial data.
That is, if I am reading in a UTF8 file in blocks, a block can end in
the middle of a multi-byte character. This can be solved in a random
access stream by seeking back to the start of that character, but this
is not always possible. Consider:
    std::basic_ostringstream< uchar32_t > utf;
    // utf8_to_utf32_cvt stands for some codecvt facet; not sure on exact code here
    utf.imbue( std::locale( utf.getloc(), new utf8_to_utf32_cvt ));
    std::string utf8 = some_utf8_data();
    std::copy( utf8.begin(), utf8.end(), stream_inserter( utf ));
The problem with this is that the stream's character type is uchar32_t,
but the data being inserted has a character type of char.
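To make the partial-data requirement concrete, here is a rough sketch of a UTF8 -> UTF32 decoder that keeps its state between calls, so a block that ends mid-character is simply completed by the next block. The class name utf8_decoder and its members feed() and incomplete() are made up for illustration; they are not part of Boost or the standard library. Validation is deliberately minimal (overlong forms and encoded surrogates are not rejected).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not an existing Boost or standard library class.
// Decodes UTF-8 to UTF-32 incrementally. An unfinished trailing sequence is
// remembered in the object and completed by the next call to feed().
class utf8_decoder {
public:
    // Append the code points decoded from [data, data + size) to out.
    void feed(const char* data, std::size_t size, std::vector<char32_t>& out) {
        for (std::size_t i = 0; i < size; ++i) {
            const unsigned char b = static_cast<unsigned char>(data[i]);
            if (pending_ == 0) {
                // Start of a new character: the lead byte gives the length.
                if (b < 0x80)                { out.push_back(b); }
                else if ((b & 0xE0) == 0xC0) { value_ = b & 0x1F; pending_ = 1; }
                else if ((b & 0xF0) == 0xE0) { value_ = b & 0x0F; pending_ = 2; }
                else if ((b & 0xF8) == 0xF0) { value_ = b & 0x07; pending_ = 3; }
                else                         { out.push_back(0xFFFD); } // invalid lead byte
            } else if ((b & 0xC0) == 0x80) {
                // Continuation byte: shift in six more payload bits.
                value_ = (value_ << 6) | (b & 0x3F);
                if (--pending_ == 0) out.push_back(value_);
            } else {
                // A continuation byte was expected. Emit U+FFFD for the broken
                // sequence and look at this byte again as a possible lead byte.
                // (--i followed by the loop's ++i revisits index i; unsigned
                // wrap-around at i == 0 is well defined.)
                out.push_back(0xFFFD);
                pending_ = 0;
                --i;
            }
        }
    }

    // True if the input seen so far ends in the middle of a character.
    bool incomplete() const { return pending_ != 0; }

private:
    char32_t value_ = 0;  // payload bits collected for the current character
    int pending_ = 0;     // continuation bytes still expected
};
```

Feeding "a" plus the first two bytes of the euro sign in one call and the remaining byte plus "b" in the next yields U+0061, U+20AC, U+0062, with incomplete() returning true between the two calls. No seeking is needed, which is what makes this usable on non-random-access input.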
The conversion mappings are of the form:
n source characters -> m destination characters
where these may be encoded byte sequences (e.g. UTF8),
surrogate pairs (e.g. UTF16) or combining characters (e.g.
a + umlaut).
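As a small illustration of the "n -> m" shape on the UTF32 -> encoding side, the sketch below turns one UTF32 code point into one or two UTF16 code units, or into one to four UTF8 bytes. The function names append_utf16 and append_utf8 are invented for this example, and both assume the caller passes a valid code point (at most 0x10FFFF and not in the surrogate range 0xD800..0xDFFF). Combining characters are not handled here; that would need normalisation on top.

```cpp
#include <string>

// Hypothetical helpers for illustration only. Both assume cp is a valid
// Unicode scalar value (<= 0x10FFFF, not in 0xD800..0xDFFF).

// 1 code point -> 1 or 2 UTF-16 code units.
void append_utf16(char32_t cp, std::u16string& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10 over a surrogate pair
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// 1 code point -> 1 to 4 UTF-8 bytes.
void append_utf8(char32_t cp, std::string& out) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```

With UTF32 as the pivot, each supported encoding only needs one decoder and one encoder of this kind, rather than a dedicated converter for every pair of encodings.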
I am not sure how good locales are for this kind of functionality
and also how good C++ streams are for this. However, it would
be nice to have a stream interface (i.e. << and >>).
- Reece