From: Graham (Graham_at_[hidden])
Date: 2005-08-22 08:32:54


>From: Daryle Walker <darylew_at_[hidden]>

Dear Daryle,

>I thought of these functions while considering how Wave processes the various
>phases of C++ translation (see section 2.1 of the standard). I wanted the
>conversion to be one native-character to one code-point because that is how
>Phase 1 implies it[1]. If you don't think that's right, maybe we should
>file a defect with the Standard committee.

Maybe I misunderstood you, but you seem to be asking for code page conversion here, not UTF conversion.
Code page conversion is anything but simple and requires data conversion tables for many of the code pages. It is not very practical at the character level, as you can see from the calls below. It would instead require conversion objects to be created that can then implement these functions.

int_fast32_t char_to_Unicode(const std::string& locale, char c);     // very inefficient: locale resolved on every call
int_fast32_t wchar_to_Unicode(const std::string& locale, wchar_t c); // very inefficient

convertor conv1250(locale);   // better: conversion data set up once
conv1250.char_to_Unicode(c);  // each call then reuses it
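
To make the difference concrete, here is a minimal sketch of what such a conversion object might look like. Everything in it is an assumption for illustration: the class layout, the "windows-1250" name string, the use of -1 for "no mapping", and above all the table, which holds only a handful of entries where a real one would cover the whole upper half of the code page.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical conversion object: the table is built once in the
// constructor, so each per-character call is a single array lookup.
class convertor {
public:
    explicit convertor(const std::string& code_page) {
        table_.fill(-1);  // -1 marks "no mapping in this sketch"
        for (std::size_t i = 0; i < 0x80; ++i) {
            // The ASCII half maps to the same code point values.
            table_[i] = static_cast<std::int_fast32_t>(i);
        }
        if (code_page == "windows-1250") {
            // Deliberately partial: a few upper-half entries only.
            table_[0x80] = 0x20AC;  // EURO SIGN
            table_[0x8A] = 0x0160;  // LATIN CAPITAL LETTER S WITH CARON
            table_[0x9A] = 0x0161;  // LATIN SMALL LETTER S WITH CARON
            table_[0x8E] = 0x017D;  // LATIN CAPITAL LETTER Z WITH CARON
            table_[0x9E] = 0x017E;  // LATIN SMALL LETTER Z WITH CARON
        } else {
            throw std::invalid_argument("unknown code page: " + code_page);
        }
    }

    // Returns the Unicode code point for c, or -1 if the table has no entry.
    std::int_fast32_t char_to_Unicode(char c) const {
        // Go through unsigned char so bytes >= 0x80 index correctly
        // even where plain char is signed.
        return table_[static_cast<unsigned char>(c)];
    }

private:
    std::array<std::int_fast32_t, 256> table_;
};
```

The point of the object is that the cost of finding and loading the table is paid once per code page rather than once per character, which is what makes the free-function form above so inefficient.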

>[1] In other words, any extended native character (i.e. not a character C++
>uses for parsing) must be mapped to one C++ Unicode name, which maps to a
>single code-point.

That depends on what you mean. For example, e-acute can be one Unicode character, <e acute>, or two, <e><acute>, depending on how it is normalised.
A character parser should understand this if it wants to present Unicode graphemes, which are the default unit of parsing in Unicode and which any compliant native handler should support, even when working with the local code page.
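
As an illustration of the e-acute case, the sketch below shows the two code point sequences and a toy function that folds the two-character form into the one-character form. It is not a normalisation implementation: it knows exactly one pair, ignores combining-mark reordering, and the function and constant names are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Code points involved in the e-acute example.
constexpr char32_t kSmallE         = 0x0065;  // LATIN SMALL LETTER E
constexpr char32_t kCombiningAcute = 0x0301;  // COMBINING ACUTE ACCENT
constexpr char32_t kSmallEAcute    = 0x00E9;  // LATIN SMALL LETTER E WITH ACUTE

// Toy composition step (hypothetical helper): replaces every
// <e><combining acute> pair with the single precomposed character
// and copies everything else through unchanged.
std::u32string compose_e_acute(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool pair = in[i] == kSmallE &&
                          i + 1 < in.size() &&
                          in[i + 1] == kCombiningAcute;
        if (pair) {
            out.push_back(kSmallEAcute);
            ++i;  // skip the combining mark that was just merged
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Both forms display as the same grapheme, yet one has length 1 and the other length 2 in code points, so a "one native character to one code point" rule only holds once a normalisation form has been chosen.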

I suggest you wait for the release of the Unicode library. Alternatively, if you want to volunteer to write lots of code page conversion functions, and their data tables where necessary, please let us know.

Yours,

Graham



