From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2004-10-19 05:27:52


I've recently started on the first draft of a Unicode library.

One assumption I think is wrong is that wchar_t is suitable for
Unicode. Correct me if I'm wrong, but IIRC wchar_t is only 16 bits
wide on Microsoft compilers, for example. On these compilers the
utf8_codecvt_facet implementation will cut off any code point above
0xFFFF. (U+1D12C will come out as U+D12C.)
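
To make the truncation concrete, here is a minimal sketch. The helper
names (truncate_to_16_bits, surrogate_pair, to_surrogates) are made up
for this example and are not part of utf8_codecvt_facet or any other
library. It shows what is left of a code point when only the low 16
bits are kept, and the UTF-16 surrogate pair that would be needed to
preserve it instead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the value that survives when a code point is
// stored in a 16-bit wchar_t (only the low 16 bits are kept).
inline std::uint16_t truncate_to_16_bits(std::uint32_t code_point) {
    return static_cast<std::uint16_t>(code_point & 0xFFFFu);
}

// Hypothetical type: the two UTF-16 code units for one code point
// outside the Basic Multilingual Plane.
struct surrogate_pair {
    std::uint16_t high;
    std::uint16_t low;
};

// Splits a code point in the range U+10000..U+10FFFF into a UTF-16
// surrogate pair.
inline surrogate_pair to_surrogates(std::uint32_t code_point) {
    assert(code_point >= 0x10000u && code_point <= 0x10FFFFu);
    const std::uint32_t v = code_point - 0x10000u;  // 20 significant bits
    surrogate_pair p;
    p.high = static_cast<std::uint16_t>(0xD800u + (v >> 10));    // top 10 bits
    p.low  = static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu)); // low 10 bits
    return p;
}
```

With these, truncate_to_16_bits(0x1D12C) gives 0xD12C, while
to_surrogates(0x1D12C) gives the pair 0xD834, 0xDD2C.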

I think defining unicode::code as uint32_t would be much better. The
problem is that the standard library only provides codecvt for wchar_t
and char, so it's not possible to make a Unicode codecvt without
manually adding (dummy) implementations of
codecvt<unicode::code,char,mbstate_t> to the std namespace. I guess
this is the reason that Ron Garcia just used wchar_t.

About Unicode strings:
I suggest having a codepoint_string, with the string of code units as
a template parameter. Its interface should work with 21-bit values
(stored in 32 bits), while internally these are converted to UTF-8 or
UTF-16, or remain UTF-32.
template <class CodeUnitString> class codepoint_string {
    CodeUnitString code_units;
    // ...
};
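
As a rough illustration of the idea, here is a minimal append-only
sketch. It assumes the code unit string holds UTF-8 in char-sized
units; the class name codepoint_string_sketch and its members are
invented for this example, and a real design would also need
iteration, decoding, and a choice of encoding per code unit type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch only: the interface takes 32-bit code point values, the
// storage is a UTF-8 code unit string. Surrogate values
// (U+D800..U+DFFF) are not rejected here, for brevity.
template <class CodeUnitString>
class codepoint_string_sketch {
public:
    codepoint_string_sketch() : length_(0) {}

    // Appends one code point, encoding it as 1 to 4 UTF-8 code units.
    void push_back(std::uint32_t cp) {
        assert(cp <= 0x10FFFFu);
        if (cp < 0x80u) {
            put(cp);
        } else if (cp < 0x800u) {
            put(0xC0u | (cp >> 6));
            put(0x80u | (cp & 0x3Fu));
        } else if (cp < 0x10000u) {
            put(0xE0u | (cp >> 12));
            put(0x80u | ((cp >> 6) & 0x3Fu));
            put(0x80u | (cp & 0x3Fu));
        } else {
            put(0xF0u | (cp >> 18));
            put(0x80u | ((cp >> 12) & 0x3Fu));
            put(0x80u | ((cp >> 6) & 0x3Fu));
            put(0x80u | (cp & 0x3Fu));
        }
        ++length_;
    }

    // Number of code points, not code units.
    std::size_t size() const { return length_; }

    const CodeUnitString& code_units() const { return code_units_; }

private:
    // Stores one byte-sized code unit.
    void put(std::uint32_t unit) {
        typedef typename CodeUnitString::value_type unit_type;
        code_units_.push_back(
            static_cast<unit_type>(static_cast<unsigned char>(unit)));
    }

    CodeUnitString code_units_;
    std::size_t length_;
};
```

For instance, appending U+0041, U+00E9 and U+1D12C to a
codepoint_string_sketch<std::string> yields three code points stored
in seven code units.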

The real unicode::string would be the character string, which uses a
base character with its combining marks for its interface.
template <class CodePointString> class string {
    CodePointString codepoints;
    // ...
};

So unicode::string<unicode::codepoint_string<std::string> > would be a
UTF8-encoded string that is manipulated using its characters.

unicode::string should take care of correctly searching for a
character string, rather than a codepoint string.
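
A small sketch of the difference, working on UTF-32 for simplicity.
The function names are hypothetical, and is_combining_mark_sketch only
knows the Combining Diacritical Marks block (U+0300..U+036F); a real
implementation would consult the Unicode character database. The
search rejects a code point match when the following code point is a
combining mark, because the match would then end in the middle of a
character.

```cpp
#include <cstddef>
#include <string>

// Simplification for this example: treats only U+0300..U+036F as
// combining marks.
inline bool is_combining_mark_sketch(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Finds needle in haystack, accepting only matches that end on a
// character boundary. Assumes needle is non-empty and starts with a
// base character, so every match starts on a boundary already.
inline std::size_t find_character_string(const std::u32string& haystack,
                                         const std::u32string& needle) {
    std::size_t pos = haystack.find(needle);
    while (pos != std::u32string::npos) {
        const std::size_t end = pos + needle.size();
        if (end == haystack.size() ||
            !is_combining_mark_sketch(haystack[end])) {
            return pos;  // match is not followed by a combining mark
        }
        pos = haystack.find(needle, pos + 1);  // try the next candidate
    }
    return std::u32string::npos;
}
```

In U"cafe\u0301 cafe", a plain code point search for U"e" finds index
3, which is really part of the character "e" plus combining acute. The
character-aware search skips it and finds the unaccented "e" at index
9.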

operator< has never done "the right thing" anyway: it just compares
code unit values, so it does not treat uppercase and lowercase
sensibly, for example (every uppercase ASCII letter sorts before every
lowercase one). Probably, locales should be used for collation. The
Unicode collation algorithm is pretty well specified.
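
For comparison, a minimal sketch of the two kinds of ordering using
only the standard library. The function names are made up for this
example. Note that std::collate is the locale's own collation, not an
implementation of the Unicode collation algorithm, and that which
named locales are available is platform-dependent.

```cpp
#include <locale>
#include <string>

// What operator< does: a plain comparison of code unit values.
inline bool code_unit_less(const std::string& a, const std::string& b) {
    return a < b;
}

// Locale-aware comparison through the locale's std::collate facet.
inline bool collate_less(const std::string& a, const std::string& b,
                         const std::locale& loc) {
    const std::collate<char>& coll =
        std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) < 0;
}
```

code_unit_less("Zebra", "apple") is true because 'Z' (0x5A) is less
than 'a' (0x61). With the classic "C" locale collate_less gives the
same answer; a locale with dictionary-style collation, where
installed, would be expected to order "apple" first.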

Hope all this is clear...
Regards,
Rogier

