From: Erik Wien (wien_at_[hidden])
Date: 2004-10-19 11:32:50


----- Original Message -----
From: "Rogier van Dalen" <rogiervd_at_[hidden]>

> I've recently started on the first draft of a Unicode library.
>

Interesting. Is there an ongoing discussion about this library that I have
missed, or have you not posted anything about it yet? I'd hate to start
something like this if an effort on the subject is already under way.

> An assumption I think is wrong is that wchar_t would be suitable for
> Unicode. Correct me if I'm wrong, but IIRC wchar_t has 16 bits on
> Microsoft compilers, for example. The utf8_codecvt_facet
> implementation will on these compilers cut off any codepoints over
> 0xFFFF. (U+1D12C will come out as U+D12C.)
>
I agree. The "Unicode is wide strings" assumption is wrong in my opinion,
and I would strive to provide a correct implementation based on the Unicode
standard if I were to go ahead with this.
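
To make the truncation concrete, here is a minimal sketch of what happens to
U+1D12C when it is squeezed into a single 16-bit unit, next to the UTF-16
surrogate pair that would represent it correctly. The names
truncate_to_16_bits, surrogate_pair and to_surrogates are made up for this
illustration and are not part of any existing library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: what a 16-bit wchar_t effectively keeps of a code point.
std::uint16_t truncate_to_16_bits(std::uint32_t code_point) {
    return static_cast<std::uint16_t>(code_point & 0xFFFFu);
}

// Hypothetical type: the two UTF-16 code units of a supplementary code point.
struct surrogate_pair {
    std::uint16_t high;
    std::uint16_t low;
};

// Encode a code point above U+FFFF as a UTF-16 surrogate pair.
surrogate_pair to_surrogates(std::uint32_t code_point) {
    assert(code_point >= 0x10000u && code_point <= 0x10FFFFu);
    const std::uint32_t v = code_point - 0x10000u;  // 20 significant bits
    surrogate_pair p;
    p.high = static_cast<std::uint16_t>(0xD800u + (v >> 10));    // top 10 bits
    p.low  = static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu)); // low 10 bits
    return p;
}
```

With these helpers, truncate_to_16_bits(0x1D12C) yields 0xD12C, while
to_surrogates(0x1D12C) yields the pair 0xD834, 0xDD2C.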

> I think a definition of unicode::code as uint32_t would be much
> better. Problem is, codecvt is only implemented for wchar_t and char,
> so it's not possible to make a Unicode codecvt without manually adding
> (dummy) implementations of codecvt<unicode::code,char,mbstate_t> to
> the std namespace. I guess this is the reason that Ron Garcia just
> used wchar_t.
>
I don't really feel that locking the code unit size to 32 bits is a good
solution either, as strings would then become unnecessarily large. In a test
implementation I have recently made, I templated the entire encoding scheme
(using an encoding_traits class) and made a common interface for strings
that lets you iterate over the code points it controls, no matter what the
underlying encoding is. (I will post another message with more details of
this library.) This does of course cause problems with other parts of the
standard, but solutions to these problems are what I want my thesis to be all
about.
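
As a rough sketch of the idea (not the actual test implementation), a traits
class per encoding could supply the code unit type and a decode step, and a
string template could then expose code points regardless of storage. The
names utf8_traits, utf32_traits, encoded_string and code_points are
assumptions for this example, and the UTF-8 decoder assumes well-formed
input and does no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical traits for UTF-8: 8-bit code units, 1 to 4 units per code point.
struct utf8_traits {
    typedef char code_unit;
    // Decode one code point starting at s[i] and advance i past it.
    // Assumes well-formed UTF-8; no validation is performed.
    static std::uint32_t decode(const std::string& s, std::size_t& i) {
        const unsigned char lead = static_cast<unsigned char>(s[i++]);
        if (lead < 0x80) return lead;  // single-unit (ASCII) case
        const int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
        std::uint32_t cp = lead & (0x3F >> extra);  // payload bits of lead unit
        for (int k = 0; k < extra && i < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        return cp;
    }
};

// Hypothetical traits for UTF-32: one 32-bit code unit per code point.
struct utf32_traits {
    typedef char32_t code_unit;
    static std::uint32_t decode(const std::u32string& s, std::size_t& i) {
        return static_cast<std::uint32_t>(s[i++]);
    }
};

// A string that stores code units but is read as code points.
template <class Traits>
class encoded_string {
public:
    typedef std::basic_string<typename Traits::code_unit> storage;

    explicit encoded_string(const storage& units) : units_(units) {}

    // Decode the whole string into code points, whatever the encoding.
    std::vector<std::uint32_t> code_points() const {
        std::vector<std::uint32_t> out;
        for (std::size_t i = 0; i < units_.size();)
            out.push_back(Traits::decode(units_, i));
        return out;
    }

private:
    storage units_;
};
```

Under these assumptions, an encoded_string<utf8_traits> holding the bytes
41 F0 9D 84 AC and an encoded_string<utf32_traits> holding U+0041 U+1D12C
both report the same two code points.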

> About Unicode strings:
> I suggest having a codepoint_string, with the string of code units as
> a template parameter. Its interface should work with 21 (32) bits
> values, while internally these are converted to UTF-8, UTF-16, or
> remain UTF-32.
> template <class CodeUnitString> class codepoint_string {
> CodeUnitString code_units;
> // ...
> };
>
> The real unicode::string would be the character string, which uses a
> base character with its combining marks for its interface.
> template <class CodePointString> class string {
> CodePointString codepoints;
> // ...
> };
>
> So unicode::string<unicode::codepoint_string<std::string> > would be a
> UTF8-encoded string that is manipulated using its characters.
>
> unicode::string should take care of correctly searching for a
> character string, rather than a codepoint string.
>

Thanks. I will take that into consideration. I'm glad to hear any
design/implementation ideas, since I want this library to be usable by as
many people as possible.
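
For illustration only, the character level suggested above (a base character
plus its combining marks) could be sketched as a grouping pass over code
points. The names is_combining_mark and split_characters are hypothetical,
and the combining test is a deliberate simplification that only recognises
the Combining Diacritical Marks block (U+0300 to U+036F); a real
implementation would consult the Unicode character database.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint32_t> code_point_seq;

// Simplified assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining_mark(std::uint32_t cp) {
    return cp >= 0x0300u && cp <= 0x036Fu;
}

// Group code points into characters: each base code point starts a new
// character, and following combining marks are attached to it.
std::vector<code_point_seq> split_characters(const code_point_seq& in) {
    std::vector<code_point_seq> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (out.empty() || !is_combining_mark(in[i]))
            out.push_back(code_point_seq());  // start a new character
        out.back().push_back(in[i]);
    }
    return out;
}
```

Given the code points U+0065 U+0301 U+0078 ("e" with a combining acute
accent, then "x"), this yields two characters, the first made of two code
points. Searching at this level would compare whole groups rather than
individual code points.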

> operator< has never done "the right thing" anyway: it does not make a
> difference between uppercase and lowercase, for example. Probably,
> locales should be used for collation. The Unicode collation algorithm
> is pretty well specified.
>

Yes. I hope to be able to add support for the collation algorithm to enable
proper, locale-specific collation.

