
From: John Maddock (john_at_[hidden])
Date: 2004-04-14 05:35:19


> > 1) define the data types for 8/16/32 bit Unicode characters.
>
> unsigned char for UTF-8 code units, boost::uint16_t for UTF-16 code
> units, and boost::int32_t for UTF-32 code units and to represent
> Unicode code points

Almost, ICU uses wchar_t for UTF-16 on Win32 (just to complicate things).
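
For what it's worth, item (1) might reduce to something as small as this
(a sketch only - the names utf8_t/utf16_t/utf32_t are placeholders, and
whether the 32-bit type should be signed or unsigned is still open):

#include <boost/cstdint.hpp>

typedef unsigned char   utf8_t;   // UTF-8 code unit
typedef boost::uint16_t utf16_t;  // UTF-16 code unit (or wchar_t on Win32, as ICU does)
typedef boost::uint32_t utf32_t;  // UTF-32 code unit / Unicode code point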

> > 2) define iterator adapters to convert a sequence of one Unicode
> > character type to another.
>
> This is easy enough, unicode.org provides optimized C code for this
> purpose, which could easily be changed slightly for the use of iterator
> adapters. Alternatively, ICU probably less directly provides this.

Yep.
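
As a rough illustration of what such an adapter could look like, here is a
minimal read-only iterator over UTF-16 code units that yields code points,
combining surrogate pairs as it goes (a sketch only - the name
u16_to_u32_iterator is made up, and there is no validation or error
handling):

#include <cstddef>
#include <iterator>
#include <boost/cstdint.hpp>

template <class BaseIterator>
class u16_to_u32_iterator
{
public:
   typedef std::forward_iterator_tag iterator_category;
   typedef boost::uint32_t           value_type;
   typedef std::ptrdiff_t            difference_type;
   typedef const value_type*         pointer;
   typedef value_type                reference;

   explicit u16_to_u32_iterator(BaseIterator p) : m_pos(p) {}

   value_type operator*() const
   {
      boost::uint32_t c = *m_pos;
      if(c >= 0xD800u && c <= 0xDBFFu)      // high surrogate?
      {
         BaseIterator next(m_pos);
         boost::uint32_t low = *++next;     // assumes a valid low surrogate follows
         c = 0x10000u + ((c - 0xD800u) << 10) + (low - 0xDC00u);
      }
      return c;
   }

   u16_to_u32_iterator& operator++()
   {
      boost::uint32_t c = *m_pos;
      ++m_pos;
      if(c >= 0xD800u && c <= 0xDBFFu)
         ++m_pos;                           // step over the low surrogate too
      return *this;
   }

   bool operator==(const u16_to_u32_iterator& that) const
   { return m_pos == that.m_pos; }
   bool operator!=(const u16_to_u32_iterator& that) const
   { return !(*this == that); }

private:
   BaseIterator m_pos;
};

Going the other way, or to and from UTF-8, is the same kind of exercise.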

> > 3) define char_traits specialisations (as necessary) in order to get
> > basic_string working with Unicode character sequences, typedef the
> > appropriate string types:
>
> > typedef basic_string<utf8_t> utf8_string; // etc
>
> As far as the use of UTF-8 as the internal encoding, I personally would
> suggest that UTF-16 be used instead, because UTF-8 is rather inefficient
> to work with. Although I am not overly attached to UTF-16, I do think
> it is important to standardize on a single internal representation,
> because for practical reasons, it is useful to be able to have
> non-templated APIs for purposes such as collating.

You can use whatever you want - I don't think users should be constrained to
a specific internal encoding. Personally I don't like UTF8 either, but I
know some people do...

> The other issues I see with using basic_string include that many of its
> methods would not be suitable for use with a Unicode string, and it
> does not have something like an operator += which would allow appending
> of a single Unicode code point (represented as a 32-bit integer).
>
> What it comes down to is that basic_string is designed with fixed-width
> character representations in mind.
>
> I would be more in favor of creating a separate type to represent
> Unicode strings.

Personally I think we have too many string types around already. While I
understand your concerns about basic_string, as a container of code points
it's just fine IMO. We can always add non-member functions for more
advanced manipulation.
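
For example, the operator+= concern could be met with a free function
along these lines (a sketch only - "append" is just an illustrative name,
it assumes the char_traits specialisation for unsigned char from item (3),
and it does no validation of the code point):

#include <string>
#include <boost/cstdint.hpp>

typedef std::basic_string<unsigned char> utf8_string;

// Append a single Unicode code point to a string of UTF-8 code units.
inline void append(utf8_string& s, boost::uint32_t cp)
{
   if(cp < 0x80u)
      s += static_cast<unsigned char>(cp);
   else if(cp < 0x800u)
   {
      s += static_cast<unsigned char>(0xC0u | (cp >> 6));
      s += static_cast<unsigned char>(0x80u | (cp & 0x3Fu));
   }
   else if(cp < 0x10000u)
   {
      s += static_cast<unsigned char>(0xE0u | (cp >> 12));
      s += static_cast<unsigned char>(0x80u | ((cp >> 6) & 0x3Fu));
      s += static_cast<unsigned char>(0x80u | (cp & 0x3Fu));
   }
   else
   {
      s += static_cast<unsigned char>(0xF0u | (cp >> 18));
      s += static_cast<unsigned char>(0x80u | ((cp >> 12) & 0x3Fu));
      s += static_cast<unsigned char>(0x80u | ((cp >> 6) & 0x3Fu));
      s += static_cast<unsigned char>(0x80u | (cp & 0x3Fu));
   }
}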

> > 4) define low level access to the core Unicode data properties (in
> > unidata.txt).
>
> Reuse of the ICU library would probably be very helpful in this.
>
> > 5) Begin to add locale support - a big job, probably a few facets at a
> > time.
>
> The issue is that, despite what you say, most or all of the standard
> library facets are not suitable for use with Unicode strings. For
> instance, the character classification and toupper-like operations need
> not be tied to a locale.

Accepted, ctype operations are largely (though not completely) independent of
the locale; that just makes the ctype specialisations easier IMO.

> Furthermore, many of the operations such as
> toupper on a single character are not well defined, and rather must be
> defined as a string to string mapping.

I know; however, 1-to-1 approximations are available (those in UnicodeData.txt).
I'm not saying that the std locale facets should be the only interface, or
even the primary one, but providing it does get a lot of other stuff
working.
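
To make that concrete: a 1-to-1 case mapping can be nothing more than a
table lookup built from the simple uppercase mapping field of the data
file (a sketch - the table and however it gets filled are hypothetical):

#include <map>
#include <boost/cstdint.hpp>

// Hypothetical table of simple (1-to-1) uppercase mappings, populated
// from UnicodeData.txt at start-up or build time.
extern std::map<boost::uint32_t, boost::uint32_t> simple_uppercase_table;

// 1-to-1 toupper approximation; the full, locale-sensitive case mapping
// is 1-to-many and would need a string-to-string interface instead.
inline boost::uint32_t simple_toupper(boost::uint32_t cp)
{
   std::map<boost::uint32_t, boost::uint32_t>::const_iterator
      i = simple_uppercase_table.find(cp);
   return i == simple_uppercase_table.end() ? cp : i->second;
}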

> Finally, the single-character
> type must be a 32-bit integer, while the code unit type will probably
> not be (since UTF-32 as the internal representation would be
> inefficient).

True: with UTF-16 only the core Unicode subset (the BMP, i.e. no surrogate
pairs) would be supported by std::locale; this is the same situation as in
Java and JavaScript.

> Specific cases include collate<Ch>, which lacks an interface for
> configuring collation, such as which strength level to use, whether
> uppercase or lowercase letters should sort first, whether in French
> locales accents should be sorted right to left, and other such features.
> It is true that an additional, more powerful interface could be
> provided, but this would add complexity.

You can provide any constructor interface to the collate facet that you
want, for example to support a locale and a strength level one might use:

#include <locale>   // std::collate
#include <climits>  // INT_MAX

template <class charT>
class unicode_collate : public std::collate<charT>
{
public:
   // level: collation strength, e.g. 1 = primary (base letters only)
   unicode_collate(const char* name, int level = INT_MAX);
   /* details */
};

I'm assuming that we have a non-member function to create a locale object
that contains a set of Unicode facets:

std::locale create_unicode_locale(const char* name);

Usage to create a locale object with primary level collation would then be:

std::locale l(create_unicode_locale("en_GB"),
              new unicode_collate<wchar_t>("en_GB", 1));
mystream.imbue(l);
mystream << something;
// etc.
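
And since a std::locale object is itself usable as a string comparison
predicate (its operator() forwards to the installed collate facet), the
same locale plugs straight into the standard algorithms - a sketch, reusing
the hypothetical names above:

#include <algorithm>
#include <locale>
#include <string>
#include <vector>

void sort_names(std::vector<std::wstring>& names)
{
   std::locale l(create_unicode_locale("en_GB"),
                 new unicode_collate<wchar_t>("en_GB", 1));
   std::sort(names.begin(), names.end(), l);  // locale acts as the comparator
}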

> Additionally, it depends on
> basic_string<Ch> (note lack of char_traits specification), which is used
> as the return type of transform, when something representing a byte
> array might be more suitable.

You might have me on that one :-)

> Additionally, num_put, moneypunct and money_put all would allow only a
> single code unit in a number of cases, when a string of multiple code
> points would be suitable. In addition, those facets also depend on
> basic_string<Ch>.

I don't understand what the problem is there, please explain.

> > 6) define iterator adapters for various Unicode algorithms
> > (composition/decomposition/compression etc).
> > 7) Anything I've forgotten :-)
>
> A facility for Unicode substring matching, which would use the
> collation facilities, would be useful. This could be based on the ICU
> implementation.
>
> Additionally, a date formatting facility for Unicode would be useful.

std::time_get / std::time_put ? :-)
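
i.e. the machinery is already there: imbue the Unicode locale and go through
std::time_put. A sketch, assuming the create_unicode_locale function above:

#include <ctime>
#include <locale>
#include <sstream>
#include <string>

std::wstring format_date(const std::tm& when, const std::locale& loc)
{
   std::wostringstream os;
   os.imbue(loc);                   // e.g. create_unicode_locale("en_GB")
   const wchar_t fmt[] = L"%x";     // locale's preferred date representation
   std::use_facet<std::time_put<wchar_t> >(os.getloc())
      .put(os, os, os.fill(), &when, fmt, fmt + 2);
   return os.str();
}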

John.
