Subject: Re: [boost] [rfc] Unicode GSoC project
From: Eric Niebler (eric_at_[hidden])
Date: 2009-05-14 13:09:20


Mathias Gaunard wrote:
> Hi everyone. I'm in charge of the Unicode Google Summer of Code project.
>
> I have been working on range adaptors to iterate over code points in an
> UTF-x string as well as converting back those code points to UTF-y for
> the past week and

That's good, these are needed. Also needed are tables that store the
various character properties, and (hopefully) some parsers that build
the tables directly from the Unicode character database so we can easily
rev it whenever the database changes.
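
A minimal sketch of such a table builder, assuming the input is UnicodeData.txt from the Unicode character database (semicolon-separated fields: field 0 is the code point in hex, field 2 the general category, field 3 the canonical combining class). The names char_properties_sketch, property_table_sketch, split_fields and add_line are hypothetical, and the First/Last range entries in the real file are not handled here.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: keeps only two of the UnicodeData.txt fields.
struct char_properties_sketch {
    std::string general_category;  // field 2, e.g. "Lu" or "Mn"
    int combining_class;           // field 3, canonical combining class
};

typedef std::map<unsigned long, char_properties_sketch> property_table_sketch;

// Split one line on ';'. A trailing empty field is dropped, which is
// harmless here because only the first four fields are used.
std::vector<std::string> split_fields(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, ';')) fields.push_back(field);
    return fields;
}

// Parse one line into the table. Returns false on malformed input.
bool add_line(const std::string& line, property_table_sketch& table) {
    const std::vector<std::string> f = split_fields(line);
    if (f.size() < 4 || f[0].empty() || f[3].empty()) return false;

    char* end = 0;
    const unsigned long cp = std::strtoul(f[0].c_str(), &end, 16);
    if (*end != '\0') return false;  // code point was not pure hex
    const long ccc = std::strtol(f[3].c_str(), &end, 10);
    if (*end != '\0') return false;  // combining class was not decimal

    char_properties_sketch props;
    props.general_category = f[2];
    props.combining_class = static_cast<int>(ccc);
    table[cp] = props;
    return true;
}

// Usage: one line in the UnicodeData.txt layout.
void property_table_example() {
    property_table_sketch table;
    const bool ok = add_line(
        "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;",
        table);
    assert(ok);
    assert(table[0x0301].general_category == "Mn");
    assert(table[0x0301].combining_class == 230);
}
```

Regenerating the table when the database changes is then a matter of feeding the new file through the same parser.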

> I stopped working on these for a bit to put together some short
> documentation (which is my first quickbook document, so it may not be
> very pretty).
> This is not a documentation of the final work, but rather that of what
> I'm working on at the moment.
>
> I would like to know everyone's opinion of the concepts I am defining,
> which assume the range that is being worked on is indeed a valid unicode
> range in a particular encoding, as well as the system used to enforce
> those concepts.
>
> Also, I put the normalization form C as part of the invariant

The invariant of what? The internal data over which the iterators
traverse? Which iterators? All of them? Are you really talking about an
invariant (something that is true of the data both before and after each
operation completes), or of pre- or post-conditions?

> , but maybe
> that should be something orthogonal. I personally don't think it's
> really useful for general-purpose text though.

I should hope there is a way to operate on valid Unicode ranges that
happen not to be in normalization form C.

> While the system doesn't provide conversion from other character sets,
> this can easily be added by using assume_utf32. For example, using an
> ISO-8859-1 string as input to assume_utf32 just works, since ISO-8859-1
> is included verbatim into Unicode.

I personally haven't taken the time to learn how ICU handles Unicode
input and character set conversions. It might be illustrative to see how
an established and respected Unicode library handles issues like this.
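
For reference, the quoted ISO-8859-1 claim rests on the fact that the 256 ISO-8859-1 byte values map one-to-one onto code points U+0000..U+00FF. A minimal sketch of that widening, independent of the assume_utf32 adaptor (whose interface is not shown in this message); the function name is hypothetical:

```cpp
#include <cassert>
#include <string>

// Widen ISO-8859-1 bytes to Unicode code points.
// The cast through unsigned char matters: plain char may be signed,
// and a direct conversion of a negative char would not give 0x80..0xFF.
std::u32string latin1_to_code_points_sketch(const std::string& bytes) {
    std::u32string out;
    out.reserve(bytes.size());
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
        out.push_back(
            static_cast<char32_t>(static_cast<unsigned char>(bytes[i])));
    return out;
}

// Usage: "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1.
void latin1_example() {
    const std::u32string cps = latin1_to_code_points_sketch("caf\xE9");
    assert(cps.size() == 4);
    assert(cps[3] == 0x00E9);  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
}
```

This only covers ISO-8859-1. Other character sets need real mapping tables.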

> The documentation contains as well some introductory Unicode material.
>
> You can find the documentation online here:
> http://mathias.gaunard.emi.u-bordeaux1.fr/unicode/doc/html/

Thanks for posting this. Some comments.

<<Core Types>>

The library provides the following core types in the boost namespace:

uchar8_t
uchar16_t
uchar32_t

In C++0x, these are called char, char16_t and char32_t. I think uchar8_t
is unnecessary, and for a Boost Unicode library, boost::char16 and
boost::char32 would work just fine. On a C++0x compiler, they should be
typedefs for char16_t and char32_t.
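
A rough sketch of what I mean, under the assumption that a plain __cplusplus check is good enough to detect the new character types (real code would more likely use a Boost.Config macro). The namespace unicode_sketch and the fallback types are placeholders, not a proposal for the actual names:

```cpp
#include <cassert>
#include <climits>

namespace unicode_sketch {

#if __cplusplus >= 201103L
// C++0x/C++11 compiler: reuse the built-in character types.
typedef char16_t char16;
typedef char32_t char32;
#else
// Older compiler: fall back to unsigned integers that the standard
// guarantees are at least 16 and 32 bits wide respectively.
typedef unsigned short char16;
typedef unsigned long char32;
#endif

// Sanity check that works with either branch above.
inline void check_code_unit_widths() {
    assert(sizeof(char16) * CHAR_BIT >= 16);
    assert(sizeof(char32) * CHAR_BIT >= 32);
}

}  // namespace unicode_sketch
```

Note there is no 8-bit typedef: UTF-8 code units fit in plain char.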

<<Concepts>>

I strongly disagree with requiring normalization form C for the concept
UnicodeRange. There are many more valid Unicode sequences.

And the UnicodeGrapheme concept doesn't make sense to me. You say, "A model
of UnicodeGrapheme is a range of Unicode code points that is a single
grapheme cluster in Normalized Form C." A grapheme cluster != Unicode
code point. It may be many code points representing a base character and
many zero-width combining characters. So what exactly is being traversed
by a UnicodeGrapheme range?
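
To make the distinction concrete, here is a small sketch. It is a gross simplification of the real segmentation rules: it treats only the Combining Diacritical Marks block (U+0300..U+036F) as extending a cluster. All names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplification: only U+0300..U+036F counts as a combining mark here.
// Real grapheme cluster rules cover far more than this one block.
inline bool is_combining_mark_sketch(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count clusters: a code point starts a new cluster unless it is a
// combining mark that follows some earlier code point.
std::size_t count_clusters_sketch(const std::u32string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (i == 0 || !is_combining_mark_sketch(s[i])) ++n;
    return n;
}

void grapheme_example() {
    const std::u32string composed = U"\u00E9";     // e-acute, precomposed
    const std::u32string decomposed = U"e\u0301";  // 'e' + combining acute

    // Both are valid Unicode; only the first is in normalization form C.
    assert(composed.size() == 1);
    assert(decomposed.size() == 2);

    // Two code points, but still a single grapheme cluster.
    assert(count_clusters_sketch(composed) == 1);
    assert(count_clusters_sketch(decomposed) == 1);
}
```

The decomposed string is also an example of a valid Unicode sequence that a UnicodeRange requiring form C would reject.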

The concepts are of critical importance, and these don't seem right to
me. My C++0x concept-foo is weak, and I'd like to involve many more
people in this discussion.

The purpose of the concepts is to allow algorithms to be implemented
generically in terms of the operations provided by the concepts. So,
what algorithms do we need, and how can we express them generically in
terms of concepts? Without that most critical step, we'll get the
concepts all wrong.

I imagine we'll want algorithms for converting from one encoding to
another, or from one normalization form (or, more likely, from no
normalization form) to another, so we'll need to constrain the
algorithms to specific encodings and/or normalization forms. We'll also
need a concept that represents Unicode input that hasn't yet been
normalized (perhaps in each of the encodings?). Point is, the concrete
algorithms must come first. We may end up back with a single perfectly
general UnicodeRange that all algorithms can be implemented in terms of.
That'd be nice, but I bet we end up with refinements for the different
encodings/normalized forms that make it possible to implement some
algorithms much more efficiently.
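
As one concrete starting point, here is a sketch of an encoding conversion written against iterators. The name utf32_to_utf16_sketch is hypothetical. The assumption that the input contains only valid Unicode scalar values (no surrogates, nothing above U+10FFFF) is exactly the kind of requirement a concept would have to state; this sketch does not check it.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Encode a sequence of code points as UTF-16 code units.
// Precondition (unchecked): every input value is a valid scalar value.
template <typename InputIt, typename OutputIt>
OutputIt utf32_to_utf16_sketch(InputIt first, InputIt last, OutputIt out) {
    for (; first != last; ++first) {
        char32_t cp = *first;
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one code unit.
            *out++ = static_cast<char16_t>(cp);
        } else {
            // Supplementary plane: split the 20 remaining bits
            // into a high and a low surrogate.
            cp -= 0x10000;
            *out++ = static_cast<char16_t>(0xD800 + (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

void conversion_example() {
    const std::u32string in = U"A\U00010400";
    std::u16string out;
    utf32_to_utf16_sketch(in.begin(), in.end(), std::back_inserter(out));
    assert(out.size() == 3);
    assert(out[0] == 0x0041);
    assert(out[1] == 0xD801);  // high surrogate for U+10400
    assert(out[2] == 0xDC00);  // low surrogate for U+10400
}
```

An algorithm like this needs to know the encoding but not the normalization form, which is one reason I doubt a single concept carrying both will be the right factoring.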

(I stopped reading the docs at this point.)

-- 
Eric Niebler
BoostPro Computing
http://www.boostpro.com
