
Subject: Re: [boost] [rfc] Unicode GSoC project
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2009-05-14 23:11:11


Eric Niebler wrote:
> Mathias Gaunard wrote:

>> I have been working on range adaptors to iterate over code points in
>> an UTF-x string as well as converting back those code points to UTF-y
>> for the past week and
>
> That's good, these are needed. Also needed are tables that store the
> various character properties, and (hopefully) some parsers that build
> the tables directly from the Unicode character database so we can easily
> rev it whenever the database changes.

I will tackle the tables as soon as iteration over code values, code
points and grapheme clusters is finished (that iteration will use the
tables, but I want the mechanisms defined first) and once there is an
accepted design for how to represent Unicode data and interact with it.
Hopefully that should not take too long.

> The invariant of what? The internal data over which the iterators
> traverse? Which iterators? All of them? Are you really talking about an
> invariant (something that is true of the data both before an after each
> operation completes), or of pre- or post-conditions?

The invariant satisfied by a model of UnicodeRange.
If R is a model of UnicodeRange, any instance r of R shall guarantee
that [begin(r), end(r)) is a valid, normalized Unicode string properly
encoded in UTF-x.

The invariant must be satisfied at any time.

Functions taking models of UnicodeRange as input can assume the
invariant as a precondition on their input and should maintain it as a
postcondition on that input as well.
Functions producing models of UnicodeRange as output should establish
it as a postcondition on their output.
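
To make the contract concrete, here is a rough sketch of how an
algorithm could document it and check it in debug builds;
is_valid_normalized is only a hypothetical helper for illustration, not
an actual interface:

    #include <cassert>

    // Hypothetical predicate (not an existing interface): checks that a range
    // is well-formed UTF-x and in the expected normalization form.
    template<typename Range>
    bool is_valid_normalized(Range const& r);

    // Sketch of an algorithm documenting the invariant as pre- and postcondition.
    template<typename UnicodeRange>
    void transform_text(UnicodeRange& r)
    {
        assert(is_valid_normalized(r));   // precondition: input is valid and normalized

        // ... perform the transformation on r ...

        assert(is_valid_normalized(r));   // postcondition: still valid and normalized
    }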

> , but maybe
>> that should be something orthogonal. I personally don't think it's
>> really useful for general-purpose text though.
>
> I should hope there is a way to operate on valid Unicode ranges that
> happen not to be in normalization form C.

A way to operate on such data would be to normalize it beforehand. No
information is supposed to be lost by normalizing to form C.
Substring search, for example, requires the two strings being compared
(or at least the relevant parts) to be normalized, so that canonically
equivalent sequences compare as equal.
We can either do that behind the user's back or arrange things so that
we don't need to; the latter keeps things simple.
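
As a concrete illustration (not library code): "é" can be spelled
either as the precomposed code point U+00E9 or as 'e' followed by the
combining acute accent U+0301, and a raw code-point search only finds
the needle if both sides use the same form, which is exactly what
normalizing to form C guarantees:

    #include <algorithm>
    #include <iostream>
    #include <string>

    int main()
    {
        // Two canonically equivalent spellings of "é":
        std::u32string precomposed(1, char32_t(0x00E9));   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        std::u32string decomposed;
        decomposed += char32_t('e');
        decomposed += char32_t(0x0301);                     // U+0301 COMBINING ACUTE ACCENT

        std::u32string haystack = U"caf" + decomposed;      // "café", decomposed form

        // A raw code-point search does not see the canonical equivalence:
        bool found = std::search(haystack.begin(), haystack.end(),
                                 precomposed.begin(), precomposed.end())
                     != haystack.end();
        std::cout << std::boolalpha << found << '\n';       // prints false

        // Normalizing both sides to form C beforehand makes them compare equal.
        return 0;
    }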

Of course, being normalization-form-agnostic and making the
normalization form a separate concept, so that the best possible
algorithm can be selected (though I may not have the time to write all
the versions), is more powerful because it makes no concessions.
I just want to know whether it's really worth the added complication.

>
> The library provides the following core types in the boost namespace:
>
> uchar8_t
> uchar16_t
> uchar32_t
>
> In C++0x, these are called char, char16_t and char32_t. I think uchar8_t
> is unnecessary, and for a Boost Unicode library, boost::char16 and
> boost::char32 would work just fine. On a C++0x compiler, they should be
> typedefs for char16_t and char32_t.

The character types not being unsigned could lead to issues during
promotions or conversions.
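
For example, on a platform where plain char is signed, a UTF-8 lead
byte stored in a char becomes negative after integral promotion, which
makes comparisons against code-unit values error-prone:

    #include <iostream>

    int main()
    {
        char lead = '\xC3';        // first byte of UTF-8 "é" (0xC3 0xA9)

        // Where plain char is signed, integral promotion gives a negative value:
        int promoted = lead;       // typically -61 instead of 195
        std::cout << promoted << '\n';

        // Comparisons against unsigned code-unit values then go wrong:
        std::cout << std::boolalpha << (lead >= 0x80) << '\n';     // false

        // An unsigned code-unit type avoids the surprise:
        unsigned char ulead = static_cast<unsigned char>(lead);
        std::cout << std::boolalpha << (ulead >= 0x80) << '\n';    // true
        return 0;
    }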

I also personally think "char" is better suited to mean
"locale-specific character" than "utf-8 character", so I thought a
distinct name for the type was more appropriate.

Anyway, embracing the standard way is what should be done, I agree. I'll
just have to make sure I'm careful of conversions.

Does Boost have macros to detect these yet?
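
For illustration, the typedefs could look roughly like this;
BOOST_NO_CHAR16_T/BOOST_NO_CHAR32_T are just placeholder names for
whatever detection macros Boost.Config provides:

    #include <boost/cstdint.hpp>

    namespace boost
    {
    #if defined(BOOST_NO_CHAR16_T)       // placeholder macro name, for illustration
        typedef uint16_t char16;         // fallback where char16_t is unavailable
    #else
        typedef char16_t char16;
    #endif

    #if defined(BOOST_NO_CHAR32_T)       // placeholder macro name, for illustration
        typedef uint32_t char32;         // fallback where char32_t is unavailable
    #else
        typedef char32_t char32;
    #endif
    }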

> And UnicodeGrapheme concept doesn't make sense to me. You say, "A model
> of UnicodeGrapheme is a range of Unicode code points that is a single
> grapheme cluster in Normalized Form C." A grapheme cluster != Unicode
> code point. It may be many code points representing a base character and
> many zero-width combining characters. So what exactly is being traversed
> by a UnicodeGrapheme range?

A UnicodeGrapheme is one grapheme cluster, i.e. a range of code points.

Basically, to iterate over grapheme clusters, a range of code points
would be adapted into a range of ranges of code points.
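
Roughly like this (an eager sketch for clarity; the real adaptor would
be lazy, and is_grapheme_break is a naive placeholder for the
table-driven rules of UAX #29):

    #include <utility>
    #include <vector>

    // Naive placeholder for the real, table-driven break rules (UAX #29):
    // break everywhere except before a combining diacritical mark.
    bool is_grapheme_break(char32_t /*before*/, char32_t after)
    {
        return !(after >= 0x0300 && after <= 0x036F);
    }

    // Eager sketch: group a sequence of code points into (begin, end) subranges,
    // one per grapheme cluster. The real adaptor would be a lazy range of ranges.
    template<typename Iterator>
    std::vector<std::pair<Iterator, Iterator> >
    grapheme_clusters(Iterator first, Iterator last)
    {
        std::vector<std::pair<Iterator, Iterator> > clusters;
        while (first != last)
        {
            Iterator start = first;
            char32_t prev = *first;
            ++first;
            while (first != last && !is_grapheme_break(prev, *first))
            {
                prev = *first;
                ++first;
            }
            clusters.push_back(std::make_pair(start, first));
        }
        return clusters;
    }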

> The concepts are of critical importance, and these don't seem right to
> me. My C++0x concept-foo is weak, and I'd like to involve many more
> people in this discussion.

In C++0x concept-foo, I would associate the validity invariant with a
non-auto concept (i.e. the model has to state explicitly that it
implements the concept, unlike auto concepts, which are implicitly
matched structurally to models), so as to distinguish ranges that aim
at maintaining that invariant from ranges that don't.
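
Since C++0x concepts cannot be used today, here is a rough analogy of
that auto/non-auto distinction using an explicit opt-in trait: a type
only counts as a UnicodeRange if its author specializes the trait and
thereby promises the invariant, rather than being matched structurally.
The names are illustrative only:

    // Explicit opt-in, analogous to a non-auto concept: the default says "no",
    // and a type's author must specialize the trait to promise the invariant.
    template<typename R>
    struct is_unicode_range
    {
        static const bool value = false;
    };

    struct my_utf8_string { /* maintains the UnicodeRange invariant */ };

    // Comparable in spirit to writing a concept_map for a non-auto concept:
    template<>
    struct is_unicode_range<my_utf8_string>
    {
        static const bool value = true;
    };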

> The purpose of the concepts are to allow algorithms to be implemented
> generically in terms of the operations provided by the concepts. So,
> what algorithms do we need, and how can we express them generically in
> terms of concepts? Without that most critical step, we'll get the
> concepts all wrong.

Concepts are not just about operations, but also about semantics.
A single-pass range and a forward range provide the same operations,
but they have different semantics: a single-pass range can be traversed
only once, while a forward range, which refines it, can be traversed
any number of times.

Refining the range concepts to create a new concept meaning a range that
satisfies a given predicate doesn't seem that different in spirit.

The problem is that it is not possible to enforce that predicate
programmatically with range adaptors, so we have to fall back to design
by contract.

Now, if modelling invariants as a concept is believed to be a bad idea,
I will simply remove the concept and deal with raw ranges directly,
without any concept (i.e. invariant) checking.

