
Subject: Re: [boost] [rfc] Unicode GSoC project
From: Eric Niebler (eric_at_[hidden])
Date: 2009-05-15 12:09:22


Mathias Gaunard wrote:
> Eric Niebler wrote:
>> The invariant of what? The internal data over which the iterators
>> traverse? Which iterators? All of them? Are you really talking about
>> an invariant (something that is true of the data both before and after
>> each operation completes), or of pre- or post-conditions?
>
> The invariant satisfied by a model of UnicodeRange.
> If R is a model of UnicodeRange, any instance r of R shall satisfy the
> condition that begin(r)...end(r) is a valid, normalized Unicode string
> properly encoded in UTF-x.
>
> The invariant must be satisfied at any time.

So a range that lazily verifies well-formedness while it is being
traversed cannot satisfy the UnicodeRange concept? I think if we were to
ever have a Unicode string type, it would make sense to define a class
invariant. For the Unicode concepts (which have not yet been specified),
each operation in the concept may specify pre- and post-conditions, in
addition to complexity guarantees and semantics.
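
To make the distinction concrete, here is a rough, purely illustrative
sketch of the "verify lazily while traversing" approach: a little UTF-8
decoder that checks well-formedness of each sequence only when it is
reached, instead of requiring the whole range to satisfy a validity
invariant up front. The names (code_point, u8_next) are invented for the
example and the error handling is deliberately crude:

  #include <stdexcept>
  #include <string>
  #include <iostream>

  typedef unsigned int code_point;

  // Decode one code point from [it, end), advancing it; throw on ill-formed
  // input. Overlong forms, surrogates and out-of-range values are rejected.
  code_point u8_next(std::string::const_iterator &it,
                     std::string::const_iterator end)
  {
      unsigned char b0 = static_cast<unsigned char>(*it++);
      if (b0 < 0x80)
          return b0;

      int extra;       // number of continuation bytes
      code_point cp;
      if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
      else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
      else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
      else throw std::runtime_error("ill-formed UTF-8 lead byte");

      for (int i = 0; i < extra; ++i)
      {
          if (it == end || (static_cast<unsigned char>(*it) & 0xC0) != 0x80)
              throw std::runtime_error("truncated or ill-formed UTF-8");
          cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
      }

      static const code_point min_for_len[] = { 0, 0x80, 0x800, 0x10000 };
      if (cp < min_for_len[extra] || cp > 0x10FFFF
          || (cp >= 0xD800 && cp <= 0xDFFF))
          throw std::runtime_error("overlong, surrogate or out-of-range");
      return cp;
  }

  int main()
  {
      std::string s = "caf\xC3\xA9";   // "café" encoded as UTF-8
      for (std::string::const_iterator it = s.begin(); it != s.end(); )
          std::cout << std::hex << u8_next(it, s.end()) << ' ';  // 63 61 66 e9
      std::cout << '\n';
  }

A range built on top of such a decoder is perfectly usable, yet it cannot
promise the invariant above at all times, which is exactly the tension here.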

> Functions taking models of UnicodeRange as input can assume the
> invariant as a precondition on their input and should preserve it as a
> postcondition on that input too.
> Functions providing models of UnicodeRange as output should have it as a
> postcondition on their output.
>
>
>> , but maybe
>>> that should be something orthogonal. I personally don't think it's
>>> really useful for general-purpose text though.
>>
>> I should hope there is a way to operate on valid Unicode ranges that
>> happen not to be in normalization form C.
>
> A way to operate on such data would be to normalize it beforehand. No
> information is supposed to be lost by normalizing to form C.

That's one way, yes. Another way would be to provide an adaptor that
normalizes on the fly, if such a thing could be done. There may still be
other ways.
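
For illustration only, an adaptor of roughly this shape could compose
canonical pairs as the underlying sequence is traversed. To stay
self-contained, the toy iterator below only knows about the single pair
U+0065 U+0301 -> U+00E9; a real NFC adaptor would consult the full Unicode
composition data, and the class name is made up for the example:

  #include <vector>
  #include <iostream>

  typedef unsigned int code_point;   // stand-in for char32_t / uchar32_t

  // Toy iterator-like adaptor: composes e + combining acute on the fly
  // while the underlying range is traversed.
  template <typename Iterator>
  struct compose_e_acute_iterator
  {
      Iterator cur, last;

      compose_e_acute_iterator(Iterator f, Iterator l) : cur(f), last(l) {}

      code_point operator*() const
      {
          Iterator next = cur;
          ++next;
          if (*cur == 0x0065 && next != last && *next == 0x0301)
              return 0x00E9;           // compose on the fly
          return *cur;
      }

      compose_e_acute_iterator &operator++()
      {
          Iterator next = cur;
          ++next;
          if (*cur == 0x0065 && next != last && *next == 0x0301)
              cur = next;              // skip the combining mark just composed
          ++cur;
          return *this;
      }

      bool operator!=(compose_e_acute_iterator const &other) const
      { return cur != other.cur; }
  };

  int main()
  {
      // Decomposed "café": c a f e + U+0301
      code_point const raw[] = { 0x0063, 0x0061, 0x0066, 0x0065, 0x0301 };
      std::vector<code_point> in(raw, raw + 5);

      compose_e_acute_iterator<std::vector<code_point>::const_iterator>
          first(in.begin(), in.end()), last(in.end(), in.end());

      for (; first != last; ++first)
          std::cout << std::hex << *first << ' ';   // prints: 63 61 66 e9
      std::cout << '\n';
  }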

> Substring search for example requires the two strings being compared to
> be normalized, or at least relevant parts, so that canonically
> equivalent things can compare as being equal.

Or else Unicode comparison algorithms can be written generally so as not
to require both sequences to be normalized first. Such an algorithm could
also dispatch to a more efficient implementation if both ranges are known
to be in the same normalization form.
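
One way to express that dispatch, sketched here with invented names: a
trait records whether a range type's invariant guarantees NFC, and the
canonical comparison falls through to a plain code-point comparison when
both sides carry that guarantee, otherwise normalizing first (using the
same toy single-pair composition as above, just to keep the example
compilable):

  #include <cassert>
  #include <cstddef>
  #include <vector>

  typedef unsigned int code_point;
  typedef std::vector<code_point> cp_string;

  // Primary trait: nothing is known about the normalization form.
  template <typename Range>
  struct is_known_nfc { static const bool value = false; };

  // Hypothetical wrapper whose class invariant says "already in NFC".
  struct nfc_string { cp_string cps; };
  template <>
  struct is_known_nfc<nfc_string> { static const bool value = true; };

  inline cp_string const &code_points(cp_string const &s)  { return s; }
  inline cp_string const &code_points(nfc_string const &s) { return s.cps; }

  // Toy normalization: composes only U+0065 U+0301 -> U+00E9.
  inline cp_string toy_nfc(cp_string const &in)
  {
      cp_string out;
      for (std::size_t i = 0; i < in.size(); ++i)
      {
          if (in[i] == 0x0065 && i + 1 < in.size() && in[i + 1] == 0x0301)
          { out.push_back(0x00E9); ++i; }
          else
              out.push_back(in[i]);
      }
      return out;
  }

  template <bool KnownNfc> struct nfc_tag {};

  // Fast path: both sides guaranteed NFC, so canonical equivalence
  // reduces to plain code-point equality.
  template <typename R1, typename R2>
  bool canonical_equal_impl(R1 const &a, R2 const &b, nfc_tag<true>)
  { return code_points(a) == code_points(b); }

  // General path: normalize first (toy version), then compare.
  template <typename R1, typename R2>
  bool canonical_equal_impl(R1 const &a, R2 const &b, nfc_tag<false>)
  { return toy_nfc(code_points(a)) == toy_nfc(code_points(b)); }

  template <typename R1, typename R2>
  bool canonical_equal(R1 const &a, R2 const &b)
  {
      return canonical_equal_impl(
          a, b, nfc_tag<is_known_nfc<R1>::value && is_known_nfc<R2>::value>());
  }

  int main()
  {
      code_point const raw[] = { 0x0063, 0x0061, 0x0066, 0x0065, 0x0301 };
      cp_string decomposed(raw, raw + 5);         // decomposed "café"

      code_point const cmp[] = { 0x0063, 0x0061, 0x0066, 0x00E9 };
      nfc_string composed;
      composed.cps.assign(cmp, cmp + 4);          // precomposed "café"

      assert(canonical_equal(decomposed, composed)); // general path
      assert(canonical_equal(composed, composed));   // fast path
  }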

> We can choose to do that behind the user's back or rather to make it so
> that we don't need to; the latter allows us to keep things simple.

Are we writing a handful of one-off routines that manipulate /some/
Unicode data in /some/ predefined ways, or are we trying to build the
foundation upon which a full, generic Unicode library can be based? I'm
hoping for the latter. That's a much bigger job. It's ok if we do the
former, as long as we see it for what it is: the non-generic kernel from
which we can later distill generic algorithms and concepts.

> Of course, being normalization-form-agnostic and making it a separate
> concept, allowing the best possible algorithm to be selected (I may not
> have the time to write all versions however), is more powerful because
> it doesn't make any concessions.

Right.

> I just want to know whether it's really worth it to complicate this.

For GSoC, we may decide not. The first step of the generic programming
process is to write a big pile of non-generic code and start sifting
through it to find the concepts. It's possible that your GSoC project
produces the non-generic code. But it will not be a Boost.Unicode
library until the generic Unicode algorithms have been distilled and the
concepts defined.

>> The library provides the following core types in the boost namespace:
>>
>> uchar8_t
>> uchar16_t
>> uchar32_t
>>
>> In C++0x, these are called char, char16_t and char32_t. I think
>> uchar8_t is unnecessary, and for a Boost Unicode library,
>> boost::char16 and boost::char32 would work just fine. On a C++0x
>> compiler, they should be typedefs for char16_t and char32_t.
>
> The character types not being unsigned could lead to issues during
> promotions or conversions.
>
> I also personally think "char" is better to mean "locale-specific
> character" than "utf-8 character", so I thought a distinct name for the
> type was more appropriate.
>
> Anyway, embracing the standard way is what should be done, I agree. I'll
> just have to make sure I'm careful with conversions.

Good.

> Does Boost have macros to detect these yet?

BOOST_NO_CHAR16_T
BOOST_NO_CHAR32_T
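
With those, the typedefs could take roughly this shape. boost::char16 and
boost::char32 here are just the names proposed in this thread, not an
existing Boost API, and falling back to fixed-width unsigned integers is
only one possible choice (note that a plain integer typedef is not a
distinct type the way the built-in char16_t/char32_t are, which matters
for overloading):

  #include <boost/config.hpp>
  #include <boost/cstdint.hpp>

  namespace boost
  {
  #ifdef BOOST_NO_CHAR16_T
      typedef uint16_t char16;    // no native char16_t: 16-bit unsigned fallback
  #else
      typedef char16_t char16;
  #endif

  #ifdef BOOST_NO_CHAR32_T
      typedef uint32_t char32;    // no native char32_t: 32-bit unsigned fallback
  #else
      typedef char32_t char32;
  #endif
  }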

>> And the UnicodeGrapheme concept doesn't make sense to me. You say, "A
>> model of UnicodeGrapheme is a range of Unicode code points that is a
>> single grapheme cluster in Normalized Form C." A grapheme cluster !=
>> Unicode code point. It may be many code points representing a base
>> character and many zero-width combining characters. So what exactly is
>> being traversed by a UnicodeGrapheme range?
>
> A UnicodeGrapheme is one grapheme cluster, i.e. a range of code points.
>
> Basically, to iterate over grapheme clusters, a range of code points
> would be adapted into a range of ranges of code points.

OK, I see. It's a range of code points that demarcates a single, whole
grapheme cluster. This could be useful.
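
In other words, something of this shape, where each cluster is exposed as a
sub-range of the underlying code-point sequence. The boundary rule below is
deliberately naive (a cluster is a code point plus any immediately
following characters from the combining-diacritics block) and the names are
invented; a real implementation would follow the UAX #29 rules:

  #include <vector>
  #include <utility>
  #include <iostream>

  typedef unsigned int code_point;
  typedef std::vector<code_point>::const_iterator cp_iterator;
  typedef std::pair<cp_iterator, cp_iterator> grapheme;  // one cluster = sub-range

  inline bool is_combining(code_point c) { return c >= 0x0300 && c <= 0x036F; }

  // Adapt a sequence of code points into a sequence of grapheme clusters,
  // each represented as a [first, last) sub-range of the input.
  std::vector<grapheme> graphemes(std::vector<code_point> const &text)
  {
      std::vector<grapheme> out;
      cp_iterator it = text.begin();
      while (it != text.end())
      {
          cp_iterator first = it++;
          while (it != text.end() && is_combining(*it))
              ++it;                          // attach trailing combining marks
          out.push_back(grapheme(first, it));
      }
      return out;
  }

  int main()
  {
      // "e" + combining acute, then "x": two clusters, three code points.
      code_point const raw[] = { 0x0065, 0x0301, 0x0078 };
      std::vector<code_point> text(raw, raw + 3);

      std::vector<grapheme> g = graphemes(text);
      std::cout << g.size() << " clusters, "                       // 2
                << (g[0].second - g[0].first)
                << " code points in the first\n";                  // 2
  }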

>> The concepts are of critical importance, and these don't seem right to
>> me. My C++0x concept-foo is weak, and I'd like to involve many more
>> people in this discussion.
>
> In C++0x concept-foo, I would associate the validity invariant with
> a non-auto concept (i.e. the model has to specify it implements the
> concept, unlike auto concepts which are implicitly structurally matched
> to models), so as to distinguish ranges that aim at maintaining that
> invariant from ranges that don't.

Whoa, we're nowhere near talking about auto vs. non-auto C++0x concepts.
We don't even have Unicode algorithms to work from, so we don't know
what form the concepts will take. You're jumping to the end. I repeat:
you don't know what the concepts are and you won't know until you write
some concrete algorithms and try to lift generic implementations from them.

Please have a look at: http://www.generic-programming.org/

>> The purpose of the concepts is to allow algorithms to be implemented
>> generically in terms of the operations provided by the concepts. So,
>> what algorithms do we need, and how can we express them generically in
>> terms of concepts? Without that most critical step, we'll get the
>> concepts all wrong.
>
> Concepts are not just about operations, but also about semantics.
> A single pass range and a forward range provide the same operations, but
> they've got different semantics.
> A single pass range can be traversed only once; a forward range, which
> refines it, can be traversed any number of times.

Ahem. I know.

> Refining the range concepts to create a new concept meaning a range that
> satisfies a given predicate doesn't seem that different in spirit.
>
> The problem is that it is not possible to enforce that predicate
> programmatically with range adapters, so we have to fall back to design
> by contract.

Show me the (concrete) code.

> Now, if it is believed that modeling invariants as a concept is a bad
> idea, I will simply remove the concept and just deal with raw ranges
> directly, without any concept(=invariant) checking.

Now we're talking. Ditch the concepts. Write the algorithms. Then try to
make them generic and *find* the concepts.

-- 
Eric Niebler
BoostPro Computing
http://www.boostpro.com
