From: Erik Wien (wien_at_[hidden])
Date: 2004-10-19 18:11:48


> I am pretty sure you mean abstract character here, not code unit. My
> understanding of the Unicode terminology is that the decomposed version of
> ü consists of
>
> one abstract character (ü)
> two encoded characters (u, ¨)
> two UTF-32 code units (0x00000075 0x00000308)
> two UTF-16 code units (0x0075 0x0308)
> three UTF-8 code units (0x75 0xCC 0x88)
>
> but perhaps I have it backwards...

No, you are correct about that. I don't know what I was talking about. This
is another example of me talking before I think! ;) I think we agree on
this, but are just misunderstanding each other.
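
The code unit counts quoted above can be checked mechanically. The following is a minimal sketch, not part of any proposed library: the names code_point, append_utf8, to_utf8 and to_utf16 are made up for illustration, and the encoders follow the standard UTF-8 and UTF-16 bit layouts.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using code_point = std::uint32_t;  // hypothetical alias, one UTF-32 code unit

// Appends the UTF-8 code units for a single code point.
void append_utf8(code_point cp, std::vector<std::uint8_t>& out)
{
    // Surrogates and values above U+10FFFF are not encodable.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
}

// Encodes a whole code point sequence as UTF-8.
std::vector<std::uint8_t> to_utf8(const std::vector<code_point>& in)
{
    std::vector<std::uint8_t> out;
    for (code_point cp : in)
        append_utf8(cp, out);
    return out;
}

// Encodes a whole code point sequence as UTF-16 (surrogate pairs above U+FFFF).
std::vector<std::uint16_t> to_utf16(const std::vector<code_point>& in)
{
    std::vector<std::uint16_t> out;
    for (code_point cp : in) {
        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            const code_point v = cp - 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
        }
    }
    return out;
}
```

Feeding the decomposed sequence 0x0075 0x0308 through to_utf8 gives the three code units 0x75 0xCC 0x88, and to_utf16 gives two code units, matching the quoted list.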

Anyhoo... To answer this again: :)

> Again, taking this example, let's say that do_some_operation performs
> canonicalization to some Unicode canonical form; you can't do this by
> iterating over code points.

No, you can't do that with code point iterators, but I am pretty sure you
couldn't do it with an abstract character iterator either (or any kind of
iterator, for that matter). The process of canonicalization (I'm assuming you
are talking about canonical decomposition here) involves splitting one code
point into multiple code points where that is possible (ü would be split
into u and ¨, as you say). That means that do_some_operation would need to
insert code points into the string it is iterating over, something that
would take some "hacking" to do inside a normal iterator interface.
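
To make the insertion problem concrete, here is a minimal sketch of the usual way around it: read code points through an input range and write the (possibly longer) result through an output iterator, instead of modifying the string being iterated. Everything here is an assumption for illustration: the names code_point, decompose_one, decompose and decomposed_copy are hypothetical, and the two-entry mapping stands in for the real Unicode decomposition data. Recursive decomposition and canonical reordering of combining marks are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

using code_point = std::uint32_t;  // hypothetical alias

// Toy stand-in for the Unicode decomposition data: only U+00FC and U+00E9
// are mapped; every other code point is passed through unchanged.
template <typename OutputIt>
OutputIt decompose_one(code_point cp, OutputIt out)
{
    switch (cp) {
    case 0x00FC: *out++ = 0x0075; *out++ = 0x0308; break;  // ü -> u + combining diaeresis
    case 0x00E9: *out++ = 0x0065; *out++ = 0x0301; break;  // é -> e + combining acute
    default:     *out++ = cp; break;
    }
    return out;
}

// Reads code points from [first, last) and writes the decomposed sequence
// to out. The input is never modified, so no insertion through the input
// iterator is needed.
template <typename InputIt, typename OutputIt>
OutputIt decompose(InputIt first, InputIt last, OutputIt out)
{
    for (; first != last; ++first)
        out = decompose_one(*first, out);
    return out;
}

// Convenience wrapper that returns a decomposed copy of a sequence.
std::vector<code_point> decomposed_copy(const std::vector<code_point>& in)
{
    std::vector<code_point> result;
    result.reserve(in.size());
    decompose(in.begin(), in.end(), std::back_inserter(result));
    assert(result.size() >= in.size());  // this mapping never shortens the input
    return result;
}
```

With this shape, decomposed_copy of the sequence G, ü (0x0047 0x00FC) yields 0x0047 0x0075 0x0308. The cost is a second buffer, which is the price of keeping the iterator interface ordinary.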

Abstract character iterators are no better. The concept of an abstract
character is oblivious to the code point differences between the composed
and decomposed representations, and iterating over abstract characters (I'm
not sure how this would even be done) would not reveal the underlying
composition of code points needed for canonical decomposition to be performed.

Ultimately I feel that normalization (which involves canonical
decomposition) of Unicode strings should be hidden from the user
completely and be performed automatically by the library wherever it is
needed (for example, on a call to the == operator). I think that solution
would be satisfactory for most users, as the normalization process is
somewhat intricate and really not something users should be forced to
understand.
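
As a rough illustration of hiding normalization behind ==, here is a minimal sketch. The class name unicode_string, its interface and the helper toy_decompose are all hypothetical, and the decomposition covers only U+00FC; a real library would use the full Unicode data and a complete normalization form.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using code_point = std::uint32_t;  // hypothetical alias

// Toy decomposition covering only U+00FC (ü). Canonical reordering of
// combining marks is omitted.
inline std::vector<code_point> toy_decompose(const std::vector<code_point>& in)
{
    std::vector<code_point> out;
    for (code_point cp : in) {
        if (cp == 0x00FC) {  // ü -> u + combining diaeresis
            out.push_back(0x0075);
            out.push_back(0x0308);
        } else {
            out.push_back(cp);
        }
    }
    return out;
}

// Hypothetical string type that stores code points as given.
class unicode_string {
public:
    explicit unicode_string(std::vector<code_point> cps) : cps_(std::move(cps))
    {
        for (code_point cp : cps_)
            assert(cp <= 0x10FFFF);  // must lie inside the Unicode code space
    }

    // Equality compares the decomposed forms, so callers never have to
    // normalize by hand.
    friend bool operator==(const unicode_string& a, const unicode_string& b)
    {
        return toy_decompose(a.cps_) == toy_decompose(b.cps_);
    }

    friend bool operator!=(const unicode_string& a, const unicode_string& b)
    {
        return !(a == b);
    }

private:
    std::vector<code_point> cps_;
};
```

A string holding the precomposed ü (0x00FC) and one holding u followed by the combining diaeresis (0x0075 0x0308) then compare equal, without the caller knowing that any normalization took place. Normalizing on every comparison is the simplest choice for a sketch; normalizing once on construction would be another option.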

Are we at all on the same page now?

