From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-10-19 16:07:08

Next message: Beman Dawes: "Re: [boost] Any interest in adding unicode support to boost?"
Previous message: Edward Diener: "[boost] Re: Any interest in adding unicode support to boost?"
In reply to: Erik Wien: "[boost] Re: Any interest in adding unicode support to boost?"
Next in thread: Erik Wien: "[boost] Re: Any interest in adding unicode support to boost?"
Reply: Erik Wien: "[boost] Re: Any interest in adding unicode support to boost?"

In article <cl3nps$4d8$1_at_[hidden]>, "Erik Wien" <wien_at_[hidden]> wrote:

> Hi. Thanks for the feedback!

My pleasure :-)

> "Miro Jurisic" <macdev_at_[hidden]> wrote in message
> news:macdev-BACD3C.13585519102004_at_sea.gmane.org...
> > I generally agree with this design approach, but I don't think that code
> > point iterators alone are sufficient.
>
> Neither do I, as a matter of fact, but this is as far as I have come right
> now. :) There would probably be different types of iterators (or iterator
> wrappers) made available to enable iterations over everything from code units
> to code points/abstract characters.

Yes, I agree.

> > Iteration over encoded characters and abstract characters would be needed
> > for some algorithms to function sensibly. For example, the simple task of:
> >
> > find(begin, end, "ü")
> >
> > needs to use abstract characters in order to be able to find precomposed
> > and decomposed versions of ü.
> >
>
> True... And this is a point where implementation would be less than trivial.

Yeah, that's how far I got before I decided that I didn't have the time to deal
with the problem given my current schedule.
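To make the find() problem concrete, here is a minimal C++ sketch of why a code point search is not enough. It assumes text is already held as a sequence of code points. The names CodePoints, toy_decompose, contains_code_points and contains_normalized are made up for illustration, and toy_decompose knows about ü only; a real implementation would need the full Unicode decomposition data and would also have to check that a match ends on a character boundary.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint32_t> CodePoints;

// Hypothetical toy decomposition: handles only U+00FC (precomposed ü).
// Real code would consult the Unicode canonical decomposition tables.
CodePoints toy_decompose(const CodePoints& in) {
    CodePoints out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 0x00FC) {
            out.push_back(0x0075);  // u
            out.push_back(0x0308);  // combining diaeresis
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Plain code point search: compares code points one by one.
bool contains_code_points(const CodePoints& text, const CodePoints& needle) {
    return std::search(text.begin(), text.end(),
                       needle.begin(), needle.end()) != text.end();
}

// Search after bringing both sides to the same (toy) decomposed form.
bool contains_normalized(const CodePoints& text, const CodePoints& needle) {
    const CodePoints t = toy_decompose(text);
    const CodePoints n = toy_decompose(needle);
    return std::search(t.begin(), t.end(), n.begin(), n.end()) != t.end();
}
```

With a text that stores ü as u followed by the combining diaeresis and a needle that holds the precomposed U+00FC, contains_code_points reports no match while contains_normalized finds it.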

> > Again, taking this example, let's say that do_some_operation performs
> > canonicalization to some Unicode canonical form; you can't do this by
> > iterating over code points.
>
> Nope. A code unit iterator would be needed for things like that.

I am pretty sure you mean abstract character here, not code unit. My
understanding of the Unicode terminology is that the decomposed version of ü
consists of

one abstract character (ü)
two encoded characters (u, ¨)
two UTF-32 code units (0x00000075 0x00000308)
two UTF-16 code units (0x0075 0x0308)
three UTF-8 code units (0x75 0xCC 0x88)

but perhaps I have it backwards...
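The code unit counts above can be checked with a small C++ sketch. It only counts how many code units each encoding form needs per code point, following the standard UTF-8 and UTF-16 ranges. The names utf8_units, utf16_units, UnitCounts and count_units are made up for this example, and the input is assumed to contain valid code points only (no surrogates, nothing above 0x10FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of UTF-8 code units (bytes) needed for one code point.
std::size_t utf8_units(std::uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of UTF-16 code units needed for one code point
// (two for anything outside the Basic Multilingual Plane).
std::size_t utf16_units(std::uint32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

struct UnitCounts {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Total code units needed to encode a sequence of code points.
UnitCounts count_units(const std::vector<std::uint32_t>& code_points) {
    UnitCounts c = {0, 0, 0};
    for (std::size_t i = 0; i < code_points.size(); ++i) {
        c.utf8 += utf8_units(code_points[i]);
        c.utf16 += utf16_units(code_points[i]);
        c.utf32 += 1;  // UTF-32 always uses one code unit per code point
    }
    return c;
}
```

For the decomposed ü, that is the two code points 0x0075 and 0x0308, count_units gives 3 UTF-8 code units, 2 UTF-16 code units and 2 UTF-32 code units, matching the list above.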

> The implementation described here would not pose too much of a problem, I was
> thinking more of the problems that arise when you take things like collation
> and locales into consideration. From what I understand there is a real issue
> in enabling proper Unicode support in the standard classes like locale, ctype
> and collate, as they assume things that do not necessarily apply to a Unicode
> representation of text. A failure to enable good support in those classes
> (at least locale and ctype), would also make the iostream support break, and
> things start to snowball. I could very well be wrong on this (Actually, I
> hope I am! :) ), as I haven't had the time to read up on all issues
> concerning this. But again, this is one of many problems I hope running this
> project will help reveal.

I don't know enough about locales to comment on this, unfortunately.

meeroh
