From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-04-07 04:38:35


In article <c50ghn$9fd$1_at_[hidden]>, Vladimir Prus <ghost_at_[hidden]> wrote:

> >> I wonder what's the right abstraction then? Is it necessary to have a
> >> class to represent abstract character, with all composing characters?
> >
> > That's one way to go, yes; note that the moment you utter those words, you
> > put yourself into the position of designing a Unicode API :-) which you
> > said you don't want to do at this time.
>
> You almost caught me ;-) I've changed the message subject on purpose -- to
> indicate that I'm no longer talking about program_options.
> I'm interested in how a 'right' Unicode string can be implemented, but I'm
> not sure it's possible to design such a string now, so program_options
> will still have to use a much simpler approach.

I am somewhat reluctant to discuss this in detail at this time, not because I
have something to hide, but because I have something to learn: I need to
investigate some aspects of Unicode, the ICU library, and locales and facets in
the C++ standard before I can form a more complete picture of the design of a
Unicode string. However, I don't have the time to do all the research right now,
because there are other things I need to do that I am getting paid to do, and
full Unicode support is not on my work to-do list. Basically, I know enough to
know how _not_ to do it, but I am not sure that I know enough to know how to do
it right :-)

However, I currently think that there are legitimate reasons why one would want
to view a Unicode string as (in increasing order of complexity):

 - a sequence of code units (this is useful for serialization)
 - a sequence of encoded characters, i.e. code points (this is useful for
   transcoding)
 - a sequence of abstract characters (this is useful for most high-level string
   transformations, such as substrings, find, etc.)

Therefore I think that a Unicode string should probably not be represented as a
container of any one of those three, but instead should have an interface that
lets you treat it in different ways depending on your needs. (One way to do this
is to have three kinds of iterators for Unicode strings).
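
To make the three views concrete, here is a minimal sketch, not a proposed
design. It assumes C++11, assumes the string is stored as well-formed UTF-8,
and reads the three views as UTF-8 code units, decoded code points, and a base
character followed by its combining marks. All function names are hypothetical,
and is_combining_mark only knows the U+0300..U+036F block, where a real
implementation would consult the Unicode character database. For brevity it
uses three free functions returning vectors instead of three iterator types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// View 1: code units. For UTF-8 these are simply the bytes of the string.
std::vector<std::uint8_t> code_units(const std::string& utf8) {
    return std::vector<std::uint8_t>(utf8.begin(), utf8.end());
}

// View 2: code points, decoded from UTF-8 that is assumed to be well-formed.
std::vector<char32_t> code_points(const std::string& utf8) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        assert(i + len <= utf8.size());  // precondition: no truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Simplifying assumption: only the Combining Diacritical Marks block counts.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// View 3: each element is a base code point plus the marks that follow it.
std::vector<std::u32string> abstract_characters(const std::string& utf8) {
    std::vector<std::u32string> out;
    for (char32_t cp : code_points(utf8)) {
        if (!out.empty() && is_combining_mark(cp)) {
            out.back().push_back(cp);
        } else {
            out.push_back(std::u32string(1, cp));
        }
    }
    return out;
}
```

For the string "e", U+0301 (combining acute accent), "x", the three views have
different lengths: four code units, three code points, and two abstract
characters.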

Also, as I mentioned elsewhere, Unicode strings do not lend themselves to the
performance and iteration characteristics provided by std::string; in
particular, constant time random access is not going to work for two of those
three views of a string. I think that a Unicode string is much better matched to
characteristics of SGI's rope class, but I haven't had the time to research that
in detail.
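
As a small illustration of the random access problem, here is a sketch that
assumes the string is stored as well-formed UTF-8 in a std::string. The
function name code_point_offset is hypothetical. Finding where the n-th code
point starts means walking the bytes from the beginning, so the cost grows with
n instead of being constant.

```cpp
#include <cstddef>
#include <string>

// Returns the byte offset at which the n-th code point (0-based) starts, or
// utf8.size() if the string holds fewer than n + 1 code points.
// Assumes well-formed UTF-8; runs in time linear in the offset found.
std::size_t code_point_offset(const std::string& utf8, std::size_t n) {
    std::size_t seen = 0;  // number of code point starts passed so far
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        // Continuation bytes have the bit pattern 10xxxxxx and never start
        // a code point, so they are skipped.
        const bool is_continuation = (b & 0xC0) == 0x80;
        if (!is_continuation) {
            if (seen == n) {
                return i;
            }
            ++seen;
        }
    }
    return utf8.size();
}
```

Indexing by abstract character is at least as expensive, since it additionally
has to look at which code points combine with their predecessors.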

meeroh

-- 
If this message helped you, consider buying an item
from my wish list: <http://web.meeroh.org/wishlist>