Subject: Re: [boost] [rfc] Unicode GSoC project
From: Graham (Graham_at_[hidden])
Date: 2009-05-14 19:04:28
Dear Eric/Mathias,
>That's good, these are needed. Also needed are tables that store the
>various character properties, and (hopefully) some parsers that build
>the tables directly from the Unicode character database so we can
>easily rev it whenever the database changes.
A good reloadable character library is in the vault.
>And UnicodeGrapheme concept doesn't make sense to me. You say, "A
>model of UnicodeGrapheme is a range of Unicode code points that is a
>single grapheme cluster in Normalized Form C." A grapheme cluster !=
>Unicode code point. It may be many code points representing a base
>character and many zero-width combining characters. So what exactly is
>being traversed by a UnicodeGrapheme range?
>It is thus important to be able to apply algorithms with graphemes as
>the unit rather than code points to deal with graphemes not
>representable by a single code point.
I think that a grapheme is more of an iterator concept than a data type
concept. By specialising it you will unnecessarily complicate any
library. Don't forget that, for example, the current grapheme may start
as one character, then suddenly 'grab' the surrounding characters as it
makes a combined glyph.
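
To make the "iterator concept" point concrete, here is a minimal sketch that leaves the text as a plain sequence of code points and computes grapheme cluster boundaries as a view over it. The names is_extend and grapheme_ranges are hypothetical, and the rule is a deliberate simplification that assumes only the Combining Diacritical Marks block (U+0300..U+036F) extends a cluster; the real rules are in UAX #29.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplifying assumption: only U+0300..U+036F extends a cluster.
// A real implementation would look this up in the Unicode property tables.
inline bool is_extend(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Returns half-open [begin, end) index ranges into cps, one per cluster.
// The storage stays a flat run of code points; clusters are only a view.
inline std::vector<std::pair<std::size_t, std::size_t>>
grapheme_ranges(const std::vector<char32_t>& cps) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    std::size_t i = 0;
    while (i < cps.size()) {
        std::size_t j = i + 1;
        // Absorb every following extending code point into this cluster.
        while (j < cps.size() && is_extend(cps[j])) {
            ++j;
        }
        out.emplace_back(i, j);
        i = j;
    }
    return out;
}
```

Because the boundaries are recomputed from the code points, typing a combining mark after a base character simply makes the existing cluster grow; no separate grapheme data type has to be updated.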
I have never found a use case in practice where specialising the
grapheme as anything other than a validated series of code points was
helpful. The two cases where graphemes are important are display [which
requires intermediate glyph conversion anyway, and works just as well on
runs of code points, so code points are fine] and editing, where the
grapheme-ness alters during typing.
>The Unicode standard also specifies various features such as a
>collation algorithm in Technical Standard #10 - Unicode Collation
>Algorithm for comparison and ordering of strings with a locale-specific
>criterion, as well as mechanisms to iterate over words, sentences and
>lines
Have a look at the character library that I posted in the vault - if you
can do graphemes then you can do words, paragraphs etc. as they are all
just attributes of the characters with simple rules. Graphemes come
into their own for text display and editing, and you would need these as
well to be able to support that.
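
As a rough illustration of "attributes of the characters with simple rules", the sketch below assigns each code point a class and breaks wherever the class changes. WordClass, word_class and word_boundaries are hypothetical names, and the ASCII-only classifier is an assumption standing in for a lookup in the character property tables; it is not the rule set from the Unicode standard.

```cpp
#include <cstddef>
#include <vector>

enum class WordClass { Letter, Space, Other };

// Assumed stand-in for a property-table lookup: ASCII only.
inline WordClass word_class(char32_t cp) {
    if ((cp >= U'a' && cp <= U'z') || (cp >= U'A' && cp <= U'Z')) {
        return WordClass::Letter;
    }
    if (cp == U' ' || cp == U'\t' || cp == U'\n') {
        return WordClass::Space;
    }
    return WordClass::Other;
}

// Returns boundary offsets into cps, including 0 and cps.size().
// Rule: break where the class changes; each Other code point stands alone.
inline std::vector<std::size_t>
word_boundaries(const std::vector<char32_t>& cps) {
    std::vector<std::size_t> out;
    if (cps.empty()) {
        return out;
    }
    out.push_back(0);
    for (std::size_t i = 1; i < cps.size(); ++i) {
        const WordClass prev = word_class(cps[i - 1]);
        const WordClass cur = word_class(cps[i]);
        if (prev != cur || cur == WordClass::Other) {
            out.push_back(i);
        }
    }
    out.push_back(cps.size());
    return out;
}
```

Swapping in a different property and rule table would give sentence or line boundaries with the same loop, which is the sense in which these are all the same problem.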
Don't forget that Windows GDI only supports point arithmetic, which
means that you need to be able to locate word boundaries to display text
well at different scales and work around the GDI scaling rounding [and
GDI+ is not much better].
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk