From: Graham (Graham_at_[hidden])
Date: 2005-07-22 17:43:58
Hi,
I am considering creating a set of Unicode support classes for boost
that would add native Unicode support to boost and would like to find
out if there is any interest in either using it or helping to write it.
1. unistring
This would store UTF16 logically ordered data and allow Unicode actions
and iterations.
It would be largely similar to basic_string, except that the iterators
supported would be forward only and would be arranged in the following
class hierarchy:
Data
UTF16 - const iterator
Grapheme
Word
Sentence
Line
Identifier
Data: WORDs of data
*Data Iterator: WORD
UTF16: UTF16 encoded words [i.e. includes surrogates which are two
WORDs]
*UTF16 Iterator: const WORD
Grapheme: What appears to the user as a character but may consist of
several UTF16 encoded WORDs [e.g. e followed by acute].
*Grapheme Iterator: unichar
Word: Word breaks [n.b. this would not work in languages like Thai where
word breaks cannot be computed from the characters alone].
*Word iterator: unistring [return portion of string as new and
independent string]
Line: Line breaks [n.b. ditto Thai ...]
*Line iterator: unistring [return portion of string as new and
independent string]
Identifier: Language parsing iterator
* Identifier iterator: unistring [return portion of string as new and
independent string]
All iterators would support basedata() which would convert the iterator
to a Data iterator which can then have const cast off as required.
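The difference between the Data level and the UTF16 level comes down to surrogate pairs. The following is a minimal sketch of that step, assuming WORD is a 16-bit unsigned type and DWORD a 32-bit one; next_code_point and count_code_points are hypothetical helper names for illustration, not part of the proposed unistring interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using WORD = std::uint16_t;   // assumption: 16-bit code unit, as in the post
using DWORD = std::uint32_t;  // assumption: 32-bit holder for one code point

// Hypothetical helper: decode the code point that starts at index i and
// advance i past it. Unpaired surrogates are reported as U+FFFD.
inline DWORD next_code_point(const std::vector<WORD>& data, std::size_t& i) {
    assert(i < data.size());
    const WORD first = data[i++];
    if (first >= 0xD800 && first <= 0xDBFF) {  // high surrogate
        if (i < data.size() && data[i] >= 0xDC00 && data[i] <= 0xDFFF) {
            const WORD second = data[i++];
            return 0x10000u + ((DWORD(first) - 0xD800u) << 10)
                            + (DWORD(second) - 0xDC00u);
        }
        return 0xFFFDu;  // high surrogate without a following low surrogate
    }
    if (first >= 0xDC00 && first <= 0xDFFF) return 0xFFFDu;  // stray low surrogate
    return first;
}

// Count code points (not WORDs) in a UTF-16 sequence.
inline std::size_t count_code_points(const std::vector<WORD>& data) {
    std::size_t i = 0, n = 0;
    while (i < data.size()) {
        next_code_point(data, i);
        ++n;
    }
    return n;
}
```

A Grapheme iterator would sit one level above this and group code points such as e followed by a combining acute into a single user-visible character.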
The unistring would support equality (==) but not equivalence (<).
Equality would perform canonical decomposition on the fly and compare
decomposed data.
Equivalence would be supported by a separate class called unistringsort.
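As a rough illustration of equality by canonical decomposition, here is a sketch under several assumptions: the three-entry table is purely illustrative (a real one would be generated from UnicodeData.txt), the names decomposition_table, decompose_into and canonical_equal are hypothetical, both strings are decomposed into temporaries instead of on the fly, and the canonical reordering of combining marks that full canonical equivalence needs is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using unichar = std::uint32_t;  // assumption: one code point per element

// Illustrative three-entry table only.
inline const std::map<unichar, std::vector<unichar>>& decomposition_table() {
    static const std::map<unichar, std::vector<unichar>> table = {
        {0x00E9, {0x0065, 0x0301}},  // e with acute -> e + combining acute
        {0x00C5, {0x0041, 0x030A}},  // A with ring -> A + combining ring above
        {0x212B, {0x00C5}},          // ANGSTROM SIGN -> A with ring (decomposes again)
    };
    return table;
}

// Recursively expand one code point and append the result to out.
inline void decompose_into(unichar c, std::vector<unichar>& out) {
    const auto& table = decomposition_table();
    const auto it = table.find(c);
    if (it == table.end()) {
        out.push_back(c);
        return;
    }
    assert(!it->second.empty());  // table entries must expand to something
    for (unichar part : it->second) decompose_into(part, out);
}

// Hypothetical equality test: compare the fully decomposed sequences.
inline bool canonical_equal(const std::vector<unichar>& a,
                            const std::vector<unichar>& b) {
    std::vector<unichar> da, db;
    for (unichar c : a) decompose_into(c, da);
    for (unichar c : b) decompose_into(c, db);
    return da == db;
}
```

With this, a precomposed e with acute compares equal to e followed by a combining acute, which a plain WORD-by-WORD comparison would not do.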
The following methods would be supported:
empty
size
decompose
tolower
toupper
assignUTF7
assignUTF8
assignUTF16
assignUTF32
insertUTF7
insertUTF8
insertUTF16
insertUTF32
find
find_if
replace
replace_if
others ...
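To give an idea of what a method like assignUTF8 has to do internally, here is a sketch of a UTF-8 to UTF-16 conversion written as a free function. The names utf8_to_utf16 and append_utf16 are hypothetical, WORD is assumed to be a 16-bit unsigned type, and rejecting malformed input by returning false is just one possible error policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using WORD = std::uint16_t;  // assumption: 16-bit code unit, as in the post

// Append one valid code point as one or two UTF-16 code units.
inline void append_utf16(std::uint32_t cp, std::vector<WORD>& out) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<WORD>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<WORD>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<WORD>(0xDC00 + (cp & 0x3FF)));
    }
}

// Returns false on malformed input; out is then only partially filled.
inline bool utf8_to_utf16(const std::string& in, std::vector<WORD>& out) {
    // Smallest code point that legitimately needs 0, 1, 2, 3 continuation bytes.
    static const std::uint32_t smallest[] = {0, 0x80, 0x800, 0x10000};
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;                         // invalid lead byte
        if (in.size() - i <= extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;
        if (cp < smallest[extra]) return false;          // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // encoded surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        append_utf16(cp, out);
    }
    return true;
}
```

The insertUTF8 variant would presumably run the same decoding loop and splice the result into the existing WORD data at a given position.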
2. unistringsort
unistringsort would have two members, a const unistring and a const
vector<WORD> of sort data [4 words per Unicode character].
Equality (==) would not be supported.
Equivalence (<) would be supported which would do a level 4 compare on
the sort data.
Sort level 4 is the level used for display sorting of strings. Other
levels allow ignoring case, accents, etc.
The following methods would be supported:
equals(const unistringsort & other, sortlevel level) would allow other
sort levels (1, 2, 3, and 4) to be tested.
const unistring& string()
const vector<WORD>* data()
Note: Unicode sorting is complex and involves sort decomposition,
canonical decomposition, mathematical expansion of characters using a
predetermined table and rules to 4 words per character that are then
parsed in priority order.
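The "priority order" comparison can be sketched as follows, assuming the sort data has already been expanded to four weights per character. The type name weights and the function compare_keys are hypothetical, and the sketch simplifies real Unicode collation: it does not skip ignorable (zero) weights and does not handle contractions or expansions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using WORD = std::uint16_t;           // assumption: 16-bit weight
using weights = std::array<WORD, 4>;  // assumption: one weight per sort level

// Three-way compare over levels 1..max_level, most significant level first.
// Returns -1, 0 or 1. All weights of one level are compared across the whole
// string before the next level is consulted.
inline int compare_keys(const std::vector<weights>& a,
                        const std::vector<weights>& b, int max_level) {
    assert(max_level >= 1 && max_level <= 4);
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (int level = 0; level < max_level; ++level) {
        for (std::size_t i = 0; i < n; ++i) {
            if (a[i][level] != b[i][level])
                return a[i][level] < b[i][level] ? -1 : 1;
        }
        if (a.size() != b.size()) return a.size() < b.size() ? -1 : 1;
    }
    return 0;
}
```

In these terms, operator< on unistringsort would correspond to compare_keys(x, y, 4) < 0, and equals(other, level) to compare_keys(x, y, level) == 0, so that a lower level treats strings differing only in case or accents as equal.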
3. unichar
A unichar data type would also be created to allow Unicode tests on
individual characters. It would be a DWORD for ease, since Unicode
code points need up to 21 bits.
This would allow testing of:
isstrongright
isright
isleft
isstrongleft
isleftjoining
isrightjoining
isbothjoining
isnumeric
etc...
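The 21-bit point can be checked directly. The sketch below assumes DWORD is a 32-bit unsigned type; is_surrogate, is_scalar_value and checked_unichar are hypothetical names in the style of the tests listed above, not a proposed interface.

```cpp
#include <cassert>
#include <cstdint>

using DWORD = std::uint32_t;  // assumption: 32-bit storage, as in the post

constexpr DWORD max_code_point = 0x10FFFF;  // highest Unicode code point

// 0x10FFFF needs 21 bits, so it fits a DWORD but not a 16-bit WORD.
static_assert(max_code_point < (DWORD(1) << 21), "fits in 21 bits");
static_assert(max_code_point >= (DWORD(1) << 20), "does not fit in 20 bits");

// Surrogates are UTF-16 encoding artefacts, not characters in their own right.
inline bool is_surrogate(DWORD c) { return c >= 0xD800 && c <= 0xDFFF; }

// True for values that can stand for a character on their own.
inline bool is_scalar_value(DWORD c) {
    return c <= max_code_point && !is_surrogate(c);
}

// Hypothetical constructor-style check for a unichar value.
inline DWORD checked_unichar(DWORD c) {
    assert(is_scalar_value(c));
    return c;
}
```

The property tests themselves (isnumeric, the joining and direction tests, and so on) would be table lookups keyed on such a value.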
The implementation would require a few megabytes of data files created
from the current Unicode release, and my initial thoughts are that the
following files would need to be parsed:
allkeys.txt
CaseFolding.txt
GraphemeBreakTest.txt
LineBreak.txt
SentenceBreakTest.txt
SpecialCasing.txt
UnicodeData.txt
WordBreakTest.txt
The data files created would be header files containing native C arrays
containing specially organised data to support all the above
functionality.
A program would be created to produce these files automatically from an
up-to-date Unicode release.
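One step of such a generator might look like the sketch below. It relies only on the fact that UnicodeData.txt lines are semicolon-separated, with the code point in the first field, the general category in the third and the canonical combining class in the fourth. The function names and the layout of the emitted array entry are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split one semicolon-separated line into its fields (empty fields are kept).
inline std::vector<std::string> split_fields(const std::string& line) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type end = line.find(';', start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}

// Hypothetical generator step: turn one line into a C array initializer
// holding the code point and combining class, with the general category
// appended as a comment.
inline std::string to_table_entry(const std::string& line) {
    const std::vector<std::string> f = split_fields(line);
    assert(f.size() >= 4);  // code point; name; category; combining class
    const unsigned long code = std::stoul(f[0], nullptr, 16);
    const unsigned long ccc = std::stoul(f[3], nullptr, 10);
    std::ostringstream out;
    out << "{ 0x" << std::hex << std::uppercase << code
        << ", " << std::dec << ccc << " }, // " << f[2];
    return out.str();
}
```

A real generator would also have to handle the range entries in UnicodeData.txt and pack the data more compactly than one struct per code point.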
The ranges in Blocks.txt would need to be hard coded to support
functions like ishangul etc.
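A hard-coded range test of this kind could be as small as the sketch below. The two ranges are the Hangul Jamo and Hangul Syllables blocks as listed in Blocks.txt; the struct block_range and the decision to cover only these two blocks are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

using DWORD = std::uint32_t;  // assumption: 32-bit code point storage

struct block_range {
    DWORD first;
    DWORD last;
};

// Ranges for two of the Hangul blocks in Blocks.txt.
const block_range hangul_jamo = {0x1100, 0x11FF};
const block_range hangul_syllables = {0xAC00, 0xD7AF};

inline bool in_block(DWORD c, const block_range& b) {
    assert(b.first <= b.last);
    return c >= b.first && c <= b.last;
}

// Hypothetical ishangul: true for code points in either block above.
// Other Hangul-related blocks (compatibility jamo, extensions) are left out.
inline bool ishangul(DWORD c) {
    return in_block(c, hangul_jamo) || in_block(c, hangul_syllables);
}
```

Note that a block test is coarser than a property test: a block can contain unassigned code points.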
Questions:
1. Is this worth doing: are enough people interested to make it
worthwhile?
2. Which would be the best implementation of basic_string to use?
3. Should unistring support equality due to the overhead in
decomposition or should there be a decompunistring?
4. Do we need other classes or to include functionality such as the
ability to convert to display order?
5. What other methods are required on the classes?
6. Any comments?
Yours,
Graham Barnett
BEng, MCSD/ MCAD .Net, MCSE/ MCSA 2003, CompTIA Sec+