
From: Graham (Graham_at_[hidden])
Date: 2005-07-24 17:19:56


Message: 14

Date: Sun, 24 Jul 2005 20:14:29 +0200

From: Erik Wien <wien_at_[hidden]>

Subject: Re: [boost] Call for interest for native unicode character

      and string support in boost

To: boost_at_[hidden]

Message-ID: <dc0lpu$ej0$1_at_[hidden]>

Content-Type: text/plain; charset=ISO-8859-1; format=flowed

 

 

Dear Erik,

 

I have done extensive Unicode work, and would be very interested to see
what you have done.

 

I apologise if I ask a lot of questions, but these are the questions that
immediately spring to mind for any Unicode implementation.

 

 

How have you organised the Unicode 4.1 data? Optimising this is a project
in itself, and I have gone through a number of iterations. Unfortunately
I have no definitive answer, as you ultimately trade off speed against
data size.
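One common way to strike that speed/size trade-off is a two-stage (trie-like) lookup table, where identical 256-entry pages are shared. The C++ sketch below hand-fills only ASCII digits to show the shape; the type and property names are invented, and a real build would generate the tables from the Unicode character database.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical two-stage property lookup: stage1 maps a codepoint's high
// bits to a page index, and identical pages (e.g. the many all-empty
// blocks) are shared, so the full 0x110000 codepoint space costs far less
// than a flat array.
enum : std::uint8_t { PROP_NONE = 0, PROP_DIGIT = 1 };

struct property_table {
    std::vector<std::uint16_t> stage1;  // block (cp >> 8) -> page index
    std::vector<std::uint8_t>  stage2;  // pages of 256 property bytes

    property_table()
        : stage1(0x1100, 0),            // 0x110000 / 256 blocks, all page 0
          stage2(512, PROP_NONE) {      // page 0: empty, page 1: ASCII block
        stage1[0] = 1;                  // block 0 gets its own page
        for (char32_t cp = U'0'; cp <= U'9'; ++cp)
            stage2[256 + (cp & 0xFF)] = PROP_DIGIT;
    }

    std::uint8_t props(char32_t cp) const {
        return stage2[stage1[cp >> 8] * 256u + (cp & 0xFF)];
    }
};
```

Widening stage2's element type, or adding a third stage, shifts the balance further between lookup speed and table size.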

 

Have you allowed for updating against the current Unicode standard from
the Unicode data files?

 

Have you made any trade-off of boundary [grapheme etc.] detection for a
simplification of the character data?

 

Have you given access to the character properties?

 

Have you added Unicode sorting, and done it in such a way as not to incur
a high performance hit, e.g. by having a separate pair<sort data, string>
class?
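The pair<sort data, string> idea could look like the sketch below: compute the collation key once at construction so a sort does cheap byte compares instead of recomputing collation on every comparison. The key here is only a stand-in (ASCII case folding); a real key would come from the Unicode Collation Algorithm, and the class name is invented.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pair<sort key, string> class: the key is precomputed so
// comparisons during sorting never touch the collation machinery.
struct collated_string {
    std::string key;   // precomputed sort key (stand-in for a UCA key)
    std::string text;  // the original string, untouched

    explicit collated_string(std::string s) : text(std::move(s)) {
        key.reserve(text.size());
        for (unsigned char c : text)                 // stand-in folding
            key.push_back(static_cast<char>(std::tolower(c)));
    }
    bool operator<(const collated_string& o) const { return key < o.key; }
};
```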

 

Have you stopped equivalence or equality on a Unicode string?

 

What was your trade-off on canonical decomposition?
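For reference, the mechanism being traded off can be sketched in a few lines: a map from precomposed codepoints to their expansions, applied recursively because decompositions nest. The three entries are real Unicode canonical mappings; a full table would be generated from field 5 of UnicodeData.txt, and the function name is invented.

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal sketch of canonical decomposition (the NFD direction only).
static const std::map<char32_t, std::u32string> canon = {
    { U'\u00E9', U"e\u0301" },        // e-acute -> e + combining acute
    { U'\u00DC', U"U\u0308" },        // U-diaeresis -> U + combining diaeresis
    { U'\u01D7', U"\u00DC\u0301" },   // U-diaeresis-acute (nested mapping)
};

std::u32string decompose(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        auto it = canon.find(cp);
        if (it != canon.end())
            out += decompose(it->second);   // expand recursively
        else
            out.push_back(cp);
    }
    return out;
}
```

The trade-off is whether such expansions are stored fully pre-flattened (bigger tables, no recursion) or expanded recursively at runtime as here.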

 

How have you hooked in dictionary word-break support for languages like
Thai? Or have you just built in support for adding private characters, so
that you can support force word break, force no word break, and detected
word break [to name just three necessary private characters - in this
case for Thai]?
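The private-character approach could be sketched as below: three Private Use Area codepoints (the specific values and names are invented for this example) mark break decisions, so a dictionary pass annotates the text once and later passes only scan for markers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical PUA markers for word-break decisions in scripts written
// without spaces, such as Thai.
constexpr char32_t FORCE_WORD_BREAK    = U'\uE000';
constexpr char32_t FORCE_NO_WORD_BREAK = U'\uE001';
constexpr char32_t DETECTED_WORD_BREAK = U'\uE002';

// Count the word boundaries a dictionary pass has marked in the text.
std::size_t count_breaks(const std::u32string& annotated) {
    std::size_t n = 0;
    for (char32_t cp : annotated)
        if (cp == FORCE_WORD_BREAK || cp == DETECTED_WORD_BREAK)
            ++n;
    return n;
}
```

The markers would be stripped before display; they exist only so that break detection need not be recomputed on every pass over the text.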

 

How far have you gone? Do you have support for going from logical to
display order on combined left-to-right and right-to-left text?
Customised glyph conversion for Indic scripts or for Urdu?

 

Should these discussions be in a separate mailing group?

 

How can we ensure that other Boost projects understand the implications
of Unicode support and the subtle changes required, e.g. hooks to allow
for canonical decomposition on the string data portions of regular
expressions in the regex project?

 

 

I am open to suggestions as to the best way to proceed. I feel that there
are so many factors that can be traded against each other that there must
be some flexibility in the design to allow for speed- or memory-optimised
variants, and this will need a lot of careful thought from different
informed viewpoints.

 

My feeling is that the first step must be to agree on the organisation of
the data tables that are parsed from the Unicode data to allow for
character tests, upper-/lower-case conversion, sort conversion, etc.

There must be agreement on how best to organise these for speed or size,
and on what character tests are required.

 

Until we can perform tests on Unicode characters and have the basics in
place, getting into Unicode strings is jumping the gun!

 

We will need to create a utility that takes the 'raw'/published Unicode
data files, along with user-defined private characters, and builds these
tables, which would then be used by the set of functions that we agree
on, such as isnumeric, ishangul, isstrongrtol, isrtol, etc.
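The front end of such a utility might look like the sketch below: parse one semicolon-separated line of UnicodeData.txt and keep the fields a table generator would need. The sample line in the comment is a real entry (U+0E01 THAI CHARACTER KO KAI); the struct and function names are invented.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical parser for one UnicodeData.txt line, e.g.
// "0E01;THAI CHARACTER KO KAI;Lo;0;L;;;;;N;;;;;"
struct ucd_entry {
    char32_t    cp;
    std::string name;
    std::string category;   // general category, e.g. "Lo", "Nd"
};

ucd_entry parse_ucd_line(const std::string& line) {
    std::istringstream ss(line);
    std::string cp_hex, name, cat;
    std::getline(ss, cp_hex, ';');  // field 0: codepoint in hex
    std::getline(ss, name, ';');    // field 1: character name
    std::getline(ss, cat, ';');     // field 2: general category
    return { static_cast<char32_t>(std::stoul(cp_hex, nullptr, 16)),
             name, cat };
}

// A generated predicate like isnumeric then reduces to a category check.
bool isnumeric(const ucd_entry& e) { return e.category == "Nd"; }
```

A generator built on this would emit the lookup tables at build time, which is what would let the library track new releases of the Unicode data files directly.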

 

I think that this would allow us to progress in a way that builds the
foundations first and everything else on top of them, and that would
allow the library to be directly related to, and updated from, the
Unicode standard itself.

 

What do you think?

 

Yours,

 

Graham Barnett

BEng, MCSD/ MCAD .Net, MCSE/ MCSA 2003, CompTIA Sec+

 


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk