From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-07-27 23:24:52


Hi Graham,

On 7/25/05, Graham <Graham_at_[hidden]> wrote:
> [...]
> If we can agree the interface/separation of Unicode character data
> from string interfaces then I believe that we will move forward quickly
> from there, as we are then 'just' talking about algorithm optimisation
> on a known data set to create the best possible string implementation.

OK, we agree on this; I was incorrectly lumping things together.

> >> How have you hooked in dictionary word break support for languages
> >> like Thai?
>
> >IMO that would be beyond the scope of a general Unicode library.
>
> It is both outside the scope and fundamental to the approach, as this
> case must be handled/provided for.
>
> In my experience this is handled by the dictionary pass [outside the
> scope of this support] adding special break markers into the text [which
> need to be supported transparently as Unicode characters that happen to
> be in the private use range at this level] so that the text and string
> iterators can then be handled normally. The fact that the break markers
> are special characters in the private use range should not be relevant
> or special at this level.

You mean that we invent a set of private characters that the
dictionary pass should use?
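
If so, I imagine the pass would look roughly like the sketch below. To
be clear, U+E000 as the marker and every identifier here are inventions
of mine purely to illustrate the idea, and the stub segmenter is of
course not a real dictionary:

// Sketch only: a dictionary pass inserts a private-use code point as a
// word-break marker, so that the later break and string iteration can
// treat the marker like any other character.
#include <cstddef>
#include <string>
#include <vector>

constexpr char32_t word_break_marker = U'\uE000'; // private use area (my choice)

// Stand-in for a real dictionary segmenter: pretends every four code
// points form a word, purely to keep the example self-contained.
std::vector<std::size_t> dictionary_breaks(const std::u32string& text)
{
    std::vector<std::size_t> breaks;
    for (std::size_t i = 4; i < text.size(); i += 4)
        breaks.push_back(i);
    return breaks;
}

// Copy the text, inserting the marker at every break the dictionary found.
std::u32string mark_word_breaks(const std::u32string& text)
{
    std::u32string out;
    std::size_t prev = 0;
    for (std::size_t pos : dictionary_breaks(text))
    {
        out.append(text, prev, pos - prev);
        out.push_back(word_break_marker);
        prev = pos;
    }
    out.append(text, prev, text.size() - prev);
    return out;
}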

> >> How far have you gone? Do you have support for going from logical to
> >> display on combined ltor and rtol? Customised glyph conversion for
> >> Indic Urdu?
>
> >Correct me if I'm wrong, but I think these issues become important
> >only when rendering Unicode strings. Aren't they thus better handled
> >by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost
> >Unicode library should focus on processing symbolic Unicode strings
> >and keep away from what happens when they are displayed, just like
> >std::basic_string does.
>
> Unfortunately I believe that there may be serious limitations in this
> approach.
>
> I strongly believe that even if we do not actually write all the code we
> must not be in a position where, for example, you have to use a
> Uniscribe library based on Unicode 4 and a Boost library based on
> Unicode 4.1. [This is even ignoring UniScribe's 'custom' handling].
>
> We must provide a Unicode character system on which all libraries can
> operate consistently.
>
> Even working out a grapheme break may require different sets of
> compromises that must work consistently for any set of inter-related
> libraries to be successful.

Do you have an example? I'm having trouble envisioning a situation in
which libraries based on different Unicode versions actually cause
conflicts.

> As another example where display controls data organisation, what
> happens if you want to have a page of text display the same on several
> machines?

Can you elaborate? In what cases is this vital and how does display
influence data organisation?

> This is actually a very difficult thing to do, due to limitations in
> Windows GDI scaling [which is not floating point but 'rounds' scaling
> calculations, and which can result in as much as a +/-10% difference in
> simple string lengths on different machines unless handled specifically
> - e.g. IIIIIIIIIIIIIIII can be the same length as WXWXWX on one machine
> but differ by 20% on another]. It requires access to the data
> conversion mechanisms, and requires that you know how you are going to
> perform the rendering.

I fear I don't understand what you mean. It sounds to me like you're
suggesting defining a new font format for the Boost Unicode library.
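
Or is the point simply that per-glyph integer rounding of advance
widths accumulates, as in the toy arithmetic below? (All the numbers
here are made up for illustration.)

#include <cmath>
#include <cstdio>

int main()
{
    // Hypothetical scaled advance width of one 'I' on some device.
    const double advance = 3.4;
    const int glyphs = 16;   // "IIIIIIIIIIIIIIII"

    double exact = advance * glyphs;                         // 54.4
    long per_glyph_rounded = glyphs * std::lround(advance);  // 16 * 3 = 48

    // Roughly a 12% difference; a device whose scaling yielded 3.6
    // instead would round to 4 per glyph and give 64, which is why two
    // machines can disagree about the length of the same string.
    std::printf("exact: %.1f  per-glyph rounding: %ld\n",
                exact, per_glyph_rounded);
    return 0;
}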

> >Why would you want to do canonical decomposition explicitly in a
> >regular expression?
>
> Let me give two examples:
>
> First, why?
>
> If you use a regular expression to search for <e acute> - [I use angle
> brackets <> to describe a single character for this e-mail] then
> logically you should find text containing:

And <acute> is a combining acute?

> <e><acute> and <e acute> as these are both visually the same when
> displayed to the user.

> Second, why do we need to know? If we decompose arbitrarily then we can
> cover over syntax errors and act unexpectedly:
> [...]

Yes, the Unicode library should by default process grapheme clusters
rather than code points. This would automatically solve the regex
issue.
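
To make the first example concrete, what I have in mind is that both
the pattern and the text are brought to the same canonical form before
matching, along the lines of the toy code below. The decomposition
table contains a single entry and a real normaliser would also need
canonical reordering of combining marks, so this is only an
illustration of the idea:

#include <map>
#include <string>

// Toy decomposition table: just <e acute> (U+00E9) -> <e><acute>.
const std::map<char32_t, std::u32string> toy_decomposition = {
    { U'\u00E9', U"\u0065\u0301" },
};

// Replace every precomposed character by its canonical decomposition.
std::u32string toy_decompose(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in)
    {
        auto it = toy_decomposition.find(c);
        if (it != toy_decomposition.end())
            out += it->second;
        else
            out.push_back(c);
    }
    return out;
}

// Both spellings become the same code point sequence, so a search for
// one also finds the other:
//   toy_decompose(U"\u00E9") == toy_decompose(U"\u0065\u0301")   // true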

> >> We will need to create a utility to take the 'raw'/published Unicode
> >> data files, along with user-defined private characters, to make the
> >> tables which would then be used by the set of functions that we will
> >> agree, such as isnumeric, ishangul, isstrongrtol, isrtol etc.
> >
> >I find the idea of users embedding private character properties
> >_within_ the standard Unicode tables, and building their own slightly
> >different version of the Unicode library, scary. Why is this needed?
>
> It is important that the private use range, which is part of the Unicode
> spec, be handled consistently with the other Unicode ranges, otherwise we
> end up having to write everything twice!
>
> The private use range is in the Unicode spec specifically as it has been
> recognised that any complex Unicode system will need private use
> characters.
>
> Classic examples are implementations that move special display
> characters into portions of the private use ranges to allow for optimal
> display of visible tabs, visible cr, special characters like Thai word
> breaks, and of course completely non-standard characters like a button
> that can be embedded in text and would be entirely implementation
> specific. Having the breaking characteristics of these characters be
> handled consistently with all Unicode characters is a massive
> simplification for coding.
>
> I strongly believe that we must therefore allow each developer who wants
> to use the Unicode system the ability to add these private use character
> properties into their own personal main character tables so they are
> handled consistently with all other characters, but acknowledge that
> these are implementation specific.
>
> This private use character data would NOT be published or distributed -
> the facility to merge them in during usage allows each developer the
> access to add their own private use data for their own system only.

But surely this means every app would have to come with a different DLL?
I'm not so sure about this. For many cases other markup (XML or
something) would do. Maybe other people have opinions about this?
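
That said, if merging really is needed, I imagine the table-building
utility would do something roughly like the sketch below. The
three-field file format and every identifier are my own invention (the
real UnicodeData.txt has far more fields); the point is only that the
private-use rows go through exactly the same code path as the published
ones:

#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

struct char_properties
{
    std::string general_category;   // e.g. "Lo", "Nd", "Co"
    bool strong_rtl = false;
};

using property_table = std::unordered_map<char32_t, char_properties>;

// Parse semicolon-separated "codepoint;category;direction" rows; the
// same parser handles the published data and the private additions.
void load_rows(std::istream& in, property_table& table)
{
    std::string line;
    while (std::getline(in, line))
    {
        if (line.empty() || line[0] == '#')
            continue;
        std::istringstream row(line);
        std::string cp, category, direction;
        std::getline(row, cp, ';');
        std::getline(row, category, ';');
        std::getline(row, direction, ';');
        char_properties p;
        p.general_category = category;
        p.strong_rtl = (direction == "R");
        table[static_cast<char32_t>(std::stoul(cp, nullptr, 16))] = p;
    }
}

property_table build_table(const std::string& unicode_data_file,
                           const std::string& private_use_file)
{
    property_table table;
    std::ifstream standard(unicode_data_file);
    load_rows(standard, table);     // published properties first
    std::ifstream priv(private_use_file);
    load_rows(priv, table);         // private-use entries added on top
    return table;
}

// The agreed query functions would then consult the merged table, e.g.:
bool is_strong_rtl(const property_table& t, char32_t c)
{
    auto it = t.find(c);
    return it != t.end() && it->second.strong_rtl;
}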

Regards,
Rogier

