Subject: Re: [boost] [string] Realistic API proposal
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-30 02:46:56
>
> If my strings are valid and normalized, I can compare them with a simple
>binary-level comparison;
> likewise for substring search, where I may also need to add a boundary check
>if I want fine-grain search.
>
No, you can't.
For example, when you search for the word שלום you want to find שָׁלוֹם (with
diacritics) as well, and normalization does not make those equal.
Search and collation require much more complicated, multi-level comparison.
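Roughly what I have in mind, with ICU (a sketch from memory; the exact calls,
locale and strings are only illustrative):

    #include <unicode/stsearch.h>
    #include <unicode/coll.h>
    #include <unicode/unistr.h>
    #include <iostream>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        // Unpointed pattern, pointed text: a binary substring search fails here.
        icu::UnicodeString pattern = icu::UnicodeString::fromUTF8("שלום");
        icu::UnicodeString text = icu::UnicodeString::fromUTF8("here is שָׁלוֹם in a sentence");

        icu::StringSearch search(pattern, text, icu::Locale("he"), NULL, status);
        // PRIMARY strength ignores diacritics (and case) while matching.
        search.getCollator()->setStrength(icu::Collator::PRIMARY);
        search.reset();

        int32_t pos = search.first(status);
        if (U_SUCCESS(status) && pos != USEARCH_DONE)
            std::cout << "match at UTF-16 index " << pos << "\n";
        return 0;
    }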
>
> To get correct behaviour when comparing strings, they should be normalized.
> Normalization is costly, so you don't want to do it at each comparison, but
>only once.
> In practice, all data available everywhere should already be in NFC (XML
>mandates it, for example)
> and checking whether a string is normalized is very fast (while less fast
>than checking if a
> string is valid UTF-8, since you still need to access a table, which might
>hurt the cache, and
> is not vectorizable).
>
> Dealing with potentially invalid UTF strings can be highly dangerous as well,
>exploits for that
> kind of thing are common-place.
> I suspect denormalized Unicode could be sensitive too, since in some parts of
>your
> application 00e0 (à) and 0061 0300 (a + `) could compare equal but not in
>others, depending on what
> that string went through, causing inconsistencies.
>
The problem is that I may want 00e0 (à), 0061 0300 (a + `) and 0061 (a) to be
equal for string search as well.
I agree that normalization makes things simpler, but in many real-world
situations it is just not enough.
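To show what I mean (again only a sketch with ICU, calls written from memory):
at PRIMARY collation strength all three forms compare equal, whatever their
normalization form.

    #include <unicode/coll.h>
    #include <unicode/unistr.h>
    #include <cassert>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        icu::Collator* coll =
            icu::Collator::createInstance(icu::Locale("fr"), status);
        coll->setStrength(icu::Collator::PRIMARY);

        icu::UnicodeString precomposed((UChar)0x00E0);            // 00e0 (à)
        icu::UnicodeString decomposed((UChar)0x0061);
        decomposed.append((UChar32)0x0300);                       // 0061 0300 (a + `)
        icu::UnicodeString base((UChar)0x0061);                   // 0061 (a)

        assert(coll->equals(precomposed, decomposed));
        assert(coll->equals(precomposed, base));  // accent ignored at PRIMARY
        delete coll;
        return 0;
    }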
In any case I agree that most of the algorithms may and should be external, but
sometimes it is just convenient to have them within the object.
> Anyway, the only value we can bring on top of the range abstraction is by
>establishing invariants.
> It makes sense to establish the strongest one; though I am not opposed to just
>checking for UTF validity.
>
There are many things to check; checking for valid UTF is just one of the most
basic things to do when you get text from an untrusted source.
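For illustration, something along these lines (ICU's C API, sketched from
memory) is the minimal check I mean before trusting input at all:

    #include <unicode/ustring.h>
    #include <string>
    #include <vector>

    // Returns true only if 'bytes' is well-formed UTF-8
    // (no overlong sequences, no surrogates, no truncated sequences).
    bool is_valid_utf8(const std::string& bytes) {
        UErrorCode status = U_ZERO_ERROR;
        std::vector<UChar> buf(bytes.size() + 1);
        int32_t len = 0;
        u_strFromUTF8(&buf[0], (int32_t)buf.size(), &len,
                      bytes.data(), (int32_t)bytes.size(), &status);
        return U_SUCCESS(status);  // malformed input gives U_INVALID_CHAR_FOUND
    }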
>
> >> All of this is trivial to implement quickly with my Unicode library.
> >>
> >
> > No, it is not.
>
> I know better what I described and what my library is capable of, thank you.
>
>
> > Your Unicode library is locale agnostic which makes it quite
> > useless in too many cases.
>
> In the common case, you don't care (nor want to care) about a locale.
>
>
> > Almost every added function was locale sensitive:
> >
> > - search
> > - collation
> > - case handling
> >
> > And so on. This is major drawback of your library that
> > it is not capable of doing locale sensitive algorithms
> > that are vast majority of the Unicode algorithms
>
> Search up to the combining character sequence boundary is locale-agnostic.
I'm talking about search at the primary collation level, which is locale-agnostic.
> Search up to the grapheme boundary is virtually locale-agnostic (Unicode does
> not distribute locale alternatives, though it does hint at its possibility)
>
It does provide CLDR - the locale database that has all the tables you need.
> Case folding only has a couple of characters that are specific for Turkish,
> making it quite reasonably locale-agnostic.
>
If I am not mistaken, it is case folding that is locale-agnostic; case mapping
is locale-sensitive.
And "quite reasonably locale-agnostic" is not an answer for a Turkish speaker
:-)
It is like saying that text is generally LTR, with the small exceptions of Hebrew,
Arabic and Persian... so why care?
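The difference is easy to show with ICU's locale-aware case mapping (sketch,
from memory):

    #include <unicode/unistr.h>
    #include <unicode/locid.h>
    #include <iostream>
    #include <string>

    int main() {
        icu::UnicodeString tr = icu::UnicodeString::fromUTF8("istanbul");
        icu::UnicodeString en = icu::UnicodeString::fromUTF8("istanbul");

        tr.toUpper(icu::Locale("tr"));  // "İSTANBUL" - dotted capital I (U+0130)
        en.toUpper(icu::Locale("en"));  // "ISTANBUL"

        std::string t, e;
        std::cout << tr.toUTF8String(t) << "\n" << en.toUTF8String(e) << "\n";
        return 0;
    }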
> Collation depends on a special table; Unicode only provides a default one,
> which aims at being as locale-agnostic as possible. It also hosts a
>repository
> where one can get alternative tables.
>
Any reasonable Unicode library must use them:
ICU uses CLDR.
The Windows Unicode API uses CLDR.
CLDR even provides tables for the POSIX API to make it more
convenient.
So ignoring CLDR is just wrong.
CLDR is as much an integral part of Unicode as its algorithms
and character properties database.
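That locale data is exactly what makes the following give different answers
(again only a sketch, ICU calls from memory): in Swedish "ä" sorts after "z",
in German it sorts next to "a".

    #include <unicode/coll.h>
    #include <unicode/unistr.h>
    #include <iostream>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        icu::UnicodeString a = icu::UnicodeString::fromUTF8("ägg");
        icu::UnicodeString z = icu::UnicodeString::fromUTF8("zebra");

        icu::Collator* sv = icu::Collator::createInstance(icu::Locale("sv"), status);
        icu::Collator* de = icu::Collator::createInstance(icu::Locale("de"), status);

        // CLDR tailoring: "yes" for Swedish, "no" for German.
        std::cout << "sv: ägg after zebra? " << (sv->greater(a, z) ? "yes" : "no") << "\n";
        std::cout << "de: ägg after zebra? " << (de->greater(a, z) ? "yes" : "no") << "\n";

        delete sv;
        delete de;
        return 0;
    }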
>
> Anyway, those are mere details; you can always change the backend for
> one tailored to your locale.
>
I'm rather talking about the concept.
I do like the idea of having a full Unicode library in Boost,
but it should be done right.
There are many non-trivial problems with ICU, but it is
still the best library we have around, and a huge
amount of work has been put into it.
Artyom