Subject: Re: [boost] [string] Realistic API proposal
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-01-29 15:34:34


On 28/01/2011 14:58, Artyom wrote:

>> What am I paying for? I don't see how I gain anything.
>>
>
> You don't pay for validation of the UTF-8, especially when 99% of uses
> of the string are encoding-agnostic.

I asked for what I gained, not what I did not lose.

>>> // UTF validation
>>>
>>> bool is_valid_utf() const;
>>
>> See, that's what makes the whole thing pointless.
>
> Actually not, consider:
>
> socket.read(my_string);
> if(!my_string.is_valid_utf())
> ....

Could be a free function, and would actually be *better* as a free
function, because you could apply it to any range, not just your type.
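
As a rough sketch of what I mean, here is a validation routine written as
a free function over an arbitrary iterator range; the name and signature
are purely illustrative, not part of any existing or proposed API:

template <typename Iterator>
bool is_valid_utf8(Iterator it, Iterator end)
{
    while (it != end) {
        unsigned char lead = static_cast<unsigned char>(*it++);
        int trailing;
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range for the next byte
        if      (lead < 0x80) trailing = 0;   // ASCII
        else if (lead < 0xC2) return false;   // stray continuation or overlong lead
        else if (lead < 0xE0) trailing = 1;
        else if (lead < 0xF0) {
            trailing = 2;
            if (lead == 0xE0) lo = 0xA0;      // reject overlong encodings
            if (lead == 0xED) hi = 0x9F;      // reject UTF-16 surrogates
        } else if (lead < 0xF5) {
            trailing = 3;
            if (lead == 0xF0) lo = 0x90;      // reject overlong encodings
            if (lead == 0xF4) hi = 0x8F;      // reject code points above U+10FFFF
        } else return false;                  // 0xF5..0xFF never occur in UTF-8

        for (int i = 0; i < trailing; ++i, lo = 0x80, hi = 0xBF) {
            if (it == end) return false;      // truncated sequence
            unsigned char c = static_cast<unsigned char>(*it++);
            if (c < lo || c > hi) return false;
        }
    }
    return true;
}

Such a function works on a std::string, a std::vector<char>, a raw socket
buffer, and so on, without requiring the data to live in a dedicated
string type.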

>
>> Your type doesn't add any semantic value on top of std::string,
>> it's just an agglomeration of free functions into a class. That's a terrible
>> design.
>> The only advantage that a specific type for unicode strings would bring is
>> that it could
>> enforce certain useful invariants.
>>
>
> You don't need to enforce things you don't care about in 99% of cases.

You don't get the point.

Your type doesn't add any information on top of std::string. Therefore
it is meaningless.

It's just an agglomeration of functions; in C++ we use namespaces for
that, not classes.
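
For instance (the namespace and function names here are hypothetical,
only meant to illustrate the shape of the design):

#include <string>

namespace unicode_util { // hypothetical namespace, not an actual Boost component

    // Free functions grouped in a namespace; no wrapper class required.
    bool is_valid_utf8(const std::string& s);
    bool is_nfc(const std::string& s);
    std::string to_nfc(const std::string& s);

} // namespace unicode_util

// Call sites read just as naturally as member calls:
//     if (unicode_util::is_valid_utf8(my_string)) ...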

>> Enforcing that the string is in a valid UTF encoding and is normalized
>> in a specific normalization form can make most Unicode algorithms several
>> orders of magnitude faster.
>
> You do not always want to normalize text. It is the user's choice; you
> may have optimized algorithms for already-normalized strings,
> but it is not always the case.

If my strings are valid and normalized, I can compare them with a simple
binary-level comparison; likewise for substring search, where I may also
need to add a boundary check if I want fine-grained search.

What you want to do is implement comparison by iterating through each
lazily computed code point and comparing them. This is at least 60 times
as slow; it also doesn't really compare equivalent characters in the
strings.

To get correct behaviour when comparing strings, they should be
normalized. Normalization is costly, so you don't want to do it at each
comparison, but only once.
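
As a rough sketch of that usage pattern (normalize_nfc is a hypothetical
stand-in for whatever normalization routine you actually use, e.g. one
built on ICU; it is not a concrete API):

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical normalizer, defined elsewhere.
std::string normalize_nfc(const std::string& s);

// Normalize each string once, at the point where it enters the system...
std::vector<std::string> load_keys(const std::vector<std::string>& raw)
{
    std::vector<std::string> keys;
    keys.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i)
        keys.push_back(normalize_nfc(raw[i])); // pay the cost a single time
    return keys;
}

// ...so that every later comparison is a plain byte-level comparison.
bool same_text(const std::string& a_nfc, const std::string& b_nfc)
{
    return a_nfc == b_nfc;                     // no per-comparison normalization
}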
In practice, nearly all data you will encounter should already be in NFC
(XML mandates it, for example), and checking whether a string is
normalized is very fast (though slower than checking whether a string is
valid UTF-8, since you still need to access a table, which may hurt the
cache, and is not vectorizable).

Dealing with potentially invalid UTF strings can be highly dangerous as
well; exploits for that kind of thing are commonplace.
I suspect denormalized Unicode could be sensitive too, since in some
parts of your application U+00E0 (à) and U+0061 U+0300 (a + combining
grave) could compare equal but not in others, depending on what that
string went through, causing inconsistencies.
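
A small self-contained example of the second problem; the byte values
below are the actual UTF-8 encodings of the precomposed and decomposed
forms:

#include <cassert>
#include <string>

int main()
{
    // U+00E0 LATIN SMALL LETTER A WITH GRAVE, precomposed (NFC form)
    std::string composed("\xC3\xA0");
    // U+0061 followed by U+0300 COMBINING GRAVE ACCENT (decomposed form)
    std::string decomposed("\x61\xCC\x80");

    // Both render as "à" and are canonically equivalent, yet a raw
    // byte-level comparison treats them as different strings.
    assert(composed != decomposed);
    return 0;
}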

Anyway, the only value we can bring on top of the range abstraction is
by establishing invariants.
It makes sense to establish the strongest one, though I am not opposed
to just checking for UTF validity.

But no checking at all? There is no point.
You might as well make your string type a typedef of std::string.

> Also, what kind of normalization? NFC? NFKC?

NFC, of course. It takes less space and doesn't make you lose anything.
If you want to work in decomposed forms or something else, use your own
container and not the adaptor.

Remember, this whole thing is just there to help you deal with the
general case in a practical, correct and efficient way.
The real algorithms are fully generic, and allow you to do whatever you
want; they accept both normalized and un-normalized strings, data
regardless of its memory layout, etc.

>> All of this is trivial to implement quickly with my Unicode library.
>>
>
> No, it is not.

I know what I described and what my library is capable of better than
you do, thank you.

> Your Unicode library is locale-agnostic, which makes it quite
> useless in too many cases.

In the common case, you don't care (nor want to care) about a locale.

> Almost every added function was locale sensitive:
>
> - search
> - collation
> - case handling
>
> And so on. This is a major drawback of your library: it is not capable
> of doing locale-sensitive algorithms, which are the vast majority of
> the Unicode algorithms.

Search up to the combining character sequence boundary is locale-agnostic.
Search up to the grapheme boundary is virtually locale-agnostic (Unicode
does not distribute locale alternatives, though it does hint at the
possibility).

Case folding only has a couple of characters that are specific to
Turkish, making it quite reasonably locale-agnostic.
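
For reference, the locale-dependent part of case folding is tiny; a
sketch of the Turkic tailoring (the helper function is hypothetical, the
code point mappings come from Unicode's CaseFolding.txt):

unsigned int simple_fold(unsigned int cp, bool turkic)
{
    if (turkic) {
        if (cp == 0x0049) return 0x0131;  // 'I' folds to dotless i (U+0131)
        if (cp == 0x0130) return 0x0069;  // U+0130 (I with dot above) folds to 'i'
    }
    // Everything else folds the same way regardless of locale; a real
    // implementation would consult the default folding table here.
    return cp;
}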

Collation depends on a special table; Unicode only provides a default
one, which aims to be as locale-agnostic as possible. It also hosts a
repository where one can get alternative tables.

Anyway, those are mere details; you can always change the backend for
one tailored to your locale.

