From: John Maddock (john_at_[hidden])
Date: 2005-06-24 05:26:56


> No. We're only talking about case folding -- specifically the mappings
> found in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.

Well maybe you are, but the regex traits classes were always intended to
allow for other forms of equivalence as well.

>> However, if we try to do it on the fly things get expensive - having an
>> API that returns a string containing all equivalents to a character is a
>> non-starter IMO, it would potentially be called for every single input
>> character in the string being matched, allocating and returning a string
>> for each one would grind performance to a halt.
>
>
> "string_type" would have to be a type that uses the Small String
> Optimization to avoid allocs. These strings will be short (a max of 4
> characters for simple Unicode case folding). But see below.

Or we could return a reference to a string_type I guess.

>> We could change the API to
>> return a pair of iterators, or even something like:
>>
>> charT enumerate_equivalents(charT c); // return the next character
>>                                       // equivalent to c
>>
>> But instead of essentially one operation per input character, we'd have N
>> operations, if there are N equivalent characters. Which may or may not
>> be an issue in practice.
>>
>> The main objection I have is that these API's are a lot harder to
>> implement than the existing interface - currently in most cases a tolower
>> will do the job, and even for fairly strict Unicode conformance a
>> tolower(toupper(c)) will work.
>
>
> I don't see how. Can you explain?

Well they're harder than just a tolower anyway - you would need to enumerate
through the entire code set and build a big table of equivalents and then
dump that out as a bunch of C++ data declarations ready for the new API to
return. This is actually duplicating the data that's already present in our
C runtimes, and/or ICU or other libraries, but just isn't accessible in a
form we would like.
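
To make the table-building step concrete, here is a minimal sketch for narrow characters only. The names fold and build_equivalents, and the const-reference form of enumerate_equivalents, are hypothetical and not part of any proposed traits interface. The fold used is simply tolower(toupper(c)) from the C runtime, so the result depends on the current C locale.

```cpp
#include <array>
#include <cctype>
#include <string>

// Hypothetical table type: entry c lists every narrow character that
// folds to the same value as c (including c itself).
using equivalents_table = std::array<std::string, 256>;

// Assumed fold: the tolower(toupper(c)) idiom, using the C runtime's data.
inline unsigned char fold(unsigned char c) {
    return static_cast<unsigned char>(std::tolower(std::toupper(c)));
}

// Enumerate the whole (narrow) code set once and group characters by fold.
inline equivalents_table build_equivalents() {
    equivalents_table table;
    for (int c = 0; c < 256; ++c) {
        for (int d = 0; d < 256; ++d) {
            if (fold(static_cast<unsigned char>(c)) ==
                fold(static_cast<unsigned char>(d))) {
                table[c].push_back(static_cast<char>(d));
            }
        }
    }
    return table;
}

// Returns a reference to the precomputed string, so no allocation
// happens per input character.
inline const std::string& enumerate_equivalents(unsigned char c) {
    static const equivalents_table table = build_equivalents();
    return table[c];
}
```

For a 256-entry code set this is cheap to build at startup; for a full Unicode code set the same idea would need a sparse or generated table, which is the duplication of data complained about above.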

>> If the interface is too hard to implement then it becomes
>> useless as a traits class, and we might just as well get rid of it (I'd
>> really rather not go down this road, but it has been suggested before).
>>
>
>
> The regex_traits should make it possible for implementers to do full
> Unicode case folding *if they desire*. That's not the case now.

That's what happens in type u32regex in Boost.Regex now.

> Here's my suggestion. We add to the traits class two functions:
>
> bool in_range(Char from, Char to, Char ch);

What use is this one, or are you allowing equivalents other than case
folding now <wink>? If so then I approve :-)

> bool in_range_nocase(Char from, Char to, Char ch);

OK, but see below.

> We define the behavior of the regex engine in terms of these functions,
> but we don't require their use. In particular, for narrow character
> sets, implementers would be free to use a std::bitset<256>, enumerate
> the char range [from, to], call translate_nocase on each char, and set
> the appropriate bit in the bitset. Matching happens by calling
> translate_nocase on the input char and seeing if its bit is set in the
> bitset. That gives the same behavior.

I don't like traits class APIs that may or may not be called: what happens
if a user-defined traits class is provided that alters the behavior of
in_range, but not translate? The side effects produced by these APIs are
clearly visible.
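
For reference, the narrow-character scheme in the quoted paragraph could look roughly like the sketch below. The free functions translate_nocase, make_nocase_set and matches_nocase are hypothetical stand-ins (a real implementation would call the traits class), and translate_nocase is assumed to be a plain tolower.

```cpp
#include <bitset>
#include <cassert>
#include <cctype>

// Hypothetical stand-in for regex_traits::translate_nocase on narrow chars.
inline unsigned char translate_nocase(unsigned char c) {
    return static_cast<unsigned char>(std::tolower(c));
}

// At regex compile time: enumerate [from, to], translate each character,
// and record the translated value in the bitset.
inline std::bitset<256> make_nocase_set(unsigned char from, unsigned char to) {
    assert(from <= to);
    std::bitset<256> set;
    for (unsigned c = from; c <= to; ++c) {
        set.set(translate_nocase(static_cast<unsigned char>(c)));
    }
    return set;
}

// At match time: translate the input character and test its bit.
inline bool matches_nocase(const std::bitset<256>& set, unsigned char ch) {
    return set.test(translate_nocase(ch));
}
```

Note that in_range_nocase is never called on this path, which is exactly the concern: a traits class that customises in_range_nocase but not translate_nocase would see its customisation silently ignored here.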

> For wide character sets, implementors will in all likelihood be storing
> a sparse vector of [from, to] ranges. Matching happens by calling
> regex_traits::in_range_nocase(from, to, *in). What does this function
> do? Whatever the regex_traits implementer wants it to. They can simply
> use ctype::toupper and ctype::tolower and do the easy thing. Or they can
> do full-on Unicode case folding if they want.
>
> I think it's OK for TR1 (and C++0x?) to specify the default
> regex_traits::in_range_nocase behavior solely in terms of ctype. The
> hope is that eventually, C++ will get real Unicode support, and we can
> require more of regex_traits then. But the key is giving people a way to
> get full Unicode support if they so choose.

I agree Unicode support is clearly desirable; however, on a point of
procedure, I believe it's too late to change this for TR1, though changes
for C++0x are clearly still possible. Either way, we need to file this as a
DR.
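
As an illustration of the "easy thing" in the quoted proposal, here is one possible reading of a ctype-based default. The two function names come from the proposal; making them free function templates and adding a std::locale parameter are assumptions made only to keep the sketch self-contained.

```cpp
#include <locale>

// Plain range test, as in the proposed in_range(from, to, ch).
template <class Char>
bool in_range(Char from, Char to, Char ch) {
    return from <= ch && ch <= to;
}

// Case-insensitive range test defined solely in terms of std::ctype:
// ch matches if it, its lower-case form, or its upper-case form lies
// in [from, to]. The loc parameter is an assumption of this sketch.
template <class Char>
bool in_range_nocase(Char from, Char to, Char ch,
                     const std::locale& loc = std::locale()) {
    const std::ctype<Char>& ct = std::use_facet<std::ctype<Char>>(loc);
    return in_range(from, to, ch)
        || in_range(from, to, ct.tolower(ch))
        || in_range(from, to, ct.toupper(ch));
}
```

Because ctype only offers 1:1 mappings, this default cannot express full Unicode case folding; a traits implementer wanting that would have to replace the body with their own folding data.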

> As an interesting data point, I wonder how the regex package in ICU
> handles this.
>
>
> <aside>
> Have you seen http://www.unicode.org/reports/tr18/? It's the Unicode
> Consortium's recommendations for Unicode-compliant regex. Very sobering.
> I would like our goal with the regex_traits to be to provide hooks so
> that full Level 3 compliance with TR18 is possible (but not required).
> We're far from that goal now. I'm pretty sure we couldn't even provide
> Level 1, which requires the proper handling of surrogate pairs.
> </aside>

Yes, I'm familiar with that. I had to twist the regex interface a little to
handle different Unicode encoding formats, but you can now search UTF-8,
UTF-16 or UTF-32 text with Boost.Regex (see
libs/regex/doc/icu_strings.html).

Boost.Regex conformance to the Unicode Regex TR is documented at
libs/regex/doc/standards.html.

And you're right, there's still plenty to do.

The most pressing point for level 1 support is section 1.5 Caseless
Matching: "Supported, note that at this level, case transformations are
1:1, many to many case folding operations are not supported (for example "ß"
to "SS"). "

This is a real gotcha: neither your suggested APIs above nor the C/C++
APIs provide support for many-to-many case transformations. The problem to
be solved here is analogous to that of canonical equivalence: it would be
much easier to solve by processing both the string and the expression into
the same normalised form (think iterator adapters), except that wouldn't be
ECMA compatible again :-(
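
A small sketch of why a character-to-character interface cannot express this. The helper names simple_fold and full_fold are hypothetical; the only non-ASCII mapping included is the full folding of U+00DF to "ss" as listed in CaseFolding.txt, and everything else is ASCII-only for brevity.

```cpp
#include <string>

// 1:1 folding: one code point in, one code point out (ASCII only here).
// U+00DF has no single-character fold, so it is returned unchanged.
inline char32_t simple_fold(char32_t c) {
    if (c >= U'A' && c <= U'Z') {
        return static_cast<char32_t>(c + (U'a' - U'A'));
    }
    return c;
}

// Full folding works on whole strings because the result may be longer
// than the input: U+00DF expands to the two code points "ss".
inline std::u32string full_fold(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) {
        if (c == U'\u00DF') {
            out += U"ss";
        } else {
            out.push_back(simple_fold(c));
        }
    }
    return out;
}
```

Folding both the pattern and the text this way makes "Straße" and "STRASSE" compare equal, but only because the comparison is done on the normalised strings rather than one input character at a time.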

Just when you thought you had a hold on it, something comes along and bites
you on the **** :->

Ducks and runs for cover, John.

