Boost logo

Boost :

From: Eric Niebler (eric_at_[hidden])
Date: 2005-06-23 11:04:00


John Maddock wrote:
>
> I'm assuming that the correct semantics are:
>
> "Given a character range [c1-c2] then character c3 matches this range if
> there is some character c4, that is equivalent to c3, and whose character
> code lies in the range [c1-c2]."

Yes.

>
> As I understand it ECMA script assumes that for any character c, there is at
> most one other character that is considered equivalent to it (it's opposite
> case), which is clearly implementable either by providing toupper/tolower
> members to the traits class, or just an "other_case" member function. It's
> easily implementable in terms of ctype as well. Whew.

Yes.

>
> Unfortunately that's not the end of the story, for Unicode characters there
> are more than two cases (Upper Lower and Title cases)

Yes.

, and that's without
> considering other forms of equivalence - for example the Angstrom symbol
> (U+212B) and character U+00C5 ("LATIN CAPITAL LETTER A WITH RING ABOVE") are
> normally considered as equivalents, likewise full and half width Hangul
> characters may be considered as equivalent for some uses.

No. We're only talking about case folding -- specifically the mappings
found in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.

>
> The trouble is, for wide character ranges, there is no way we could
> enumerate all the equivalents for the whole range and build a big list of
> all the characters that could match - potentially there are thousands of
> them.

Right. I don't think we have to.

>
> However, if we try to do it on the fly things get expensive - having an API
> that returns a string containing all equivalents to a character is a
> non-starter IMO, it would potentially be called for every single input
> character in the string being matched, allocating and returning a string for
> each one would grind performance to a halt.

"string_type" would have to be a type that uses the Small String
Optimization to avoid allocs. These strings will be short (a max of 4
characters for simple Unicode case folding). But see below.

   We could change the API to
> return a pair of iterators, or even something like:
>
> charT enumerate_equivalents(charT c); // return the next character
> equivalent to c
>
> But instead of essentially one operation per input character, we'd have N
> operations, if there are N equivalent characters. Which may or may not be
> an issue in practice.
>
> The main objection I have is that these API's are a lot harder to implement
> than the existing interface - currently in most cases a tolower will do the
> job, and even for fairly strict Unicode conformance a tolower(toupper(c))
> will work.

I don't see how. Can you explain?

   If the interface is too hard to implement then it becomes
> useless as a traits class, and we might just as well get rid of it (I'd
> really rather not go down this road, but it has been suggested before).
>

The regex_traits should make it possible for implementers to do full
Unicode case folding *if they desire*. That's not the case now.

Here's my suggestion. We add to the traits class two functions:

bool in_range(Char from, Char to, Char ch);
bool in_range_nocase(Char from, Char to, Char ch);

We define the behavior of the regex engine in terms of these functions,
but we don't require their use. In particular, for narrow character
sets, implementers would be free to use a std::bitset<256>, enumerate
the char range [from, to], call translate_nocase on each char, and set
the appropriate bit in the bitset. Matching happens by calling
translate_nocase on the input char and seeing if its bit is set in the
bitset. That gives the same behavior.

For wide character sets, implementors will in all likelyhood be storing
a sparse vector of [from, to] ranges. Matching happens by calling
regex_traits::in_range_nocase(from, to, *in). What does this function
do? Whatever the regex_traits implementer want it to. They can simply
use ctype::toupper and ctype::tolower and do the easy thing. Or they can
do full-on Unicode case folding if they want.

I think it's OK for TR1 (and C++0x?) to specify the default
regex_traits::in_range_nocase behavior solely in terms of ctype. The
hope is that eventually, C++ will get real Unicode support, and we can
require more of regex_traits then. But the key is giving people a way to
get full Unicode support if they so choose.

As an interesting data point, I wonder how the regex package in ICU
handles this.

<aside>
Have you seen http://www.unicode.org/reports/tr18/? It's the Unicode
Consortium's recommendations for Unicode-compliant regex. Very sobering.
I would like our goal with the regex_traits to be to provide hooks so
that full Level 3 compliance with TR18 is possible (but not required).
We're far from that goal now. I'm pretty sure we couldn't even provide
Level 1, which requires the proper handling of surrogate pairs.
</aside>

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk