Boost logo

Boost :

From: John Maddock (john_at_[hidden])
Date: 2005-06-23 06:20:07


> I've had a nagging feeling about character ranges and the TR1 regex
> proposal, so today I did some digging. I've found what I think is a
> showstopping issue for TR1 regex. I hope I'm wrong.
>
> The situation of interest is described in the ECMAScript specification
> (ECMA-262), section 15.10.2.15:
>
> "Even if the pattern ignores case, the case of the two ends of a range
> is significant in determining which characters belong to the range.
> Thus, for example, the pattern /[E-F]/i matches only the letters E, F,
> e, and f, while the pattern /[E-f]/i matches all upper and lower-case
> ASCII letters as well as the symbols [, \, ], ^, _, and `."
>
> A more interesting case is what should happen when doing a
> case-insentitive match on a range such as [Z-a]. It should match z, Z,
> a, A and the symbols [, \, ], ^, _, and `. This is not what happens with
> Boost.Regex (it throws an exception from the regex constructor).
>
> The tough pill to swallow is that, given the specification in TR1, I
> don't think there is any effective way to handle this situation.
> According to the spec, case-insensitivity is handled with
> regex_traits<>::translate_nocase(CharT) -- two characters are equivalent
> if they compare equal after both are sent through the translate_nocase
> function. But I don't see any way of using this translation function to
> make character ranges case-insensitive. Consider the difficulty of
> detecting whether "z" is in the range [Z-a]. Applying the transformation
> to "z" has no effect (it is essentially std::tolower). And we're not
> allowed to apply the transformation to the ends of the range, because as
> ECMA-262 says, "the case of the two ends of a range is significant."
>
> So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix
> is to redefine translate_nocase to return a string_type containing all
> the characters that should compare equal to the specified character. But
> this function is hard to implement for Unicode, and it doesn't play nice
> with the existing ctype facet. What a mess!

OK, first up I think you have discovered a problem, and in particular any
incompatibility between the TR1 proposal and ECMAScript is entirely
accidental.

I can't see any way of strictly implementing the ECMA semantics within the
scope of the current regex traits classes either.

However, as noted previously, no one has ever reported this as a bug in
practice, and anyone writing expressions like [Z-a] needs a poke with a
sharp stick :-) Obviously that's no excuse for incompatibility, just that
the breakage is less "fundamental" than it may appear at first.

Thinking out loud now, lets try and consider how this might be fixed:

I'm assuming that the correct semantics are:

"Given a character range [c1-c2] then character c3 matches this range if
there is some character c4, that is equivalent to c3, and whose character
code lies in the range [c1-c2]."

As I understand it ECMA script assumes that for any character c, there is at
most one other character that is considered equivalent to it (it's opposite
case), which is clearly implementable either by providing toupper/tolower
members to the traits class, or just an "other_case" member function. It's
easily implementable in terms of ctype as well. Whew.

Unfortunately that's not the end of the story, for Unicode characters there
are more than two cases (Upper Lower and Title cases), and that's without
considering other forms of equivalence - for example the Angstrom symbol
(U+212B) and character U+00C5 ("LATIN CAPITAL LETTER A WITH RING ABOVE") are
normally considered as equivalents, likewise full and half width Hangul
characters may be considered as equivalent for some uses.

The trouble is, for wide character ranges, there is no way we could
enumerate all the equivalents for the whole range and build a big list of
all the characters that could match - potentially there are thousands of
them.

However, if we try to do it on the fly things get expensive - having an API
that returns a string containing all equivalents to a character is a
non-starter IMO, it would potentially be called for every single input
character in the string being matched, allocating and returning a string for
each one would grind performance to a halt. We could change the API to
return a pair of iterators, or even something like:

charT enumerate_equivalents(charT c); // return the next character
equivalent to c

But instead of essentially one operation per input character, we'd have N
operations, if there are N equivalent characters. Which may or may not be
an issue in practice.

The main objection I have is that these API's are a lot harder to implement
than the existing interface - currently in most cases a tolower will do the
job, and even for fairly strict Unicode conformance a tolower(toupper(c))
will work. If the interface is too hard to implement then it becomes
useless as a traits class, and we might just as well get rid of it (I'd
really rather not go down this road, but it has been suggested before).

OK, that's my brain dump over for now!

Regards, John.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk