Boost logo

Boost :

From: Eric Niebler (eric_at_[hidden])
Date: 2005-06-20 01:25:19


I've had a nagging feeling about character ranges and the TR1 regex
proposal, so today I did some digging. I've found what I think is a
showstopping issue for TR1 regex. I hope I'm wrong.

The situation of interest is described in the ECMAScript specification
(ECMA-262), section 15.10.2.15:

"Even if the pattern ignores case, the case of the two ends of a range
is significant in determining which characters belong to the range.
Thus, for example, the pattern /[E-F]/i matches only the letters E, F,
e, and f, while the pattern /[E-f]/i matches all upper and lower-case
ASCII letters as well as the symbols [, \, ], ^, _, and `."

A more interesting case is what should happen when doing a
case-insentitive match on a range such as [Z-a]. It should match z, Z,
a, A and the symbols [, \, ], ^, _, and `. This is not what happens with
Boost.Regex (it throws an exception from the regex constructor).

The tough pill to swallow is that, given the specification in TR1, I
don't think there is any effective way to handle this situation.
According to the spec, case-insensitivity is handled with
regex_traits<>::translate_nocase(CharT) -- two characters are equivalent
if they compare equal after both are sent through the translate_nocase
function. But I don't see any way of using this translation function to
make character ranges case-insensitive. Consider the difficulty of
detecting whether "z" is in the range [Z-a]. Applying the transformation
to "z" has no effect (it is essentially std::tolower). And we're not
allowed to apply the transformation to the ends of the range, because as
ECMA-262 says, "the case of the two ends of a range is significant."

So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix
is to redefine translate_nocase to return a string_type containing all
the characters that should compare equal to the specified character. But
this function is hard to implement for Unicode, and it doesn't play nice
with the existing ctype facet. What a mess!

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk