Boost logo

Boost :

From: Eric Niebler (eric_at_[hidden])
Date: 2006-01-03 12:22:44

John Maddock wrote:
>>The regex "a(b)?c\\1d" successfully matches the string "acd". It
>>shouldn't. A back-reference to a sub-matche that didn't participate in
>>the match should not match. Perl, python and xpressive all agree on
>>this point.
> Hmmm, I never did know what to do in that case, and I suspect Boost.Regex
> has changed several times in this area. Eric, you're right about Perl, and
> PCRE, but can you take a look at the ECMAScript spec and double check my
> reading of it: it looks to me as if section AtomEscape Item 8.3
> requires backreferences to non-matched sub-expressions to be treated as if
> they were simply skipped?
> I'm sure I deliberately made it behave that way, but it's all lost in the
> mists of time as to why: must remember to add to more comments, a New Years
> Resolution!!!

Oh, my! You're right. "If the regular expression has n or more capturing
parentheses but the nth one is undefined because it hasn't captured
anything, then the backreference always succeeds."

Yuk! This makes ECMA (and hence C++) regular expressions differ from
perl/python/PCRE in a fundamental way, and disallows at least one useful
and idiomatic perl regex construct. How unfortunate.

>>As discussed previously, Boost.Regex treats [a-Z] as a legal regex,
>>but it isn't. 'a' is 97 and 'Z' is 90, which makes this character
>>range ill-formed, even when icase is specified.
> Regex rejects [a-Z] correctly without icase, but accepts it incorrectly when
> icase is on. The problem here is the desire to support Unicode, we could
> either:
> 1) Do what ICU does and enumerate every character in the range [x-y] and
> convert it to it's case-folded equivalent. The trouble is it's
> pathologically slow for very large ranges. Of course if you use a very
> large range you get exactly what you deserve!

Please, no.

> 2) Do what I presume xpressive does (haven't looked, sorry in a rush today!)
> and match a character c if c has an equivalent case in the range x-y. This
> is what ECMAScript specifies I think? The problem is case folding in
> Unicode is *very* complicated, and using toupper/tolower to map case
> equivalents doesn't even come close (there are at least three cases for a
> start). As far as I know you more or less *have* to do case folding to get
> the right answer for Unicode characters.
> 3) Do what Boost.Regex does and do case-folding on the fly. It works
> correctly for 98% of cases, but accepts a few invalid expressions
> incorrectly. At present I'm inclined to suggest that anyone who uses [a-Z]
> in a regex gets what they deserve anyway :-) So I'm not going to fix this
> one for now: having it as a test case is questionable though ;-)

4) Punt the decision to the traits type :-) For xpressive, I added a
in_range_nocase(Char a, Char b) member to the traits concept. By default
the traits provided by xpressive do *not* do proper case folding. They
just use toupper and tolower, and are documented as such. An ambitious
person can write their own trait to do proper Unicode case folding and
get the right behavior.

This interface works, but it's sub-optimal since it causes more calls to
toupper/tolower than are necessary. I have a mental sticky note to find
a better interface.

>>When matching "a(b+|((c)*))+d" against "abcd", Boost.Regex says the
>>third sub-match should be "c", but perl says it should not participate
>>in the match. I think perl is right here. The logic is: the quantified
>>group 1 will match at least 3 times: first, it eats the b, next it
>>eats the c, and finally, it matches an empty string. On this last
>>iteration, the quantified group 3 will match zero times; hence, it
>>has not participated in the match. (FWIW, xpressive has a bug in this
>>area too, which I'm working on fixing.)
> Actually it's only one case where this is relevent I believe. First note
> that all the cases that use POSIX extended syntax are matched using the
> "Leftmost-Longest" rule, so for example "a(b+|((c)*))+d" against "abcd" then
> the last match of $1, $2, $3 id against the "c" as this is "better" (The
> same in $0 as your case but both longer and more to the left in $1/$2/$3
> than the alternative you give).
> In the Perl cases, the first one a(b+|((c)*))+d against "abd" looks correct
> to me: $1 and $2 match the empty string after the "b", and $3 is unmatched.
> In the second case a(b+|((c)*))+d against "abcd" things get tricky:
> Boost.Regex does repeat $1 and $2 so that they match the null string after
> the "c", however $3 doesn't get invoked at all and is left pointing to the
> last thing it did match (the "c").

But $3 does get involved. It is matched 0 times, and should be undefined.

   I can't see anything in ECMAScript to
> indicate what the "right" behaviour is here. I don't know what Perl does,
> but PCRE (and hence python) does the same as Boost.Regex.
> There's a whole series of possible expressions like this:
> (a(((x)))b|a(((y)))b)+
> The question being whether each time we take an alternative or repeat all
> the nested sub-expressions get unset, or whether they continue to remember
> what they matched last time through. Unsetting them looks tricky IMO.
> Looking into this some more: PCRE appears to never unset them, Perl appears
> to unset them, but only in this specific case:
> $string="abcd";
> $string =~ s/a(b+|(((c))*))+d/{$1}{$2}{$3}{$4}/;
> print($string);
> prints {}{}{c}{c} which is the same as Boost.Regex and PCRE

Yes, I believe this is a Perl bug, and I filed it yesterday.

> $string="abcd";
> $string =~ s/a(b+|((c)*))+d/{$1}{$2}{$3}/;
> print($string);
> prints {}{}{} which is just plain weird - inconsistent certainly - and the
> extra parenthesis shouldn't have effected $3 one jot. I'm verging on
> suggesting that this is a Perl bug.

Inconsistent. Yes. Weird? No. The simple rule is: when you're about to
start repeating a group, that capture and all captures within are first
set to undefined. (ECMA-262 Now that I look more closely, I
see that ECMA is stricter about setting captures to undefined in these
situations than Perl is, and xpressive is non-compliant in this area,
too. <sigh>

ECMA-262 gives this example:

   which returns the array
   ["zaacbbbcac", "z", "ac", "a", undefined, "c"]

Eric Niebler
Boost Consulting

Boost list run by bdawes at, gregod at, cpdaniel at, john at