Boost logo

Boost :

From: John Maddock (john_at_[hidden])
Date: 2006-01-03 08:54:58


> The regex "a(b)?c\\1d" successfully matches the string "acd". It
> shouldn't. A back-reference to a sub-matche that didn't participate in
> the match should not match. Perl, python and xpressive all agree on
> this point.

Hmmm, I never did know what to do in that case, and I suspect Boost.Regex
has changed several times in this area. Eric, you're right about Perl, and
PCRE, but can you take a look at the ECMAScript spec and double check my
reading of it: it looks to me as if section 15.10.2.9 AtomEscape Item 8.3
requires backreferences to non-matched sub-expressions to be treated as if
they were simply skipped?

I'm sure I deliberately made it behave that way, but it's all lost in the
mists of time as to why: must remember to add to more comments, a New Years
Resolution!!!

> As discussed previously, Boost.Regex treats [a-Z] as a legal regex,
> but it isn't. 'a' is 97 and 'Z' is 90, which makes this character
> range ill-formed, even when icase is specified.

Regex rejects [a-Z] correctly without icase, but accepts it incorrectly when
icase is on. The problem here is the desire to support Unicode, we could
either:

1) Do what ICU does and enumerate every character in the range [x-y] and
convert it to it's case-folded equivalent. The trouble is it's
pathologically slow for very large ranges. Of course if you use a very
large range you get exactly what you deserve!
2) Do what I presume xpressive does (haven't looked, sorry in a rush today!)
and match a character c if c has an equivalent case in the range x-y. This
is what ECMAScript specifies I think? The problem is case folding in
Unicode is *very* complicated, and using toupper/tolower to map case
equivalents doesn't even come close (there are at least three cases for a
start). As far as I know you more or less *have* to do case folding to get
the right answer for Unicode characters.
3) Do what Boost.Regex does and do case-folding on the fly. It works
correctly for 98% of cases, but accepts a few invalid expressions
incorrectly. At present I'm inclined to suggest that anyone who uses [a-Z]
in a regex gets what they deserve anyway :-) So I'm not going to fix this
one for now: having it as a test case is questionable though ;-)

> When matching "a(b+|((c)*))+d" against "abcd", Boost.Regex says the
> third sub-match should be "c", but perl says it should not participate
> in the match. I think perl is right here. The logic is: the quantified
> group 1 will match at least 3 times: first, it eats the b, next it
> eats the c, and finally, it matches an empty string. On this last
> iteration, the quantified group 3 will match zero times; hence, it
> has not participated in the match. (FWIW, xpressive has a bug in this
> area too, which I'm working on fixing.)

Actually it's only one case where this is relevent I believe. First note
that all the cases that use POSIX extended syntax are matched using the
"Leftmost-Longest" rule, so for example "a(b+|((c)*))+d" against "abcd" then
the last match of $1, $2, $3 id against the "c" as this is "better" (The
same in $0 as your case but both longer and more to the left in $1/$2/$3
than the alternative you give).

In the Perl cases, the first one a(b+|((c)*))+d against "abd" looks correct
to me: $1 and $2 match the empty string after the "b", and $3 is unmatched.

In the second case a(b+|((c)*))+d against "abcd" things get tricky:
Boost.Regex does repeat $1 and $2 so that they match the null string after
the "c", however $3 doesn't get invoked at all and is left pointing to the
last thing it did match (the "c"). I can't see anything in ECMAScript to
indicate what the "right" behaviour is here. I don't know what Perl does,
but PCRE (and hence python) does the same as Boost.Regex.

There's a whole series of possible expressions like this:

(a(((x)))b|a(((y)))b)+

The question being whether each time we take an alternative or repeat all
the nested sub-expressions get unset, or whether they continue to remember
what they matched last time through. Unsetting them looks tricky IMO.
Looking into this some more: PCRE appears to never unset them, Perl appears
to unset them, but only in this specific case:

$string="abcd";
$string =~ s/a(b+|(((c))*))+d/{$1}{$2}{$3}{$4}/;
print($string);

prints {}{}{c}{c} which is the same as Boost.Regex and PCRE

$string="abcd";
$string =~ s/a(b+|((c)*))+d/{$1}{$2}{$3}/;
print($string);

prints {}{}{} which is just plain weird - inconsistent certainly - and the
extra parenthesis shouldn't have effected $3 one jot. I'm verging on
suggesting that this is a Perl bug.

John.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk