Boost logo

Boost :

From: John Maddock (john_at_[hidden])
Date: 2006-01-03 08:54:58

> The regex "a(b)?c\\1d" successfully matches the string "acd". It
> shouldn't. A back-reference to a sub-matche that didn't participate in
> the match should not match. Perl, python and xpressive all agree on
> this point.

Hmmm, I never did know what to do in that case, and I suspect Boost.Regex
has changed several times in this area. Eric, you're right about Perl, and
PCRE, but can you take a look at the ECMAScript spec and double check my
reading of it: it looks to me as if section AtomEscape Item 8.3
requires backreferences to non-matched sub-expressions to be treated as if
they were simply skipped?

I'm sure I deliberately made it behave that way, but it's all lost in the
mists of time as to why: must remember to add to more comments, a New Years

> As discussed previously, Boost.Regex treats [a-Z] as a legal regex,
> but it isn't. 'a' is 97 and 'Z' is 90, which makes this character
> range ill-formed, even when icase is specified.

Regex rejects [a-Z] correctly without icase, but accepts it incorrectly when
icase is on. The problem here is the desire to support Unicode, we could

1) Do what ICU does and enumerate every character in the range [x-y] and
convert it to it's case-folded equivalent. The trouble is it's
pathologically slow for very large ranges. Of course if you use a very
large range you get exactly what you deserve!
2) Do what I presume xpressive does (haven't looked, sorry in a rush today!)
and match a character c if c has an equivalent case in the range x-y. This
is what ECMAScript specifies I think? The problem is case folding in
Unicode is *very* complicated, and using toupper/tolower to map case
equivalents doesn't even come close (there are at least three cases for a
start). As far as I know you more or less *have* to do case folding to get
the right answer for Unicode characters.
3) Do what Boost.Regex does and do case-folding on the fly. It works
correctly for 98% of cases, but accepts a few invalid expressions
incorrectly. At present I'm inclined to suggest that anyone who uses [a-Z]
in a regex gets what they deserve anyway :-) So I'm not going to fix this
one for now: having it as a test case is questionable though ;-)

> When matching "a(b+|((c)*))+d" against "abcd", Boost.Regex says the
> third sub-match should be "c", but perl says it should not participate
> in the match. I think perl is right here. The logic is: the quantified
> group 1 will match at least 3 times: first, it eats the b, next it
> eats the c, and finally, it matches an empty string. On this last
> iteration, the quantified group 3 will match zero times; hence, it
> has not participated in the match. (FWIW, xpressive has a bug in this
> area too, which I'm working on fixing.)

Actually it's only one case where this is relevent I believe. First note
that all the cases that use POSIX extended syntax are matched using the
"Leftmost-Longest" rule, so for example "a(b+|((c)*))+d" against "abcd" then
the last match of $1, $2, $3 id against the "c" as this is "better" (The
same in $0 as your case but both longer and more to the left in $1/$2/$3
than the alternative you give).

In the Perl cases, the first one a(b+|((c)*))+d against "abd" looks correct
to me: $1 and $2 match the empty string after the "b", and $3 is unmatched.

In the second case a(b+|((c)*))+d against "abcd" things get tricky:
Boost.Regex does repeat $1 and $2 so that they match the null string after
the "c", however $3 doesn't get invoked at all and is left pointing to the
last thing it did match (the "c"). I can't see anything in ECMAScript to
indicate what the "right" behaviour is here. I don't know what Perl does,
but PCRE (and hence python) does the same as Boost.Regex.

There's a whole series of possible expressions like this:


The question being whether each time we take an alternative or repeat all
the nested sub-expressions get unset, or whether they continue to remember
what they matched last time through. Unsetting them looks tricky IMO.
Looking into this some more: PCRE appears to never unset them, Perl appears
to unset them, but only in this specific case:

$string =~ s/a(b+|(((c))*))+d/{$1}{$2}{$3}{$4}/;

prints {}{}{c}{c} which is the same as Boost.Regex and PCRE

$string =~ s/a(b+|((c)*))+d/{$1}{$2}{$3}/;

prints {}{}{} which is just plain weird - inconsistent certainly - and the
extra parenthesis shouldn't have effected $3 one jot. I'm verging on
suggesting that this is a Perl bug.


Boost list run by bdawes at, gregod at, cpdaniel at, john at