Boost Users :
From: John Maddock (john_maddock_at_[hidden])
Date: 2002-08-16 06:16:35
> I sent this originally to James Maddock, but realized this is
> probably a better place to post it.
Sorry it took me a while to get around to it.
> At work we've been testing out regex (nice work BTW) in some of our
> code, and appear to have found a bug. We ran into it parsing HTML,
> and I've written a test C++ app to reproduce it.
>
> In the program below, the output should be the same for both searches
> as far as I can tell, but it's not. I don't know if it's some
> interaction with the quote character or something like that. We
> attempted to use other quantifiers (after '?', we tried '*', '{0,1}',
> ["]?) to no avail. I'm confident this is not user error. The extra
> grouping is annoying (in "goodPatternStr"), but is an acceptable
> workaround. The strange thing is that a non-capturing group doesn't
> fix it.
>
> Ideas?
I have an answer for you, but I don't think you're going to like it: it
comes down to how the "leftmost longest" rules are applied.
What's happening here is that $1 is being matched, but it's matching the
null string just before the \" (at character 26, I think it was). The
alternative (the one you expected) would have matched starting at character
27 (just one to the right of the \"). So the match found is in some sense
"better" (further to the left) than the one you expected. I think I'm going
to have to switch to Perl matching rules so I can stop explaining this...
:-)
A simpler solution to your problem is to use a + quantifier rather than a *,
so that it can't match the null string:
const char* badPatternStr = "<input[^>]*name=\"?([^> \"]+)[^>]*value=\"?([^> \"]+)";
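To see the null-string effect in isolation, here is a minimal sketch. It uses std::regex rather than Boost.Regex, and the helper name firstGroup is made up for illustration. It does not reproduce the original test program, and it does not depend on leftmost-longest rules; it only shows that a * group can succeed with an empty match further to the left, while a + group cannot.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns "<offset>:<text>" for capture group 1 of the
// first match found by a search, or "none" if the pattern does not match.
std::string firstGroup(const std::string& pattern, const std::string& text) {
    std::smatch m;
    if (!std::regex_search(text, m, std::regex(pattern))) {
        return "none";
    }
    return std::to_string(m.position(1)) + ":" + m.str(1);
}
```

Called on the five-character text "foo" (quotes included), firstGroup("([^> \"]*)", text) reports an empty capture at offset 0, just before the opening quote, while firstGroup("([^> \"]+)", text) reports foo at offset 1.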
Hope this helps,
John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm