Boost logo

Boost Users :

Subject: Re: [Boost-users] [regex] why partial match / early break ?
From: Anthony Foiani (tkil_at_[hidden])
Date: 2011-11-01 14:10:26


"U.Mutlu" <for-gmane_at_[hidden]> writes:

> why is this regex
> const string sRe =
> "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9])\\.)+"
> "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))";
>
> not matching this string wholly?
> "a1a.a2a.a3a.a4aaaa"
>
> It rather matches only this part:
> "a1a.a2a.a3a.a4"
>
> What's the problem here?
> (I know I can append a delimiter to solve the problem,
> but I think the regex should've match it wholly, shouldn't it?)

I think you're getting caught by "first match, not longest match".
Put differently, the regex engine isn't backtracking where you think
it is. Looking at the last segment of your target string, the match
goes like so:

  ((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))
     ^
Ok, "a" matches this charset.

  ((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))
                                       ^

But "4" doesn't match ([a-zA-Z]), so we call the quantifier done.

  ((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))
                                        ^

The "4" does match here, and that completes the match.

As you said, you can put ^ and $ if you want to force the match to the
entire string.

I'm also curious if you're capturing those substrings for some
purpose? If not, you can simplify it quite a bit (and probably make
it faster, since regular parens require the RE engine to do a fair bit
of extra work):

  ([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+

Becomes:

  [a-zA-Z][a-zA-Z0-9\\-]*

Also, note that "-" should be treated as a regular character if it's
at the beginning or end of a character range, so we can drop those
backslashes:

  [a-zA-Z][a-zA-Z0-9-]*

And I'd even consider using the preprocessor to make it a bit more
readable:

  #define LETTER "a-zA-Z"
  #define DIGIT "0-9"
  #define DASH "-"

  #define ELEMENT \
    "[" LETTER "]" "[" LETTER DIGIT DASH "]*" "[" LETTER DIGIT "]"

  const string re( "(?:" ELEMENT "\\." ")+" ELEMENT );

  #undef ELEMENT
  #undef DASH
  #undef DIGIT
  #undef LETTER

(I suppose that using actual C++ strings, with a compiler that can do
string expression evaluation at compile time, would be just as good if
not preferable...)

This also makes clear the (possibly surprising) issue that you won't
match single-character elements; I don't know if that's intentional or
not.

If you want to say "an element starts with a letter followed by any
number of letters, digits, or dashes, but must end with a letter or
digit", then you might prefer:

  #define ELEMENT \
    "[" LETTER "]" "[" LETTER DIGIT DASH "]*" "(?<!" DASH ")"

(Although look-behind is a mildly uncommon construct, so that might
cause some consternation in code review...)

Finally, if you're doing validation on each element individually
anyway, you might simply allow the RE to match on the basic pattern:

  #define ELEMENT \
    "[" LETTER "]" "[" LETTER DIGIT DASH "]*"

And then do a procedural check for a "-" at the end of each match.

Best regards,
Tony


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net