Boost Users :

From: John Maddock (john_at_[hidden])
Date: 2006-12-18 12:31:20


Detlef Meyer-Eltz wrote:
> I have difficulty predicting which part of a regular expression
> will match.
>
> Example:
> I have a regular expression for a general HTML tag: <[^>]*>
> combined with an expression for the body tag: <body([^>]*)>
>
> to: (<[^>]*>)|(<body([^>]*)>)
>
> This expression matches the text: <body bgcolor="white">
>
> As both alternatives can match the input with the same length, I
> expected that the repeated fourth part of the "Leftmost Longest" rule
> would determine which alternative is chosen:
>
> 4. Find the match which has matched the first sub-expression in the
> leftmost position, along with any ties. If there is only on(e)
> such match possible then return it.
>
> // note the missing 'e'
>
> As the tag-expression has no sub-expression at all, the
> body-expression should win. Its sub-expression could match, but
> doesn't. It seems to me that the sequence of the alternatives
> determines the match.
>
> Now I guess that I misinterpreted 4.: it's not a means to predict the
> matching alternative but only to find the one that matched
> accidentally? My software constructs lexers from elementary
> expressions automatically. So it's important for me to direct and
> predict the expected matching alternative. Are there any other rules?
> Does the sequence of the alternatives determine the match
> unmistakably?

Which Boost.Regex version are you using, and how are you compiling the
expression?

Recent versions default to the Perl matching rules, *which do not use the
leftmost-longest rule*. They match based on a "first match found" rule, so
if the first alternative leads to a match then subsequent alternatives are
never examined.
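
To illustrate the "first match found" rule, here is a minimal sketch. It uses
std::regex with its default ECMAScript grammar as a stand-in for Boost.Regex's
Perl mode (an assumption: both try alternatives from left to right), and
group_matched is a hypothetical helper, not part of either library. It reports
whether a given capture group took part in the match.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns true if `pattern` matches somewhere in `text`
// and capture group number `group` took part in that match.
// Assumption: std::regex's default ECMAScript grammar stands in for the
// Perl matching rules, since it also tries alternatives left to right.
bool group_matched(const std::string& pattern,
                   const std::string& text,
                   std::size_t group)
{
    const std::regex re(pattern);  // default grammar: ECMAScript
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return false;              // no match at all
    return group < m.size() && m[group].matched;
}
```

With the expression from the question and the input <body bgcolor="white">,
group 2 (the body alternative) does not take part, because the general tag
alternative comes first and already matches. Swapping the alternatives to
(<body([^>]*)>)|(<[^>]*>) makes group 1 (now the body alternative) take part
instead. So under these rules the order of the alternatives does decide the
outcome.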

If you really want leftmost-longest semantics, then compile the expression
as a POSIX extended regex, but of course then you lose the ability to use
Perl-like regex extensions.
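
The difference between the two rule sets is easiest to see when the
alternatives match different lengths. A hedged sketch, again with std::regex
standing in for Boost.Regex (assumptions: std::regex::extended selects POSIX
extended syntax with leftmost-longest matching, and first_match is a
hypothetical helper):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first match of `pattern`
// in `text` under the given grammar, or an empty string if nothing matches.
// Assumption: std::regex::ECMAScript plays the role of the Perl rules and
// std::regex::extended the role of a POSIX extended regex.
std::string first_match(const std::string& pattern,
                        const std::string& text,
                        std::regex::flag_type grammar)
{
    const std::regex re(pattern, grammar);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

For the pattern <b|<body and the input <body>, the ECMAScript grammar returns
<b (the first alternative that matches), while the extended grammar is
expected to return <body (the longest match). In Boost.Regex the
corresponding flags are boost::regex::perl and boost::regex::extended.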

HTH, John.

PS: your analysis of the leftmost-longest rule looks correct, however.
