|
Boost : |
From: John Maddock (John_Maddock_at_[hidden])
Date: 1999-09-05 05:36:42
>I directed a friend of mine who is a regex guru at your package. He said
it
was interesting, but, in his words "it uses POSIX syntax but unfortunately
doesn't have POSIX semantics. He says he saw the words 'leftmost longest'
in
the POSIX documentation but didn't know what they meant so he implemented
what Perl does". Now maybe my friend was misinterpreting what you wrote,
but
what he says POSIX does is to build a DFA for the expression which
recognizes ALL alternatives simultaneously. The Perl style of interpreting
regular expressions leads to backtracking and all sorts of other
inefficiencies, including failure to recognize strings which match certain
expressions or combinatoric explosion. I am strongly in favor of a DFA
implementation.<
Your half right and half wrong - it does do leftmost longest matching as
specified by POSIX - except that the diffinition in the standard is known
to be deeply ambiguous - there are some defect reports on the web
somewhere. Having said that the defects in the standard relate only to sub
expression matching, not the overall match. Regex++ finds the longest
possible match for each sub-expression, but there are other possible
interpretations (for example the "leftmost" part of the rule could be
applied to sub-expressions as well as the overall match), in any event it
does *not* do what perl does, although it is my intent to add this as a
compatability option at some future time.
It is true that it is implimented as an nfa not a dfa, as I say on my web
page, regex++ is aimed more at matching/parsing than at grep-style search
applications.
If you want a dfa that's fair enough, as long as you realise that dfa's are
potentially exponential in memory usage, and can be problematic with
respect to back references - although how many people ever use that feature
I don't know... supporting multi-character collating elements also adds at
the very least a great deel of complexity to dfa implimentations (think of
[^[.ae.]] for example).
I guess what I'm trying to say is that I made an implimentation choice
based upon what I wanted the library *for*, other implimentation choices
are better for other uses - something like agrep being a good example.
There is no universal panasea for regular expressions - (or at least if
there is let me know!) - almost all implimentations have well known
"pathological" expressions that break the code:
(((.*)*)*) will break nfa's
(((.{0, 255}){0,255}){0,255}){0,255} will break dfa's
Anyway, enough, as I say there are good reasons for going either way IMO,
if you would prefer a dfa then that's no problem.
John.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk