Boost logo

Boost :

From: Dave Abrahams (abrahams_at_[hidden])
Date: 1999-09-07 07:58:59


John expressed (summarizing):
> There are several problems with dfa based implimentations -
>
> The problem here is that (expression){x,y} reqires at least:
>
> ( states_in_expression * x) + ( 2 * states_in_expression * y)
>
> states to expand out, so nested bounded repeats can quickly run your
> machine out of swap space - as the space required grows exponentially.
>
> Let me put this as an assertion: You cannot match back references using a
> dfa.
>
> *************
>
> All of the above ignores the main advantages of a dfa: you can guarentee
> polynonial worst case performance, if you're involved in setting a standard
> I can see that this would be (very) important.

All of the above pretty much matches my (weak) understanding of DFA based
matchers. Since my original regular expression experience is with lex, I
tend never to find myself wanting the troublesome features. On the other
hand, I _usually_ don't care about speed all that much in these cases.

> *************
>
> I would set three regular expression "levels":
>
> 1) full re includes back references - all bets are off - requires an nfa or
> a crippled dfa that exhibits non-deterministic behaviour.
> 2) no back references, but full sub-expression matching - dfa's can do
> this, but for carefully written expressions nfa's may be quicker (and yes
> that's a subjective un-substantiated assertion!)
> 3) no back references, no sub expression matching, dfa's start to look
> really good.
> 4) don't care at all what matched (ie grep), can use bit-parrellism to
> speed up the state machine (see glimpse/agrep).

It seems an obvious choice to at least consider providing a DFA
implementation for expressions it can handle, doesn't it?

> ************
>
>>the exact problem I had with the Perl-style match? I think it was that the
> match could get "stuck" in a greedy match of an alternative in an
> expression
> made up of terms "or"ed together... something like that. In other words, it
> is possible to write an expression of the form: A|B|C... (where A,B,C...
> are
> sub-expressions) such that it would fail to properly match text that would
> have been matched by one of A,B,C... alone.<
>
> Perl matches whatever it finds first. POSIX requires longest match, and
> this is what regex++ does.
>
> consider:
>
> a*ab against "aab" : perl finds only "aaa", POSIX/regex++ finds "aaab"
> a|ab against "ab" : perl finds only "a", regex++ finds "ab"
> a+(ab)* against "aabab" perl finds "aa", regex++ finds "aabab"

Ah, that sounds like what I was thinking about. In that case, I'm pretty
happy with regex++.

> If you want to experiment build the timer test app: you can then
> interactively enter expressions and text input and see what happens (both
> what matches and how long it takes) if nothing else you can explore the
> boundaries of pathological behaviour for nfa's !

Um, sounds like, er... fun?

> PS it occures to me that there are many other pattern matching algorithm,
> which are not currently catered for (Knuth Morris Pratt, Boyer-Moore,
> Multi-string searches etc), maybe what is needed is some kind of generic
> "pattern" encapsulation, with match, search, grep etc operations applied to
> it, just a thought.....

Sounds challenging, to say the least!


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk