
From: John Maddock (John_Maddock_at_[hidden])
Date: 1999-09-07 06:35:10


Dave,

>That's not the way I remember it. I thought the only explosion with DFAs
occurred in trying to reduce them to their most compact form... but you
don't have to do that. Anyway, I'm not expert in it. I hope someone will
correct me.<

There are several problems with DFA-based implementations -

Regular expressions as specified by either POSIX or Perl do not belong to
the set of regular languages - that is, the expressions are not "regular"!

This shows itself in the fact that the only operators allowed in a truly
regular expression are:

literals, sets [], the unbounded repeat operator *, and alternation |.

Zero-width assertions like ^ and $ can be added as a reasonably trivial
extension.

?, + and {} - i.e. bounded repeats - can be added at compile-time cost by
macro expansion - for example:

a? becomes a| (that is "a" or <nothing>)
a+ becomes aa*
a{x,y} becomes "a" repeated x times followed by a? repeated y-x times
(careful ordering of states prevents repeated checking against "a" once one
mismatch occurs).
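The rewrites above are mechanical enough to sketch. Here's a minimal
illustration in Python (the function names are mine, not regex++'s), using
only concatenation, * and |:

```python
import re

def expand_question(atom):
    # a?  ->  (a|)   i.e. "a" or <nothing>
    return "(%s|)" % atom

def expand_plus(atom):
    # a+  ->  aa*
    return atom + atom + "*"

def expand_bounded(atom, x, y):
    # a{x,y}  ->  "a" repeated x times, then a? repeated y-x times
    return atom * x + expand_question(atom) * (y - x)

# a{2,4} expands to aa(a|)(a|), which matches the same strings
pat = re.compile(expand_bounded("a", 2, 4) + "$")
assert pat.match("aaa")
assert not pat.match("a")
```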

The problem here is that (expression){x,y} requires at least:

( states_in_expression * x ) + ( 2 * states_in_expression * (y - x) )

states to expand out, so nested bounded repeats can quickly run your
machine out of swap space - the space required grows exponentially with
nesting depth.
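To see how fast this runs away, here's a back-of-the-envelope calculation
(the per-copy costs are illustrative, not regex++'s exact figures):

```python
def expanded_states(inner, x, y):
    # cost of macro-expanding expr{x,y}: x required copies plus
    # (y - x) optional copies at roughly twice the cost each
    return inner * x + 2 * inner * (y - x)

states = 1               # a single literal
for depth in range(3):   # three nested {0,10} repeats: ((a{0,10}){0,10}){0,10}
    states = expanded_states(states, 0, 10)
print(states)            # 20**3 = 8000 states from a pattern ~20 chars long
```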

There are ways to deal with this:

1) Allow a maximum number of states in the compiled DFA - any expression
which looks like exceeding it is likely to be pathological - so refuse to
compile it. As long as the maximum can be reached in a reasonable time we
should be OK.
2) Carry out compilation on demand - only generate states for problematic
cases as they become required. This passes the problem on to the search
phase: as long as we don't have *both* a pathological expression *and* a
pathological input string, we're OK. This is a complex approach, and would
still require a guard along the lines of (1).
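Option (1) can be as simple as a running state budget checked during
compilation - a hypothetical sketch (the limit here is made up, any real
one would need tuning):

```python
MAX_STATES = 100_000   # illustrative cap, not regex++'s actual limit

class PatternTooComplex(Exception):
    pass

class StateBudget:
    def __init__(self, limit=MAX_STATES):
        self.limit = limit
        self.used = 0

    def charge(self, n):
        # called each time the compiler is about to emit n more states
        self.used += n
        if self.used > self.limit:
            raise PatternTooComplex(
                "expression needs > %d states - refusing to compile" % self.limit)

budget = StateBudget(limit=10)
budget.charge(8)           # fine
try:
    budget.charge(8)       # pushes past the limit
except PatternTooComplex:
    pass                   # a real compiler would report the error to the caller
```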

OK now things start to get sticky:

Consider first sets that contain multi-character collating elements - this
is important in two cases:

The locale contains digraphs that need to be collated as a single
character;
We're using Unicode and we want to do canonical equivalence analysis (see
tr18, tr15 & tr10 at www.unicode.org).

[[.ae.]a] can be expanded to ae|[a]
[^[.ae.]a] is ambiguous (how many characters does it match?), but in any
event requires "lookahead".
[[.ae.]-b] is worse - how many characters do we match? How should we
represent this as a state machine? Probably as ae|[[.ae.]-b], so that the
leading "ae" takes care of the endpoint, and the part in the set matches
only one character, if at all.
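Python's re has no POSIX collating elements, but the first rewrite is easy
to check by hand - this only demonstrates the expanded form ae|[a], not the
bracket syntax itself:

```python
import re

# the expansion of [[.ae.]a]: the multi-character collating element becomes
# an alternative, ordered first so it gets a chance to match both characters
pat = re.compile(r"ae|[a]")

assert pat.match("ae").group() == "ae"   # the digraph matches as one element
assert pat.match("ax").group() == "a"    # otherwise a single "a" matches
```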

So far we can work around the problems, but the clincher is back
references: in order to match a back reference \1 we must have available
all the possible ways in which \1 could have matched - in effect we have a
classic NP-hard problem. There is a proof at
http://www.plover.com/~mjd/perl/NPC/NPC-3SAT.html, although to be frank my
maths isn't good enough to judge its correctness (I started life in
computational chemistry and moved over to programming from there). I can
make a good hand-waving argument to support this, but it takes a while, so
maybe some other time :-)

Let me put this as an assertion: you cannot match back references using a
DFA.

If you can prove me wrong, I'll be surprised and happy in equal measure!
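A quick demonstration of the difficulty: the engine has to remember the
exact text that group 1 matched, which no fixed, finite set of DFA states
can encode for arbitrary input. Python's backtracking re is shown here as
an example of an NFA-style engine that can do it:

```python
import re

# (a+)b\1 means "some run of a's, a b, then the *same* run again"
assert re.fullmatch(r"(a+)b\1", "aabaa").group(1) == "aa"
assert re.fullmatch(r"(a+)b\1", "aabaaa") is None   # runs differ: no match
```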

*************

All of the above ignores the main advantage of a DFA: you can guarantee
polynomial worst-case performance; if you're involved in setting a standard
I can see that this would be (very) important.

*************

I would set four regular expression "levels":

1) full REs including back references - all bets are off - requires an NFA
or a crippled DFA that exhibits non-deterministic behaviour.
2) no back references, but full sub-expression matching - DFAs can do
this, but for carefully written expressions NFAs may be quicker (and yes,
that's a subjective, unsubstantiated assertion!)
3) no back references, no sub-expression matching - DFAs start to look
really good.
4) don't care at all what matched (i.e. grep) - can use bit-parallelism to
speed up the state machine (see glimpse/agrep).
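For level (4), the bit-parallel trick is worth a sketch: below is the
classic Shift-Or algorithm (Baeza-Yates & Gonnet) - the exact-match core
that glimpse/agrep build on - in Python. The function name is mine:

```python
def shift_or_search(text, pattern):
    # One bit per pattern position; a 0 bit means "this prefix is still alive".
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, ~0) & ~(1 << i)
    state = ~0                       # all ones: no positions alive yet
    for j, c in enumerate(text):
        # one shift and one OR advance *every* partial match at once
        state = (state << 1) | masks.get(c, ~0)
        if state & (1 << (m - 1)) == 0:
            return j - m + 1         # full match ending at index j
    return -1

assert shift_or_search("xxabyy", "ab") == 2
assert shift_or_search("xxyy", "ab") == -1
```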

************

>the exact problem I had with the Perl-style match? I think it was that the
match could get "stuck" in a greedy match of an alternative in an
expression made up of terms "or"ed together... something like that. In
other words, it is possible to write an expression of the form: A|B|C...
(where A,B,C... are sub-expressions) such that it would fail to properly
match text that would have been matched by one of A,B,C... alone.<

Perl matches whatever it finds first; POSIX requires the longest match, and
that is what regex++ does.

consider:

a*ab against "aaab" : perl finds "aaab", but only after backtracking; POSIX/regex++ finds it directly
a|ab against "ab" : perl finds only "a", regex++ finds "ab"
a+(ab)* against "aabab" : perl finds "aa", regex++ finds "aabab"
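Python's re module uses the same leftmost-first rules as perl, so two of
these are easy to reproduce:

```python
import re

# alternation: the first alternative that succeeds wins, even though a
# later one would have matched more
assert re.match(r"a|ab", "ab").group() == "a"

# greedy a+ grabs "aa"; (ab)* then matches nothing, and since the whole
# pattern has succeeded there is no reason to backtrack for a longer match
assert re.match(r"a+(ab)*", "aabab").group() == "aa"
```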

If you want to experiment, build the timer test app: you can then
interactively enter expressions and input text and see what happens (both
what matches and how long it takes). If nothing else you can explore the
boundaries of pathological behaviour for NFAs!

********

>I've not looked too closely at the details yet, but it would be good if
the traits approach could be extended to allow for "drop in" alternative
regular expression syntax and/or implementations. For a Java regex package
which has this feature (by means of compiler and matcher "interfaces", and
multiple concrete implementations), see
http://www.quateams.com/oro/software/OROMatcher1.1.html.<

I'll take a look when I have a chance. The traits classes in regex++,
although at version 2, are still very much imperfect - and hence
(deliberately) not extensively documented. The idea that the traits class
provides the implementation is a good one - it would have to provide both
the compile and matching algorithms though, the latter as template-based
members if we are to allow searching of iterators (and I think we should).
If necessary we could then have "perl_regex_traits", "posix_regex_traits",
etc. Just a word of warning - there's a lot of work involved in such an
approach - maybe not to define a spec, but to carry out an implementation.
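In outline - a hypothetical Python sketch, just to show the shape of the
interface, not what regex++'s real traits classes look like - the traits
object supplies both halves and everything else is written against it:

```python
import re

class PerlTraits:
    # a traits class supplies both the compile and the matching algorithm
    def compile(self, pattern):
        return re.compile(pattern)
    def search(self, compiled, text):
        m = compiled.search(text)
        return (m.start(), m.end()) if m else None

def regex_search(pattern, text, traits=PerlTraits()):
    # the engine itself is just glue; swap traits to swap syntax/semantics
    return traits.search(traits.compile(pattern), text)

assert regex_search(r"a+", "xxaaay") == (2, 5)
```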

Anyway, this is growing into a rather long mail - time to go,

Best regards,

John.

PS: it occurs to me that there are many other pattern matching algorithms
which are not currently catered for (Knuth-Morris-Pratt, Boyer-Moore,
multi-string searches etc.) - maybe what is needed is some kind of generic
"pattern" encapsulation, with match, search, grep etc. operations applied to
it. Just a thought.....


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk