Boost logo

Boost :

From: John Maddock (john_at_[hidden])
Date: 2005-06-09 06:13:07


> well, since what[] is already an array of n subexpressions, during finding
> of matches, I suppose, you set all the fields of what[i], if the i-th
> subexpression matched. While you're doing this, you could also store the
> index i into another structure of what.
>
> I hope I explained this correctly...

Yes, but it's not that simple...

1) As well as adding matches to the list, matches can become "unmatched"
during backtracking.
2) The time taken for a match is completely dominated by the time taken for
calls to new and delete, especially for simple expressions a call to new is
about 10x as long as the time to find a match, so if you reuse your
match_results structures (so the matcher doesn't have to allocate any
memory), then matching is extremely fast. Using a dynamic memory structure
(like std::set or std::map) to store just those sub-expressions that matched
would completely blow this out of the water. And yes, for some folks this
is very important....

An alternative would be to extend the existing structure to hold the extra
information, the trouble is, almost any way I can think of arranging this
will have rather a negative speed impact (don't forget you have to remove as
well as add entries). Probably the fastest method is to just post-process
the results by doing a linear scan through them, and that's something you
can do right now! You'd need hundreds of sub-expressions before this became
unduly inefficient - scanning through a vector really is very fast.

There is one option I can think of that might work: add a linked list to the
existing submatches, so that each links forward to the next matched
subexpression and back to the previous matched subexpression. However, this
means that only bi-directional (not random access ) would be possible to
"just the matched" sub-expressions. It would also double the size of the
sub_match structure, which itself has important implications: because these
structures are pushed into the stack (on Windows at least), it basically
halves the maximum complexity of expression you can match, more excessive
memory usage also has a performance impact. For this reason the capturing
code (see repeated captures in libs/regex/doc/captures.html) isn't enabled
by default: it also doubles the size of sub_match, with all the issues that
raises.

Sorry if I'm being unduly pessimistic, but in all honesty, the most
efficient method really is to just scan the match_results structure after
the event, this has the advantage as well that you don't pay for what you
don't use.

John.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk