From: John Maddock (john_at_[hidden])
Date: 2006-11-03 12:09:30


Luc LA. ALQUIER wrote:
> The Boost documentation at
> http://www.boost.org/libs/regex/doc/match_flag_type.html
> says this:
>
>
>
> “match_extra: Instructs the matching engine to retain all available
> capture information (http://www.boost.org/libs/regex/doc/captures.html);
> if a capturing group is repeated then information about every repeat is
> available via match_results::captures()
> (http://www.boost.org/libs/regex/doc/match_results.html#m17) or
> sub_match_captures()
> (http://www.boost.org/libs/regex/doc/sub_match.html#m8).”
>
>
>
> For me this was THE great feature, because it offers a way to link
> related pieces of information together.
>
> But the behavior of this flag with the search algorithm was not what
> I expected.
>
>
>
> Instead of giving information about every repeat,
> sub_match_captures() contains just the captures that were obtained for
> the corresponding sub-expression (as the documentation of sub_match's
> captures member at http://www.boost.org/libs/regex/doc/sub_match.html
> says).

I'm sorry you didn't find it clear: it contains information about every
sub-expression that was *matched*.

> For example (using named-capture syntax, which Boost does not support
> today, to make the regular expression clearer):
>
>
>
> ^(?<time>[^ ]+)(?: (?<attr>[A-Za-z]+)=(?:"(?<qvalue>[^"]+)"|(?<svalue>[^ ]+)))+
>
>
>
> which is intended to parse lines like this one:
>
>
>
> 12/05/2006_12:04:25 id=5 msg="this is a problem" user=paul
>
>
>
> The captures for this example are:
>
> time={'12/05/2006_12:04:25'}
> attr={'id','msg','user'}
> qvalue={'this is a problem'}
> svalue={'5','paul'}
>
>
>
> whereas I was expecting:
>
> time={'12/05/2006_12:04:25'}
> attr={'id','msg','user'}
> qvalue={null,'this is a problem',null}
> svalue={'5',null,'paul'}

Nod: understood. However there are a couple of problems here:

1) The underlying engine has no knowledge of whether one capturing group is
embedded "inside" another. This pretty much rules out tree-like structures
without a major rewrite.
2) If a capturing group is unmatched then the engine can't output an empty
string to the result array because it never "sees" the unmatched capture so
it has no knowledge that it has been skipped over.
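
One workaround that avoids both problems is to match a single attr=value
pair at a time and check which alternative took part in each match, so the
pairing is never lost. The sketch below is only an illustration under stated
assumptions: it uses std::regex and std::sregex_iterator from the C++
standard library rather than Boost.Regex's match_extra, it assumes pairs are
separated by single spaces as in the sample line, and the names Attribute,
LogLine and parse_line are invented for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record types for this sketch (not part of any library).
struct Attribute {
    std::string name;
    std::string value;
    bool quoted;  // true if the value came from the "..." alternative
};

struct LogLine {
    std::string time;
    std::vector<Attribute> attrs;
};

// Parses a line of the form: <time> name=value name="quoted value" ...
// Returns false if the line does not start with a time token.
bool parse_line(const std::string& line, LogLine& out) {
    // Leading token up to the first space.
    static const std::regex time_re(R"(^([^ ]+))");
    // One pair: group 1 = name, group 2 = quoted value, group 3 = bare value.
    static const std::regex pair_re(
        R"re( ([A-Za-z]+)=(?:"([^"]+)"|([^ ]+)))re");

    std::smatch m;
    if (!std::regex_search(line, m, time_re)) return false;

    out.time = m[1].str();
    out.attrs.clear();

    // Walk the rest of the line one pair at a time. Each match carries its
    // own sub-matches, so the name stays tied to its value.
    const std::sregex_iterator end;
    for (std::sregex_iterator it(m[0].second, line.end(), pair_re);
         it != end; ++it) {
        const std::smatch& pm = *it;
        Attribute a;
        a.name = pm[1].str();
        a.quoted = pm[2].matched;  // an unmatched group reports matched == false
        a.value = a.quoted ? pm[2].str() : pm[3].str();
        out.attrs.push_back(a);
    }
    return true;
}
```

For the sample line above, parse_line fills in the time and then three
attributes in order: id with the bare value 5, msg with the quoted value
"this is a problem", and user with the bare value paul. The quoted flag
plays the role of the null entries in the expected output.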

> The data I get is “useless” because the structure is lost: there is no
> way to link “paul” to “user”, or “msg” to “this is a problem”.
>
>
>
> Sorry for my English, which may be a source of misunderstanding, but
> it would be good if the documentation matched the actual behavior,
> or if the library behaved as I was expecting.
>
>
>
> I understand that there is a limitation to the behavior I was expecting,
> since it does not take account of the underlying structure when a
> repeated group is nested inside another repeated group.
>
>
>
> There are several ways to avoid losing these relationships between
> the data (with differing degrees of relevance):
>
> - Build a hierarchical tree of captures (a syntax tree).
>
> - Provide an iterator over all captures that keeps track of the
> order in which they appeared.

That's the only option that might be possible.

> - Allow named captures with duplicate group names.
>
> So, is this something to fix in the documentation, or is it a bug?

I don't think it's either, but I'll see if I can make the docs clearer.

Aside: one way to tackle the "maybe a string, or maybe not" problem would be
to use:

(")?(?(1)[^"]*|[^ ]*)(?(1)")

Hopefully I've typed that in correctly!

Regards, John.

