Hi,

I am trying to use a pattern-matching approach on a dataset with multilevel tags. Suppose we have two sets of features, F and H, with tags F = {f0, f1, f2, f3} and H = {h0, h1, h2, h3, h4}, that describe a set of elements E = {e1, e2, e3, ..., en}. A table holds a sequence of E elements described as follows:

E    e1   e2   e3   e4   e5   e6   e7   e8   ...   en
F    f3   f0   f0   f0   f2   f3   f0   f2   ...   f0
H    h4   h0   h4   h1   h0   h2   h3   h2   ...   h1

The order in which e1, e2, e3, ... appear is important, like the order of words in a sentence.

Suppose we have the following RE over F, using f3 and f0: f3f0+. It matches the sequences e1, e2, e3, e4 and e6, e7 in E. Now we want to add a constraint: the last f0 must correspond to h1 in H. In that case the final result is only the sequence e1, e2, e3, e4, because in the sequence e6, e7 the last f0 corresponds to h3 in H. The final result is always a set of E sequences, but the RE and the constraint can be based on E, F, or H.
I hope this makes it a bit clearer. Thanks for your help.
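
To make the request concrete, here is a minimal sketch of one possible approach, not a definitive implementation. It assumes every tag is exactly two characters long and encodes each element as a fixed-width token such as "f3h4;", so that a character offset in the encoded text maps back to an element index. It uses std::regex in place of boost::regex only to stay self-contained. The names encode and find_ranges are hypothetical helpers made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Half-open range [first, second) of 0-based element indices (e1 is index 0).
using index_range = std::pair<std::size_t, std::size_t>;

// Hypothetical helper: encode element i as the token "<F tag><H tag>;",
// e.g. "f3h4;".  Assumes every tag is exactly two characters long.
std::string encode(const std::vector<std::string>& f,
                   const std::vector<std::string>& h)
{
    assert(f.size() == h.size());
    std::string out;
    for (std::size_t i = 0; i < f.size(); ++i) {
        assert(f[i].size() == 2 && h[i].size() == 2);
        out += f[i] + h[i] + ';';
    }
    return out;
}

// Hypothetical helper: search the encoded text for every non-overlapping
// match of `pattern` and convert character offsets to element indices.
// Patterns are assumed to be written in whole tokens, so each match
// starts and ends on a token boundary.
std::vector<index_range> find_ranges(const std::string& encoded,
                                     const std::string& pattern)
{
    constexpr std::size_t token_len = 5;  // two tags of 2 chars plus ';'
    const std::regex re(pattern);
    std::vector<index_range> result;
    for (std::sregex_iterator it(encoded.begin(), encoded.end(), re), end;
         it != end; ++it) {
        const std::size_t first =
            static_cast<std::size_t>(it->position(0)) / token_len;
        const std::size_t count =
            static_cast<std::size_t>(it->length(0)) / token_len;
        result.emplace_back(first, first + count);
    }
    return result;
}
```

With the table above (e1 to e8), the plain f3f0+ pattern would be written "f3h[0-9];(?:f0h[0-9];)+", and the constrained version, where the last f0 must carry h1, would be "f3h[0-9];(?:f0h[0-9];)*f0h1;". Putting the constraint inside the pattern lets the regex engine backtrack to a shorter run of f0 elements if the longest run does not end in h1.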

Regards
Olivier



2013/9/27 Anthony Foiani <tkil@scrye.com>
Olivier, greetings --

Olivier Austina <olivier.austina@gmail.com> writes:

> I am wondering if it is possible to run directly regular expression
> over a list and getting the indexes (beginning and end) of the
> match. For example, I have a list of strings and a regular
> expression and I want to know which part of the list matches the RE
> and get the corresponding indexes in the list.

It looks like you got some other suggestions, but if they're not what
you're looking for, you might want to clarify your request.

In particular, it's not clear to me whether you want to match the RE
against each individual item (which can be parallelized, but the
result is a membership bitmap or subset, not a range), or if you want
to match the RE against the concatenated value, or the largest span of
continuous values (which are the requests that make the most sense if
you want a starting and ending index).

Some sample code would probably clarify things.

E.g., given:

  typedef std::vector< std::string > string_vec;
  const string_vec sv{ "foo", "bar", "baz" };

And you wanted to match:

  const boost::regex re{ "ba.*" };

What answer do you want to see?

  * The membership bitmap would be something like: [ 0, 1, 1 ]

  * The subset would be: { "bar", "baz" }
    (This is basically a "grep" operator.)

  * The "range" answer would be sv.begin()+1, sv.begin()+3 (since
    "barbaz" matches).

  * The other "range" answer would be the same, but for a different
    reason: "bar" matches and "baz" matches, so the range represents
    the (longest?) set of adjacent elements that individually match
    the given regex.  This ends up being something of a meta-regex,
    or if you prefer, the result of searching the concatenated string
    for instances of "(?:re)+".

Or is there some other interpretation that you're trying to get at?
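
The four interpretations above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses std::regex rather than boost::regex to stay self-contained, and the function names (membership, subset, concatenated_range, longest_run) are made up for this example. Ranges are half-open pairs of indices rather than iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

using string_vec = std::vector<std::string>;
using index_range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Membership bitmap: one flag per element, true if the whole element matches.
std::vector<bool> membership(const string_vec& sv, const std::regex& re)
{
    std::vector<bool> bits;
    for (const auto& s : sv)
        bits.push_back(std::regex_match(s, re));
    return bits;
}

// Subset ("grep"): the elements that match as a whole.
string_vec subset(const string_vec& sv, const std::regex& re)
{
    string_vec out;
    for (const auto& s : sv)
        if (std::regex_match(s, re))
            out.push_back(s);
    return out;
}

// Range over the concatenation: search the joined string, then report the
// elements overlapped by the first match.  Returns (0, 0) if nothing matches.
index_range concatenated_range(const string_vec& sv, const std::regex& re)
{
    std::string joined;
    std::vector<std::size_t> starts;  // starts[i] = offset of sv[i] in joined
    for (const auto& s : sv) {
        starts.push_back(joined.size());
        joined += s;
    }
    std::smatch m;
    if (!std::regex_search(joined, m, re))
        return index_range(0, 0);
    const std::size_t lo = static_cast<std::size_t>(m.position(0));
    const std::size_t hi = lo + static_cast<std::size_t>(m.length(0));
    std::size_t first = 0;
    while (first + 1 < sv.size() && starts[first + 1] <= lo)
        ++first;
    std::size_t last = first;
    while (last < sv.size() && starts[last] < hi)
        ++last;
    return index_range(first, last);
}

// Longest run of adjacent elements that each match individually.
index_range longest_run(const string_vec& sv, const std::regex& re)
{
    index_range best(0, 0);
    std::size_t run_start = 0;
    for (std::size_t i = 0; i <= sv.size(); ++i) {
        const bool hit = i < sv.size() && std::regex_match(sv[i], re);
        if (!hit) {
            if (i - run_start > best.second - best.first)
                best = index_range(run_start, i);
            run_start = i + 1;
        }
    }
    return best;
}
```

For sv and "ba.*" as given, both range functions yield the index pair (1, 3), which corresponds to sv.begin()+1 and sv.begin()+3, but they get there for different reasons and will disagree on other inputs.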

Happy hacking,
Tony

p.s. Heh.  Guess who just finished a few interviews where "spot the
     under-specified problem" was important...