Boost logo

Boost Users :

Subject: Re: [Boost-users] Running Regular expression (RE) over a list, array or map
From: Anthony Foiani (tkil_at_[hidden])
Date: 2013-09-27 18:08:22


Olivier --

Olivier Austina <olivier.austina_at_[hidden]> writes:

> I try to use pattern matching approach for a dataset with multilevel
> tags. Suppose we have 2 sets of features: F and H with the following
> tags F={f0, f1, f2, f3} and H={h0, h1, h2, h3, h4} to describe a set of
> elements E={e1, e2, e3.....en}. We have in a table a sequence of E
> elements describe as following:
>
> E e1 e2 e3 e4 e5 e6 e7 e8 ...........en
> F f3 f0 f0 f0 f2 f3 f0 f2 f0
> H h4 h0 h4 h1 h0 h2 h3 h2 h1
>
> The order in which e1, e2 e3...... appear is important as words in a
> sentence.
>
> Suppose we have the following RE with f3 and f0 as following: f3f0+ .
> It will match the corresponding sequence e1, e2, e3, e4 and e6, e7 in
> E. Now we want to add additional constraint as the last f0 should map a
> h1 in H. In this case the final result will be only the sequence e1,
> e2, e3, e4 because in the last sequence e6, e7 the f0 map h3 in H.The
> final result is always E sequences but the RE and constraint can be
> based on E, F or H.

> May be it becomes a bit clear. Thank for your help.

This does make more sense.

I suspect that what you will want to do is to use a regex iterator for
each constraint that you're trying to apply. Each time a given
iterator matches, it will make a sequence (begin+end) available; you
can then apply whatever other constraints you need, to determine
whether you want to save (or otherwise note) the match.

Are F and H distinct types? If so, it gets ugly, because you need a
different regex template for each type you're trying to match against,
and those templates are not related (so you can't easily store an
arbitrary list of them). Maybe variant would help.

If you really only have two tags, a cheap-and-dirty trick is to try to
fit them into bitfields, then match on character sets. E.g.,

    f0 = 0x00 h0 = 0x00
    f1 0x10 h1 0x01
    f2 0x20 h2 0x02
    ... ...
    f15 0xf0 h15 0x0f

So your sample vector becomes:

  element: e1 e2 e3 e4 e5 ...
  tags: 0x 34 00 04 01 20 ...

And you can do your matches by slicing the bitfield either way:

  f3f0+ [\x30-\x3f][\x00-\x0f]+
  "last f0 should also be h1" .*?(:>\x01[^\x00-\x0f])

Or something like that, it's easier if you know that will always come
after the first filter, or you can combine them:

  f3f0+ where last f0 should also be h1 [\x30-\x3f][\x00-\x0f]*\x01

But this is insane, of course, and probably not applicable. (And
"\x00" in a string literal is just asking for trouble.)

You could play games with fixed-width representations as well,
depending on your exact requirements. Instead of having two rows, you
interleave the tags:

  E e1 e2 e3 e4 e5 e6 e7 e8 ...........en
  FH f3h4f0h0f0h4f0h1f2h0f3h2f0h3f2h2 f0h1

And you can then use the same tricks as above, but with '.' as a
"don't care" value:

 "f3f0+" becomes "(?:f3..)(?:f0..)+"

And the additional constraint is even more straightforward:

  "(?:f3..)(?:f0..)*(?:f0h1)"

This will probably be a bit less efficient, but...

If you really do want to use the regex machinery on the raw tag types,
I think you need to do all of this:

1. Create a boost::regex instantiation that knows about your tags,
   which probably means at least creating a traits class for each of
   your tags:

      http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/html/boost_regex/ref/concepts/traits_concept.html

2. Build something that will turn an iterator across your table into
   an iterator across just one row of that table, and vice-versa.

3. Make a struct or class that manages those two for a given tag type
   and row;

4. Make all those structs related by inheritance to a base struct (or
   use some sort of type erasure technique, or lambdas maybe.)

Whether it's worth all that work would depend on how many criteria you
want to express, IMHO.

(And even then, I'm not sure it will work gracefully, since regex
mingles data (regular characters) with control (meta characters).
Since your tags aren't really "characters" in this sense, I'm not sure
this will work.)

It might be that using Spirit might be easier, or doing the state
machine explicitly.

  
    


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net