Boost logo

Boost Users :

Subject: Re: [Boost-users] Running Regular expression (RE) over a list, array or map
From: Olivier Austina (olivier.austina_at_[hidden])
Date: 2013-09-28 04:29:54


Hi Anthony Foiani,
Thank you for the reply.
Yes, you understands what I am trying to do.
F and H are distinct types. It can have more than 2 types. Type are not
necessarily character, it can be string also. Is it easier to get indexes
with Spirit?
Regards
Olivier

2013/9/28 Anthony Foiani <tkil_at_[hidden]>

>
> Olivier --
>
> Olivier Austina <olivier.austina_at_[hidden]> writes:
>
> > I try to use pattern matching approach for a dataset with multilevel
> > tags. Suppose we have 2 sets of features: F and H with the following
> > tags F={f0, f1, f2, f3} and H={h0, h1, h2, h3, h4} to describe a set
> of
> > elements E={e1, e2, e3.....en}. We have in a table a sequence of E
> > elements describe as following:
> >
> > E e1 e2 e3 e4 e5 e6 e7 e8 ...........en
> > F f3 f0 f0 f0 f2 f3 f0 f2 f0
> > H h4 h0 h4 h1 h0 h2 h3 h2 h1
> >
> > The order in which e1, e2 e3...... appear is important as words in a
> > sentence.
> >
> > Suppose we have the following RE with f3 and f0 as following: f3f0+ .
> > It will match the corresponding sequence e1, e2, e3, e4 and e6, e7 in
> > E. Now we want to add additional constraint as the last f0 should map
> a
> > h1 in H. In this case the final result will be only the sequence e1,
> > e2, e3, e4 because in the last sequence e6, e7 the f0 map h3 in H.The
> > final result is always E sequences but the RE and constraint can be
> > based on E, F or H.
>
> > May be it becomes a bit clear. Thank for your help.
>
> This does make more sense.
>
> I suspect that what you will want to do is to use a regex iterator for
> each constraint that you're trying to apply. Each time a given
> iterator matches, it will make a sequence (begin+end) available; you
> can then apply whatever other constraints you need, to determine
> whether you want to save (or otherwise note) the match.
>
> Are F and H distinct types? If so, it gets ugly, because you need a
> different regex template for each type you're trying to match against,
> and those templates are not related (so you can't easily store an
> arbitrary list of them). Maybe variant would help.
>
> If you really only have two tags, a cheap-and-dirty trick is to try to
> fit them into bitfields, then match on character sets. E.g.,
>
> f0 = 0x00 h0 = 0x00
> f1 0x10 h1 0x01
> f2 0x20 h2 0x02
> ... ...
> f15 0xf0 h15 0x0f
>
> So your sample vector becomes:
>
> element: e1 e2 e3 e4 e5 ...
> tags: 0x 34 00 04 01 20 ...
>
> And you can do your matches by slicing the bitfield either way:
>
> f3f0+ [\x30-\x3f][\x00-\x0f]+
> "last f0 should also be h1" .*?(:>\x01[^\x00-\x0f])
>
> Or something like that, it's easier if you know that will always come
> after the first filter, or you can combine them:
>
> f3f0+ where last f0 should also be h1 [\x30-\x3f][\x00-\x0f]*\x01
>
> But this is insane, of course, and probably not applicable. (And
> "\x00" in a string literal is just asking for trouble.)
>
> You could play games with fixed-width representations as well,
> depending on your exact requirements. Instead of having two rows, you
> interleave the tags:
>
> E e1 e2 e3 e4 e5 e6 e7 e8 ...........en
> FH f3h4f0h0f0h4f0h1f2h0f3h2f0h3f2h2 f0h1
>
> And you can then use the same tricks as above, but with '.' as a
> "don't care" value:
>
> "f3f0+" becomes "(?:f3..)(?:f0..)+"
>
> And the additional constraint is even more straightforward:
>
> "(?:f3..)(?:f0..)*(?:f0h1)"
>
> This will probably be a bit less efficient, but...
>
> If you really do want to use the regex machinery on the raw tag types,
> I think you need to do all of this:
>
> 1. Create a boost::regex instantiation that knows about your tags,
> which probably means at least creating a traits class for each of
> your tags:
>
>
> http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/html/boost_regex/ref/concepts/traits_concept.html
>
> 2. Build something that will turn an iterator across your table into
> an iterator across just one row of that table, and vice-versa.
>
> 3. Make a struct or class that manages those two for a given tag type
> and row;
>
> 4. Make all those structs related by inheritance to a base struct (or
> use some sort of type erasure technique, or lambdas maybe.)
>
> Whether it's worth all that work would depend on how many criteria you
> want to express, IMHO.
>
> (And even then, I'm not sure it will work gracefully, since regex
> mingles data (regular characters) with control (meta characters).
> Since your tags aren't really "characters" in this sense, I'm not sure
> this will work.)
>
> It might be that using Spirit might be easier, or doing the state
> machine explicitly.
>
>
>
> _______________________________________________
> Boost-users mailing list
> Boost-users_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/boost-users
>



Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net