
Boost Users :

From: Lynn Allan (l_d_allan_at_[hidden])
Date: 2006-04-18 21:56:06


> Interesting! Can you confirm that re2c is not handling backreferences?
> That is, after a match, is there a way to access what the Nth group
> matched? Also, do you think you could send around the code that re2c
> is generating for this expression?

This regex newbie is ignorant about what it means to "back-reference."
My usage (so far) has been recognizing one type of expression at a
time, like:
((Sunday|Sun)|(Monday|Mon) ...etc.(Saturday|Sat))
or
ZipCode ##### or #####-####

It seems to be able to get partway through the longer alternative and
then "settle for" the shorter abbreviation. That probably isn't
back-referencing.

ZipCode 00000-000 is "almost" #####-####, but isn't, so it recognizes
#####.
MondX is "almost" Monday, but isn't, so it recognizes Mon, which is
also group=2.
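
The "settle for the shorter match" behavior can be sketched by hand.
This is only an illustration, not re2c output: match_zip and its
marker variable are names made up for the example. The marker
remembers the last position where a complete match ended, which
appears to be the job YYMARKER does in the generated code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not generated by re2c): returns the length of the
// longest ZIP-code prefix of `s`: 10 for #####-####, 5 for #####, 0 for none.
std::size_t match_zip(const std::string& s) {
    std::size_t cursor = 0;

    // Consume exactly `count` digits starting at `cursor`; stop at the
    // first non-digit and report failure.
    auto take_digits = [&](int count) {
        for (int i = 0; i < count; ++i, ++cursor) {
            if (cursor >= s.size() ||
                !std::isdigit(static_cast<unsigned char>(s[cursor]))) {
                return false;
            }
        }
        return true;
    };

    if (!take_digits(5)) return 0;   // not even the short form

    std::size_t marker = cursor;     // last accepting position so far
    if (cursor < s.size() && s[cursor] == '-') {
        ++cursor;
        if (take_digits(4)) marker = cursor;  // the long form matched
    }

    // If the long form failed partway, marker still holds the end of the
    // short form, so the scanner "settles for" #####.
    assert(marker <= s.size());
    return marker;
}
```

With this sketch, match_zip("00000-000") returns 5 and
match_zip("12345-6789") returns 10.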

It is relatively straightforward to get the results of a single match
and do what you want with it ... I don't have experience with
untangling something more complicated like a multi-piece date/time:
MMM DDD, YYYY hh:mm:ss [ap].m
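
For comparison, here is roughly how a capture-group library would pull
such a date/time apart. This is a sketch using std::regex (whose
interface is modeled on Boost.Regex), not re2c. The DateTime struct,
the parse_date_time name, and the exact input format (a three-letter
month, a one- or two-digit day, and a trailing "a.m" or "p.m") are
assumptions made for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical container for the pieces of the date/time.
struct DateTime {
    std::string month;
    int day = 0, year = 0, hour = 0, minute = 0, second = 0;
    char meridiem = '\0';  // 'a' or 'p'
};

// Hypothetical helper: parses text such as "Apr 18, 2006 09:56:06 p.m".
// Each parenthesized group in the pattern becomes one numbered sub-match.
bool parse_date_time(const std::string& text, DateTime& out) {
    static const std::regex pattern(
        R"(([A-Za-z]{3}) (\d{1,2}), (\d{4}) (\d{2}):(\d{2}):(\d{2}) ([ap])\.m)");
    std::smatch m;
    if (!std::regex_match(text, m, pattern)) return false;

    assert(m.size() == 8);  // whole match plus seven groups
    out.month    = m[1].str();
    out.day      = std::stoi(m[2].str());
    out.year     = std::stoi(m[3].str());
    out.hour     = std::stoi(m[4].str());
    out.minute   = std::stoi(m[5].str());
    out.second   = std::stoi(m[6].str());
    out.meridiem = m[7].str()[0];
    return true;
}
```

The Nth group is read back as m[N] after a successful match; m[0] is
the whole match.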

> I'm guessing that re2c is generating a DFA. Boost.Regex and Xpressive
> generate NFAs because DFAs aren't suited to doing backreferences[*].
> I've considered adding DFA support to Xpressive, and use DFAs for
> those regexes that don't need the full power of NFAs. Clearly, the
> performance win would be worth the trouble. This would not be a
> trivial undertaking, however.
>
> [*] Technically, DFAs only have a problem with patterns such as
> "(.)\\1"; that is, when the result of the backreference is used within
> the pattern itself.
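
To make the footnote's case concrete, here is a small sketch of a
pattern that uses a backreference inside itself. It uses std::regex
rather than re2c, Boost.Regex, or Xpressive, and first_doubled_char is
a name made up for the example. The pattern (.)\1 means "any
character, followed by that same character again".

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the first character that appears twice in
// a row in `text`, or '\0' if no character is doubled.
char first_doubled_char(const std::string& text) {
    // Group 1 captures one character; \1 then requires the same
    // character to follow immediately.
    static const std::regex doubled(R"((.)\1)");
    std::smatch m;
    if (!std::regex_search(text, m, doubled)) return '\0';

    assert(m.size() == 2);   // whole match plus group 1
    return m[1].str()[0];    // what group 1 captured
}
```

Reading m[1] after the match is the "access what the Nth group
matched" case; the \1 inside the pattern is the case the footnote says
DFAs have trouble with.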

Re2c generates VERY gnarly code, full of gotos and labels ... but the
compiler is happy and the optimizer seems to straighten everything out
into fast object code. Re2c is apparently used by the PHP project.
Here is a fragment of the generated code:

yy4:
      yych = *++YYCURSOR;
      goto yy3;
yy5:
      yych = *++YYCURSOR;
      if(yych <= '/') goto yy6;
      if(yych <= '9') goto yy7;
yy6:
      YYCURSOR = YYMARKER;
      switch(yyaccept){
      case 1: goto yy10;
      case 0: goto yy3;
      }
yy7:
      yych = *++YYCURSOR;
      if(yych <= '/') goto yy6;
      if(yych >= ':') goto yy6;
      yych = *++YYCURSOR;
      if(yych <= '/') goto yy6;


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net