Boost logo

Boost Users :

From: Eric Niebler (eric_at_[hidden])
Date: 2006-04-17 13:16:58


Lynn Allan wrote:
>>>Stuart Dootson wrote:
>>>Lynn - it might be interesting to use re2c (http://re2c.org/) to
>>>generate regex machines and see what their performance is like -
>>>after
>>>all, they do clain 're2c generates warning free code that is equal
>>>to
>>>hand-written code in terms of size, speed and quality.' (I'm
>>>offering
>>>no opinion on the veracity of that statement....), so (allegedly)
>>>they'd give a reasonable hand-code analogue....
>>
>>Lynn Allan wrote:
>>I took a look at it, and it looks like it could possibly be quite
>>useful if executable size was a concern .... but it is pretty
>>obscure
>
>
> re2c seems very fast, and once you get the hang of it, not that
> difficult to use. For the contrived test of looking for DayOfWeek
> within a 1kb buffer, here are some results ... ymmv (WinXp-Sp2
> Microsoft ToolKit-2003 C++ Compiler with /O2 optimize flag)
>
> regex = "(Sunday|Sun)|(Monday|Mon) etc.(Saturday|Sat)";
>
> By-hand parser: Elapsed for 10000 loops Ms: 110.583
> By-hand parser: Elapsed for 100000 loops Ms: 1107.81
>
> re2c generator Elapsed for 10000 loops Ms: 69.4683
> re2c generator Elapsed for 100000 loops Ms: 700.546
>
> Boost::xpressive-static-iterator: Elapsed for 10000 loops Ms: 410.492
> Boost::xpressive-static-iterator: Elapsed for 100000 loops Ms: 4164.45
>
> Boost::regex: Elapsed for 10000 loops Ms: 622.377
> Boost::regex: Elapsed for 100000 loops Ms: 5965.24

Interesting! Can you confirm that re2c is not handling backreferences?
That is, after a match, is there a way to access what the Nth group
matched? Also, do you think you could send around the code that re2c is
generating for this expression?

I'm guessing that re2c is generating a DFA. Boost.Regex and Xpressive
generate NFAs because DFAs aren't suited to doing backreferences[*].
I've considered adding DFA support to Xpressive, and use DFAs for those
regexes that don't need the full power of NFAs. Clearly, the performance
win would be worth the trouble. This would not be a trivial undertaking,
however.

[*] Technically, DFAs only have a problem with patterns such as
"(.)\\1"; that is, when the result of the backreference is used within
the pattern itself.

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com

Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net