Boost logo

Boost :

From: John Maddock (john_at_[hidden])
Date: 2005-09-17 13:13:38


Eric, my apologies for not posting before, I really must get around to a
full review, but in the mean time:

> In short, xpressive comes out consistently ahead of Boost.Regex on short
> matches, and roughly on par for longer matches (with wide variation).
> Results are shown for both gcc 3.4 and vc7.1. The xpressive download
> includes the code for the perf test, so you can run it yourself, if you
> like.

I'm attaching the full results I get from my performance test suite (updated
to reflect the changes you made in the xpressive one, but without the static
expressions: I couldn't be bothered to translate them all!). I compiled
with VC7.1 with all the optimisations on (any suitable inline, global
optimisations on, inline intrinsics on). Be aware that any such results
should be treated with extreme caution: there is absolutely no such thing as
a "typical" regex. I haven't run the tests with cygwin: in the past I've
found cygwin to be a pretty poor platform for testing code performance, I
don't think the cygwin guys have put much effort into runtime efficiency, as
compared to Linux say.

For the "trivial" short matches test, I see broadly very similar results
from xpressive and Boost.Regex, I haven't counted them, but it looks like
honors even to me. The bad news is that PCRE kicks us both into touch on
this test, however I'm pretty sure that Boost.Regex only dropped behind PCRE
on this test section, when I added protection againt stack overflow (a __try
__except block). Since I pinched this idea from GRETA, I suspect xpression
does the same thing? In any case that protection is a "good thing" so it's
staying in Boost.Regex whatever.

For the html document search test cases xpressive is consistently ahead: by
up 5x, probably the Boyer-Moore code kicking in (but see below). On this
test PCRE is somewhere in between us.

For the C++ code search test cases, Boost.Regex is consistently ahead by
about 2x (it beats PCRE on this test as well), there is one complicated
expression that xpressive didn't compile, but which Boost.Regex and PCRE did
handle OK.

For the plain text search test cases, Boost.Regex is ahead of xpressive in
all cases, up to 12x faster in the short text, and 25x in the very long text
(these are the extremes though, don't pay too much attention!).

The really interesting thing about these tests, is that some of the
expressions are string literals: the xpressive Boyer Moore code should
really shine here, and yet it comes out quite a bit slower. I'm not
completely sure what's happening here, but I would guess that the
Boost.Regex code is easier to optimise, and so executes faster, *provided*
it doesn't find a tentative match too often. In comparison in the html
tests, most of the regular expressions begin with "<", which tends to occur
rather often in html. So in this case, a better algorithm wins out over the
easier to optimise code.

Ideally there would be an heuristist that would choose between the two, and
always pick the best, but without analysing the text that you're going to
search first I don't what it would be at present :-(

Finally, I would note that in the search tests the Boost.Regex results are
hampered compared to PCRE by the need to construct a regex_iterator: the
class is a pimple which uses a shared_ptr, so there's a couple of memory
allocations in there that PCRE doesn't have to do. For short searches this
a big hit, but *only* in the rarified atmosphere of a test suite, back in
the real world iterators tend to get passed around by value, so this is a
another "good thing" in general. I haven't looked at PCRE, but I suspect
you do something similar?

John.


Regular Expression Performance Comparison

   The following tables provide comparisons between the following regular expression libraries:

   [1]GRETA.

   [2]The Boost regex = library.

   [3]Henry Spencer's regular = expression library - this is provided for
   comparison as a typical non-backtracking = implementation.

   Philip Hazel's [4]PCRE = library.

  Details

   Machine: Intel Pentium 4 2.8GHz PC.

   Compiler: Microsoft Visual C++ version 7.1.

   C++ Standard Library: Dinkumware standard library version = 313.

   OS: Win32.

   Boost version: 1.33.0.

   PCRE version: 4.5.

   As ever care should be taken in interpreting the results, only sensible regular expressions (rather than pathological cases) are
   given, most = are taken from the Boost regex examples, or from the
   [5]Library of Regular Expressions. In addition, some variation in
   the = relative performance of these libraries can be expected on
   other = machines - as memory access and processor caching effects
   can be quite large for = most finite state machine algorithms.

  Averages

   The following are the average relative scores for all the = tests: the
   perfect regular expression library would score 1, in practice anything less than 2 is pretty good.

   Boost + C++ locale PCRE Dynamic Xpressive
   1.71231 1.75638 4.97007

  Comparison 1: Long Search

   For each of the following regular expressions the time taken to = find
   all occurrences of the expression within a long English language text was measured ([6]mtent12.txt<= /a> from [7]Project Gutenberg,
   = 19Mb).

   Expression Boost + C++ locale PCRE Dynamic Xpressive
   Twain 2.13
   (0.12s) 1
   (0.0563s) 8.54
   (0.481s)
   Huck[[:alpha:]]+ 2.35
   (0.119s) 1
   (0.0507s) 10.5
   (0.53s)
   [[:alpha:]]+ing 1
   (1.06s) 5.71
   (6.07s) 9.28
   (9.85s)
   ^[^ ]*?Twain 1
   (0.305s) 2.59
   (0.791s) 9.62
   (2.93s)
   Tom|Sawyer|Huckleberry|Finn 1
   (0.138s) 1.44
   (0.198s) 18.2
   (2.5s)
   = (Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawy er|Huckleberry|Finn) 1
   (0.285s) 2.32
   (0.66s) 25
   (7.12s)

  Comparison 2: Medium Sized Search

   For each of the following regular expressions the time taken to = find
   all occurrences of the expression within a medium sized English language text was measured (the first 50K from mtent12.txt).

   Expression Boost + C++ locale PCRE Dynamic Xpressive
   Twain 1.21
   (0.000372s) 1
   (0.000308s) 4.07
   (0.00125s)
   Huck[[:alpha:]]+ 1.58
   (0.000317s) 1
   (0.0002s) 6.93
   (0.00139s)
   [[:alpha:]]+ing 1
   (0.00227s) 6.76
   (0.0153s) 11.6
   (0.0263s)
   ^[^ ]*?Twain 1
   (0.000851s) 2.44
   (0.00207s) 8.83
   (0.00751s)
   Tom|Sawyer|Huckleberry|Finn 1
   (0.000704s) 1.89
   (0.00133s) 6.78
   (0.00477s)
   = (Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawy er|Huckleberry|Finn) 1
   (0.00149s) 1.71
   (0.00254s) 12.4
   (0.0185s)

  Comparison 3: C++ Code Search

   For each of the following regular expressions the time taken to = find
   all occurrences of the expression within the C++ source file
   [8]boost/crc.hpp was measured.

   Expression Boost + C++ locale PCRE Dynamic Xpressive
   = ^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struc t)[[:space:]]*(\<\w+\>([ ]*\([^)]*\))?[[:space:]]*)*(\<\w*\>)[[:space:]]*(<[^;:{]+>[[: space:]]*)?(\{|:[^;\{()]*\{) 1
   (0.000235s) 2.79
   (0.000655s) 2.5
   (0.000587s)
   (^[ ]*#(?:[^\\\n]|\\[^\n_[:punct:][:alnum:]]*[\n[:punct:][:word:]])*)|(//[
   ^\n ]*|/\*.*?\*/)|\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\.)?[[:
      digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\>
      |('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|\<(__asm|__cdecl|__declspec|__e
      xport|__far16|__fastcall|__fortran|__import|__pascal|__rtti|__stdcall|
   _as m|_cdecl|__except|_export|_far16|_fastcall|__finally|_fortran|_import|
   _pa scal|_stdcall|__thread|__try|asm|auto|bool|break|case|catch|cdecl|char
   |cl ass|const|const_cast|continue|default|delete|do|double|dynamic_cast|el
   se| enum|explicit|extern|false|float|for|friend|goto|if|inline|int|long|mu
   tab le|namespace|new|operator|pascal|private|protected|public|register|rei
   nte rpret_cast|return|short|signed|sizeof|static|static_cast|struct|switch
   |te mplate|this|throw|true|try|typedef|typeid|typename|union|unsigned|usin
   g|v= irtual|void|volatile|wchar_t|while)\> 1
   (0.0097s) 2.32
   (0.0225s) NA
   ^[ ]*#[ ]*include[ = ]+("[^"]+"|<[^>]+>) 1
   (0.000367s) 1.55
   (0.000567s) 2.45
   (0.000899s)
   ^[ ]*#[ ]*include[ = ]+("boost/[^"]+"|<boost/[^>]+>) 1
   (0.000367s) 1.55
   (0.000567s) 2.45
   (0.000899s)

  Comparison 4: HTML Document Search

   For each of the following regular expressions the time taken to = find
   all occurrences of the expression within the html file
   [9]libs/libraries.htm was measured.

   Expression Boost + C++ locale PCRE Dynamic Xpressive
   beman|john|dave 1
   (0.000519s) 1.3
   (0.000675s) 2.87
   (0.00149s)
   <p>.*?</p> 2.13
   (0.000406s) 1.87
   (0.000357s) 1
   (0.000191s)
   = <a[^>]+href=("[^"]*"|[^[:space:]]+)[^>]*><= /td> 2.12
   (0.00141s) 1
   (0.000665s) 1.71
   (0.00113s)
   = <h[12345678][^>]*>.*?</h[12345678]> 2.37
   (0.000435s) 1.76
   (0.000323s) 1
   (0.000183s)
   = <img[^>]+src=("[^"]*"|[^[:space:]]+)[^>]*>= 4.67
   (0.000411s) 3.33
   (0.000293s) 1
   (8.8e-005s)
   = <font[^>]+face=("[^"]*"|[^[:space:]]+)[^>]*>.*?&l= t;/font> 4.19
   (0.000425s) 2.89
   (0.000293s) 1
   (0.000101s)

  Comparison 3: Simple Matches

   For each of the following regular expressions the time taken to match against the text indicated was measured.

   Expression Text Boost + C++ locale PCRE Dynamic Xpressive
   abc abc 2.29
   (4.2e-007s) 1
   (1.84e-007s) 1.32
   (2.43e-007s)
   ^([0-9]+)(\-| |$)(.*)$ 100- this is a line of ftp response which
   contains a = message string 2.04
   (9.54e-007s) 1
   (4.67e-007s) 1.43
   (6.69e-007s)
   ([[:digit:]]{4}[- = ]){3}[[:digit:]]{3,4} 1234-5678-1234-456 2.5
   (1.34e-006s) 1
   (5.34e-007s) 5.08
   (2.71e-006s)
   = ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)| (([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
   john_at_[hidden] 1.59
   (4.12e-006s) 1
   (2.6e-006s) 1.76
   (4.58e-006s)
   = ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)| (([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ foo12_at_[hidden]
   1.65
   (3.59e-006s) 1
   (2.18e-006s) 1.72
   (3.74e-006s)
   = ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)| (([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
   bob.smith_at_[hidden] 1.68
   (3.59e-006s) 1
   (2.14e-006s) 1.79
   (3.81e-006s)
   ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} = {0,1}[0-9][A-Za-z]{2}$ EH10 2QQ
   1.78
   (1.36e-006s) 1
   (7.64e-007s) 1.52
   (1.16e-006s)
   ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} = {0,1}[0-9][A-Za-z]{2}$ G1 1AA
   1.77
   (1.34e-006s) 1
   (7.54e-007s) 1.54
   (1.17e-006s)
   ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} = {0,1}[0-9][A-Za-z]{2}$ SW1 1ZZ
   1.73
   (1.32e-006s) 1
   (7.64e-007s) 1.52
   (1.16e-006s)
   = ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$ 4/1/2001 1.85
   (1.2e-006s) 1
   (6.49e-007s) 2
   (1.3e-006s)
   = ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$ 12/12/2001 1.88
   (1.2e-006s) 1
   (6.4e-007s) 2.03
   (1.3e-006s)
   ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ 123 1.76
   (1.26e-006s) 1
   (7.16e-007s) 1.25
   (8.97e-007s)
   ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ +3.14159 1.68
   (1.3e-006s) 1
   (7.74e-007s) 1.68
   (1.3e-006s)
   ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ -3.14159 1.71
   (1.34e-006s) 1
   (7.83e-007s) 1.63
   (1.28e-006s)
     _________________________________________________________________

   © Copyright John Maddock 2005

   Use, modification and distribution are subject to the Boost = Software
   License, Version 1.0. (See accompanying file [10]LICENSE_1_0.txt or
   copy at [11]http://www.boost.org/LICENS= E_1_0.txt)

References

   1. 3D"http://research.microsoft.com/projects/greta"
   2. 3D"http://www.boost.org/"
   3. 3D"http://arglist.com/regex/"
   4. 3D"http://www.pcre.org"/
   5. 3D"http://www.regxlib.com/"
   6. 3D"http://www.gutenberg.org/files/3200/old/mtent12.zip"
   7. 3D"http://promo.net/pg/"
   8. file://localhost/boost/crc.hpp"
   9. file://localhost/tmp/libraries.htm"
  10. file://localhost/LICENSE_1_0.txt"
  11. 3D"http://www.boost.org/LICENSE_1_0.txt"


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk