Boost logo

Boost Users :

From: Eric Niebler (eric_at_[hidden])
Date: 2008-01-26 14:53:44


Eric Niebler wrote:
> Aries Tao wrote:
>> hi everybody,I use boost.xpressive to search email address in a binary
>> file which size is 10*1024*1024 bytes. every bytes is 0x6f in that
>> file.boost.xpressive is inefficient. anyone can help me? thanks!
>> the code is below:
> <snip>

After I figured out what John's code was doing, it was a pretty simple
matter to implement a similar optimization in xpressive, which I have
just committed to trunk. It should dramatically speed up repeated
searches for patterns that begin with a simple repeat, such as "[a-z]+".
When used together with an independent subexpression to eliminate
unnecessary backtracking, the performance is really quite good.

I'm attaching a little benchmark program which uses your pattern to
search a 1Gb buffer. Here are the timings after my fix:

Boost.regex: 6.859
Boost.xpressive (dynamic): 3.02
Boost.xpressive (static): 2.912

With boost.regex, I couldn't use an independent subexpression because it
disables the optimization. That accounts for the difference.

Thanks for bringing this issue to my attention.

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com

///////////////////////////////////////////////////////////////////////////////
// main.hpp
//
// Copyright 2007 Eric Niebler. Distributed under the Boost
// Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#include <cstring>
#include <iostream>
#include <boost/timer.hpp>
#include <boost/regex.hpp>
#include <boost/xpressive/xpressive.hpp>

int main()
{
    std::size_t const Mb = 1000000000;
    char *begin = new char[Mb];
    char *end = begin + Mb;
    std::memset(begin, 0x6f, Mb);

    try
    {
        using namespace boost;
        cregex_iterator e;
        char const *pattern = "([a-z#~_\\.!\\#$%\\^&\\*\\(\\)\\-]+@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)";
        regex token(pattern, (regex_constants::syntax_option_type)(regex_constants::ECMAScript | regex_constants::optimize));
        boost::timer tim;
        tim.restart();
        cregex_iterator cur(begin, end, token);
        for(; cur != e; ++cur)
            ;
        double elapsed = tim.elapsed();
        std::cout << "Boost.regex: " << elapsed << '\n';
    }
    catch(std::exception const &e)
    {
        std::cout << "boost.regex error: " << e.what() << std::endl;
    }

    try
    {
        using namespace boost::xpressive;
        cregex_iterator e;

        // test a dynamic regex
        char const *pattern = "((?>[a-z#~_\\.!\\#$%\\^&\\*\\(\\)\\-]+)@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)";
        cregex token = cregex::compile(pattern, regex_constants::ECMAScript | regex_constants::optimize);
        boost::timer tim;
        tim.restart();
        cregex_iterator cur(begin, end, token);
        for(; cur != e; ++cur)
            ;
        double elapsed = tim.elapsed();
        std::cout << "Boost.xpressive (dynamic): " << elapsed << '\n';

        // test a static regex
        token =
            (s1=
                keep(+set[range('a','z') | '#' | '~' | '_' | '.' | '!' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'])
>> '@'
>> +set[range('a','z') | '#' | '_' | '-']
>> '.'
>> +set[range('a','z') | '#' | '_' | '-' | '.']
            );
        tim.restart();
        cregex_iterator cur2(begin, end, token);
        for(; cur2 != e; ++cur2)
            ;
        elapsed = tim.elapsed();
        std::cout << "Boost.xpressive (static): " << elapsed << '\n';
    }
    catch(std::exception const &e)
    {
        std::cout << "boost.xpressive error: " << e.what() << std::endl;
    }

    return 0;
}


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net