|
Boost Users : |
From: Eric Niebler (eric_at_[hidden])
Date: 2008-01-24 14:57:50
Aries Tao wrote:
> hi everybody,I use boost.xpressive to search email address in a binary
> file which size is 10*1024*1024 bytes. every bytes is 0x6f in that
> file.boost.xpressive is inefficient. anyone can help me? thanks!
> the code is below:
<snip>
I've done some investigation, and I've discovered a couple of things...
- The Boost.Regex algorithms that do not accept a match_results struct
automatically set the match_any flag. I don't see anything about that in
TR1. Compliance issue? Or is this covered by the as-if rule?
- The match_any flag has a dramatic adverse effect on the performance of
some regular expressions because of the following lines in
perl_matcher_non_recursive.hpp:
line 725:
//
// start by working out how much we can skip:
//
bool greedy = (rep->greedy) && (!(m_match_flags &
regex_constants::match_any) || m_independent);
line 753:
if(greedy)
{
if((rep->leading) && (count < rep->max))
restart = position;
This last line causes Boost.Regex to avoid exponential behavior. I don't
yet understand how this optimization works, or why match_any disables
it. John?
- Boost.Regex's regex_iterator invokes regex_search with a
match_results, so match_any is not set and the optimization kicks in. If
you pass the match_any flag to the regex_iterator constructor, the
constructor takes exponential time and throws on memory exhaustion.
- Boost.Xpressive lacks this optimization, and so always takes
exponential time, but does not exhaust memory and does not throw.
I'm attaching some code that demonstrates this strange situation. John,
could you comment on match_any and the re_repeat::leading optimization,
and why the regex algorithms are setting match_any?
Aries, your pattern presents a challenge to any backtracking regex
engine. Clever optimizations aside, there are some ways you can improve
the situation. In particular, think about wrapping your repeats in
independent sub-expressions like (?>[...]+). Unfortunately in this case,
that doesn't completely solve the problem because regex_search will
attempt a brute-force match at each of the 1 million positions within
the string. And, because of your pattern and the text against which
you're matching, each match will walk from the current position to the
end of the string. That's just going to be very, very slow.
-- Eric Niebler Boost Consulting www.boost-consulting.com
///////////////////////////////////////////////////////////////////////////////
// main.hpp
//
// Copyright 2007 Eric Niebler. Distributed under the Boost
// Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#include <cstring>
#include <iostream>
//#include <boost/regex.hpp>
#include <boost/xpressive/xpressive.hpp>
int main()
{
std::size_t const Mb = 1048576; // 1Mb
char *begin = new char[Mb];
char *end = begin + Mb;
std::memset(begin, 0x6f, 1048576);
char const *pattern = "((?>[a-z#~_\\.!\\#$%\\^&\\*\\(\\)\\-]+)@(?>[a-z#_\\-]+)\\.(?>[a-z#_\\-\\.]+))";
//try
//{
// using namespace boost;
// regex token(pattern);
// // fast, doesn't throw:
// cregex_iterator cur(begin, end, token);
// // slow, throws on memory exhaustion:
// regex_search(begin, end, token);
//}
//catch(std::exception const &e)
//{
// std::cout << "boost.regex error: " << e.what() << std::endl;
//}
try
{
using namespace boost::xpressive;
cregex token = cregex::compile(pattern, regex_constants::optimize);
// fast, doesn't throw:
cregex_iterator cur(begin, end, token);
// slow, throws on memory exhaustion:
//regex_search(begin, end, token);
}
catch(std::exception const &e)
{
std::cout << "boost.xpressive error: " << e.what() << std::endl;
}
return 0;
}
Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net