From: John Maddock (john_at_[hidden])
Date: 2004-04-07 05:54:41


> Actually, I was asking about initial construction cost, in particular
> of an object representing a failed match. The acceptance of N1610
> means that copy costs should be insignificant for cases like this one,
> provided that the smatch author puts in the required effort to make it
> moveable. ;-)

Sounds like a hint - maybe if we could make shared_ptr moveable then we
could all delegate the work to that :-)

As for the initial construction cost: yes, there is one. The matcher needs
some working space, so it allocates memory for the sub-expression matches
and starts storing them before it knows whether there will be a match.

Consider your current code:

    std::string line;
    boost::regex pat("^Subject: (Re: )?(.*)");
    boost::smatch matches;

    while (std::cin)
    {
        std::getline(std::cin, line);
        if (boost::regex_match(line, matches, pat))
            std::cout << matches[2];
    }

The first time regex_match gets called it allocates the storage it needs in
the match_results class; subsequent calls then re-use this storage. This is
efficient: the cost of a single memory allocation is about 10 times that of
a simple regex_match attempt, so this is very important IMO. In fact I
spent a lot of last year eliminating unnecessary memory allocations from
regex, and there are some more I intend to stamp on this year. Believe me,
it makes a difference, and other libraries like GRETA and PCRE have all
been through the same process for the same reasons. In contrast, if
regex_match returns a match_results structure then you effectively
"pessimise" the performance for a small improvement in ease of use
(although I admit that there are options similar to the small-string
optimisation that *might* be applicable here).
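
To make the reuse pattern concrete, here is a minimal sketch. It assumes
std::regex and std::smatch from the standard library as stand-ins for
boost::regex and boost::smatch (chosen only so the example has no external
dependency), and the helper name collect_subjects is hypothetical. Whether
storage really is reused between calls depends on the implementation.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the subject text of every matching line.
// std::regex / std::smatch stand in for boost::regex / boost::smatch.
std::vector<std::string> collect_subjects(const std::vector<std::string>& lines)
{
    const std::regex pat("^Subject: (Re: )?(.*)");
    std::smatch matches;                  // constructed once, outside the loop
    std::vector<std::string> subjects;

    for (const std::string& line : lines)
    {
        // The same match_results object is passed to every attempt, so the
        // implementation is free to reuse storage it acquired earlier.
        if (std::regex_match(line, matches, pat))
            subjects.push_back(matches[2].str());   // copy before line changes
    }
    return subjects;
}
```

The sub-match is copied out with str() straight away because the iterators
held in matches refer to the line that was just matched.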

BTW, just to be hypercritical, your alternative code:

    std::string line;
    boost::regex pat("^Subject: (Re: )?(.*)");

    while (std::cin)
    {
        std::getline(std::cin, line);
        if (boost::smatch m = boost::regex_match(line, pat))
            std::cout << m[2];
    }

initialises a variable inside an if condition, which, while "neat", I have
often seen criticised as potentially error prone; there are even some
compilers that throw out a helpful(!) warning if you do that (along the
lines of "didn't you want to use operator==?").
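
For readers wondering what the alternative would need from the library,
here is a rough sketch of a result type that can be tested in a condition.
The class match_attempt is hypothetical and is not part of Boost.Regex or
the standard library. It assumes a C++11 or later compiler, where an
explicit operator bool does the job that the safe_bool idiom did, and it
uses std::regex as a stand-in for boost::regex.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical wrapper, NOT an existing Boost or standard library type.
// It runs one match attempt and remembers whether it succeeded.
class match_attempt
{
public:
    // The caller must keep text alive while the results are in use,
    // because the sub-matches hold iterators into it.
    match_attempt(const std::string& text, const std::regex& pat)
        : matched_(std::regex_match(text, results_, pat)) {}

    // explicit: usable in an if/while condition, but no silent
    // conversion to int in arithmetic or comparisons.
    explicit operator bool() const { return matched_; }

    // Returns a copy of the i-th sub-match.
    std::string operator[](std::size_t i) const { return results_[i].str(); }

private:
    std::smatch results_;   // declared before matched_, so initialised first
    bool matched_;
};
```

With this, "if (match_attempt m{line, pat})" compiles and m[2] is available
inside the branch. Note that every attempt constructs a fresh std::smatch,
which is the per-call cost discussed above.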

> > One other thing - the current regex_match overload that doesn't take
> > a match_results as a parameter currently returns bool - the intent
> > is that if the user doesn't need the info generated in the
> > match_results, then some time can be saved by not storing it.
> > Boost.Regex doesn't currently take advantage of that, but I was
> > planning to in the next revision (basically you can cut out memory
> > allocation altogether, and that's an order of magnitude saving).
>
> But I do need the match results, when the match succeeds.

I understand that, but there is a group of users who don't - one example is
a (commercial) email spam-filter that uses Boost.Regex. It only needs a
true/false result "does this message have this pattern or not", and it wants
the answer as fast as possible. For uses like this even a small change in
performance can make the difference between "coping" and "not coping" with
the email traffic they're seeing these days.
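
A yes/no check of that kind might look like the sketch below. It uses the
overload that takes no match_results argument and returns only a bool.
std::regex_search stands in for the Boost equivalent here, and both the
function name message_has_pattern and the sample pattern in the usage note
are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical filter check: true if the pattern occurs anywhere in the
// message. No match_results object is supplied, so nothing about the
// sub-expressions is reported back to the caller.
bool message_has_pattern(const std::string& message, const std::regex& pat)
{
    return std::regex_search(message, pat);
}
```

For example, with a pattern such as "cheap (pills|watches)" compiled with
the icase flag, the function reports whether a message contains either
phrase regardless of case.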

> I guess my original suggestion of making it implicitly convertible to
> some safe_bool solves that problem. I guess I prefer that idea,
> though Allan probably has more experience with this than I do.

OK, let me mull this over; maybe we can find a way to keep everyone happy,
maybe not ...

John.

