From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2021-07-21 16:34:41


On 7/21/21 6:19 PM, Phil Endecott via Boost wrote:
> Dear Experts,
>
> I recently wanted to add some input-sanitising code to a C++
> web application, and as I had some existing Javascript code
> I decided to port its regex-based input checking to the C++
> code. This didn't go as well as I had expected!
>
> Here's an example of the sort of thing I have:
>
> if (!regex_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw
> MalformedInput();
>
> The Javascript equivalent seems to be fine. In C++, with
> libstdc++'s regex implementation, it seems to take about 2
> seconds to run. Boost.Regex didn't like the '-' after the '9'
> but when I fixed that the execution time became negligible.
> Boost.Xpressive seems also to be fast.

I haven't benchmarked regex implementations for a few years, but my past
experience was that libstdc++ std::regex was indeed very slow. I don't
remember the numbers, but my general impression was that all std::regex
implementations were slow compared to Boost.Regex, which seems to be one
of the fastest implementations. Boost.Xpressive is slower than
Boost.Regex, but faster than std::regex; its main advantage is that it
is header-only. Basically, use Boost.Regex unless you really can't.
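
For reference, the equivalent check with Boost.Regex could look roughly
like this (a sketch; note the '-' moved to the end of the character
class so it is taken literally rather than as a range):

  #include <boost/regex.hpp>

  if (!boost::regex_match(s, boost::regex("[A-Za-z0-9_/-]{1,8192}")))
      throw MalformedInput();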

> On investigation it seems that this is due to there being two
> different algorithms, one of which (DFA) is linear in the string
> size but exponential in the pattern size (which in this case seems
> to mean exponential in the max repeat value of 8192), and the
> other (backtracking) is linear in the pattern size but exponential
> in the string size.
>
> It seems that the standard deliberately doesn't specify complexity
> guarantees for std::regex, leaving the choice of algorithm to the
> implementation. I find this a bit surprising. It's true that
> both algorithms have their strengths, but not knowing which
> you are going to get is the worst of both worlds - you would
> have to assume that it will be exponential in both pattern
> and string sizes in portable code! (Imagine a container that
> doesn't disclose whether it has vector or list complexity!)
>
> So I'm wondering if it would be useful for regex libraries to
> somehow be explicit about which algorithm they are and what
> the complexity is.
>
> Compare, for example, with the text search algorithms
> (Boyer-Moore etc) which indicate the algorithm in their names.
> Could we have regex_dfa / regex_backtracking, or something?
>
> Now having said that, I think that what I really need for my
> application is something that is worst-case linear in both the
> pattern size and the input string. I'm not sure if the
> apparently-fast Boost.Regex algorithm has a worst-case input
> that is slow. ("Regular Expression Denial Of Service"). It's
> clear that my example can be written without a regex as a
> simple linear loop, and maybe some regex libraries have
> heuristics that do that.
>
> Thoughts anyone?
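
Regarding your last point - yes, this particular check can be written
as a plain loop that is linear in the input, e.g. (a rough sketch of
the character set you presumably intended):

  #include <string>

  bool is_valid(const std::string& s)
  {
      // Accept 1 to 8192 characters from [A-Za-z0-9-_/]
      if (s.empty() || s.size() > 8192)
          return false;
      for (char c : s)
      {
          bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                    (c >= '0' && c <= '9') ||
                    c == '-' || c == '_' || c == '/';
          if (!ok)
              return false;
      }
      return true;
  }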

Have you tried caching the regex object? I suspect parsing the regex
might have a non-negligible cost.
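
For example (a rough sketch), something like this parses the expression
only once and reuses it on every call:

  #include <regex>
  #include <string>

  bool check_input(const std::string& s)
  {
      // Constructed (and parsed) once, on the first call
      static const std::regex rx("[A-Za-z0-9_/-]{1,8192}");
      return std::regex_match(s, rx);
  }

The same applies to boost::regex, of course.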

As for the complexity specification, it obviously depends on the regex.
I don't think it is possible to specify an abstract complexity estimate
for a regex library as a whole. The best you could do is specify the
complexity of the various elements of the expression - both when
constructing a regex and when applying it.

Complexity is not everything you need to know when you care about
performance. You should benchmark it anyway. It is not unusual for a
higher-complexity algorithm to outperform a lower-complexity one under
some conditions, which may happen to match yours. Even two algorithms
with the same formal complexity estimate may perform very differently
in practice. I remember the Boost.Regex docs had some benchmark
results, although by now they are probably outdated.
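
If you want numbers for your own patterns and inputs, a minimal timing
harness along these lines would do (a sketch; substitute your real data
and run the same loop with std::regex for comparison):

  #include <boost/regex.hpp>
  #include <chrono>
  #include <iostream>
  #include <string>

  int main()
  {
      std::string s(8000, 'a'); // placeholder input, use your real data
      boost::regex rx("[A-Za-z0-9_/-]{1,8192}");

      auto start = std::chrono::steady_clock::now();
      bool ok = false;
      for (int i = 0; i < 1000; ++i)
          ok = boost::regex_match(s, rx);
      auto end = std::chrono::steady_clock::now();

      std::cout << "match: " << ok << ", "
                << std::chrono::duration_cast<std::chrono::microseconds>(
                       end - start).count() / 1000
                << " us per iteration\n";
  }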

