Boost logo

Boost :

From: John Maddock (jz.maddock_at_[hidden])
Date: 2021-07-21 16:47:11

On 21/07/2021 16:19, Phil Endecott via Boost wrote:
> Dear Experts,
> I recently wanted to add some input-sanitising code to a C++
> web application, and as I had some existing Javascript code
> I decided to port its regex-based input checking to the C++
> code. This didn't go as well as I had expected!
> Here's an example of the sort of thing I have:
> if (!regexp_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw
> MalformedInput();

I would definitely cache the regex object in a case like that -
constructing and analyzing the expression is quite a significant task.

> The Javascript equivalent seems to be fine. In C++, with
> libstdc++'s regex implementation, it seems to take about 2
> seconds to run. Boost.Regex didn't like the '-' after the '9'
> but when I fixed that the execution time became negligible.
> Boost.Xpressive seems also to be fast.
> On investigation it seems that this is due to there being two
> different algorithms, one of which (DFA) is linear in the string
> size but exponential in the pattern size (which in this case seems
> to mean exponential in the max repeat value of 8192), and the
> other (backtracking) is linear in the pattern size but exponential
> in the string size.
> It seems that the standard deliberately doesn't specify complexity
> guarantees for std::regex, leaving the choice of algorithm to the
> implementation. I find this a bit surprising. It's true that
> both algorithms have their strengths, but not knowing which
> you are going to get is the worst of both worlds - you would
> have to assume that it will be exponential in both pattern
> and string sizes in portable code! (Imagine a container that
> doesn't disclose whether it has vector or list complexity!)
Complexity for regular expression is really really hard to specify - in
particular matching Perl style regular expressions are known to
NP-complete in the worst case, but fortunately the worst case is rarely
encountered, and in the most normal usage, Perl regexes are reasonably
> So I'm wondering if it would be useful for regex libraries to
> somehow be explicit about which algorithm they are and what
> the complexity is.
> Compare, for example, with the text search algorithms
> (Boyer-Moore etc) which indicate the algorithm in their names.
> Could we have regex_dfa / regex_backtracking, or something?
> Now having said that, I think that what I really need for my
> application is something that is worst-case linear in both the
> pattern size and the input string. I'm not sure if the
> apparently-fast Boost.Regex algorithm has a worst-case input
> that is slow. ("Regular Expression Denial Of Service").

What Boost.Regex has that other libraries do not (so far as I know), is
a cap on the maximum number of machine states which may be visited while
attempting a match.  This is quite deliberately to prevent DOS attacks,
and was built into the library from the start.  The limit is set at
approximately O(N^2) before a runtime error is triggered.

HTH, John.

> It's
> clear that my example can be written without a regex as a
> simple linear loop, and maybe some regex libraries have
> heuristics that do that.
> Thoughts anyone?
> Regards, Phil.
> _______________________________________________
> Unsubscribe & other changes:

This email has been checked for viruses by Avast antivirus software.

Boost list run by bdawes at, gregod at, cpdaniel at, john at