Boost logo

Boost :

From: John Maddock (jz.maddock_at_[hidden])
Date: 2021-07-21 16:47:11


On 21/07/2021 16:19, Phil Endecott via Boost wrote:
> Dear Experts,
>
> I recently wanted to add some input-sanitising code to a C++
> web application, and as I had some existing Javascript code
> I decided to port its regex-based input checking to the C++
> code. This didn't go as well as I had expected!
>
> Here's an example of the sort of thing I have:
>
> if (!regexp_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw
> MalformedInput();

I would definitely cache the regex object in a case like that -
constructing and analyzing the expression is quite a significant task.

>
> The Javascript equivalent seems to be fine. In C++, with
> libstdc++'s regex implementation, it seems to take about 2
> seconds to run. Boost.Regex didn't like the '-' after the '9'
> but when I fixed that the execution time became negligible.
> Boost.Xpressive seems also to be fast.
>
> On investigation it seems that this is due to there being two
> different algorithms, one of which (DFA) is linear in the string
> size but exponential in the pattern size (which in this case seems
> to mean exponential in the max repeat value of 8192), and the
> other (backtracking) is linear in the pattern size but exponential
> in the string size.
>
> It seems that the standard deliberately doesn't specify complexity
> guarantees for std::regex, leaving the choice of algorithm to the
> implementation. I find this a bit surprising. It's true that
> both algorithms have their strengths, but not knowing which
> you are going to get is the worst of both worlds - you would
> have to assume that it will be exponential in both pattern
> and string sizes in portable code! (Imagine a container that
> doesn't disclose whether it has vector or list complexity!)
Complexity for regular expression is really really hard to specify - in
particular matching Perl style regular expressions are known to
NP-complete in the worst case, but fortunately the worst case is rarely
encountered, and in the most normal usage, Perl regexes are reasonably
performant.
>
> So I'm wondering if it would be useful for regex libraries to
> somehow be explicit about which algorithm they are and what
> the complexity is.
>
> Compare, for example, with the text search algorithms
> (Boyer-Moore etc) which indicate the algorithm in their names.
> Could we have regex_dfa / regex_backtracking, or something?
>
> Now having said that, I think that what I really need for my
> application is something that is worst-case linear in both the
> pattern size and the input string. I'm not sure if the
> apparently-fast Boost.Regex algorithm has a worst-case input
> that is slow. ("Regular Expression Denial Of Service").

What Boost.Regex has that other libraries do not (so far as I know), is
a cap on the maximum number of machine states which may be visited while
attempting a match.  This is quite deliberately to prevent DOS attacks,
and was built into the library from the start.  The limit is set at
approximately O(N^2) before a runtime error is triggered.

HTH, John.

> It's
> clear that my example can be written without a regex as a
> simple linear loop, and maybe some regex libraries have
> heuristics that do that.
>
> Thoughts anyone?
>
>
> Regards, Phil.
>
>
>
>
>
>
>
>
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk