Boost logo

Boost :

From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2021-07-21 15:19:18


Dear Experts,

I recently wanted to add some input-sanitising code to a C++
web application, and as I had some existing Javascript code
I decided to port its regex-based input checking to the C++
code. This didn't go as well as I had expected!

Here's an example of the sort of thing I have:

if (!regexp_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw MalformedInput();

The Javascript equivalent seems to be fine. In C++, with
libstdc++'s regex implementation, it seems to take about 2
seconds to run. Boost.Regex didn't like the '-' after the '9'
but when I fixed that the execution time became negligible.
Boost.Xpressive seems also to be fast.

On investigation it seems that this is due to there being two
different algorithms, one of which (DFA) is linear in the string
size but exponential in the pattern size (which in this case seems
to mean exponential in the max repeat value of 8192), and the
other (backtracking) is linear in the pattern size but exponential
in the string size.

It seems that the standard deliberately doesn't specify complexity
guarantees for std::regex, leaving the choice of algorithm to the
implementation. I find this a bit surprising. It's true that
both algorithms have their strengths, but not knowing which
you are going to get is the worst of both worlds - you would
have to assume that it will be exponential in both pattern
and string sizes in portable code! (Imagine a container that
doesn't disclose whether it has vector or list complexity!)

So I'm wondering if it would be useful for regex libraries to
somehow be explicit about which algorithm they are and what
the complexity is.

Compare, for example, with the text search algorithms
(Boyer-Moore etc) which indicate the algorithm in their names.
Could we have regex_dfa / regex_backtracking, or something?

Now having said that, I think that what I really need for my
application is something that is worst-case linear in both the
pattern size and the input string. I'm not sure if the
apparently-fast Boost.Regex algorithm has a worst-case input
that is slow. ("Regular Expression Denial Of Service"). It's
clear that my example can be written without a regex as a
simple linear loop, and maybe some regex libraries have
heuristics that do that.

Thoughts anyone?

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk