
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2021-07-23 14:45:52


Thanks all for your replies.

John Maddock wrote:
> Complexity for regular expression is really really hard to specify

I disagree. Pretty much by definition, Regular Expressions can
be matched in time linear in the length of the string and that's
what I'd expect a std::c++ spec to require, just as it requires
sorting to be N log N etc. etc. The fact that Perl >bleurgh< chose
to provide what I'd call "irregular expressions" 25 years ago
doesn't mean that we had to copy that. I reject the idea that we
have to support back-references in patterns because "people expect
that" - I've never used them.

Dominique Devienne wrote:
> Have you tried Russ Cox's RE2 from https://github.com/google/re2

Thanks for the pointer. He has a series of three really good
documents explaining the history of regular expression implementations
and how we've slipped into this hole where the most common
implementations have poor complexity guarantees. See:

https://swtch.com/~rsc/regexp/

This also prompted me to look at CTRE, which compiles regular
expressions from strings at compile time and even works pre-C++20.
See: https://github.com/hanickadot/compile-time-regular-expressions

Gavin Lambert wrote:
> In pretty much all regexp languages, if you want to match '-'
> inside a character set then you must specify it as the first
> character

I was looking at the cppreference docs here:

https://en.cppreference.com/w/cpp/regex/ecmascript

The grammar they give includes:

CharacterClass ::

    [ [ lookahead ∉ {^}] ClassRanges ]
    [ ^ ClassRanges ]

ClassRanges ::

    [empty]
    NonemptyClassRanges

NonemptyClassRanges ::

    ClassAtom
    ClassAtom NonemptyClassRangesNoDash
    ClassAtom - ClassAtom ClassRanges
 
ClassAtom ::

    -
    ClassAtomNoDash
    ClassAtomExClass (C++ only)
    ClassAtomCollatingElement (C++ only)
    ClassAtomEquivalence (C++ only)

ClassAtomNoDash ::

    SourceCharacter but not one of \ or ] or -
    \ ClassEscape

I believe that allows [A-Z0-9-_/], doesn't it?

Anyway, all this prompted me to do some more investigation and
some benchmarking. The libraries that I have tried are libstdc++
(as supplied with g++ 8.3, so rather old), Boost.Regex,
Boost.Xpressive (with run-time expression strings, not the
Spirit-like compile time mode) (both Boost version 1.75),
RE2, and CTRE.

What I'm trying to do is to sanitise the input to an internet-
exposed process, to reject malicious input'); drop table users;
As an example I'll look at input that is supposed to be base-64
encoded and no more than a couple of kilobytes long.

Typical-case performance doesn't matter much as this runs once
per process invocation (and hence also caching the compiled
regex doesn't help), but I do want to be sure that it doesn't
have bad worst-case complexity in the face of pathological
input. So my first test is a quick check with a regular
expression that might trigger worst-case behaviour in
a non-linear implementation:

a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa

matched against: aaaaaaaaaaaaaaaaaaaa

The execution times were:

CTRE: 1.1 us
RE2: 148 us
Xpressive: 27286 us
Boost.Regex: 31 us
libstdc++: 88632 us

Based on that, Xpressive and libstdc++ can be rejected immediately.
(Of course this doesn't prove that the others use exclusively-linear
algorithms; they may have heuristics that handle that case or just
got lucky; this is why I believe there should be complexity guarantees.)
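
A minimal sketch of how a check of this shape could be timed, using
std::regex only because it needs nothing beyond the standard library.
The names pathological_pattern, TimedMatch and time_pathological_match
are hypothetical, and timing compilation and matching together is an
assumption; the figures above were not necessarily measured that way.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Builds "a?" repeated n times followed by "a" repeated n times,
// e.g. n = 2 gives "a?a?aa".
std::string pathological_pattern(std::size_t n) {
    std::string pattern;
    for (std::size_t i = 0; i < n; ++i) pattern += "a?";
    pattern.append(n, 'a');
    return pattern;
}

struct TimedMatch {
    bool matched;            // whether the whole subject matched
    long long microseconds;  // wall-clock time for compile + match
};

// Compiles the pattern and matches it against a subject of n 'a'
// characters. Both steps are timed together (an assumption), since
// the regex is used only once per process in the scenario described.
TimedMatch time_pathological_match(std::size_t n) {
    const std::string subject(n, 'a');
    const auto start = std::chrono::steady_clock::now();
    const std::regex re(pathological_pattern(n));
    const bool matched = std::regex_match(subject, re);
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return {matched, static_cast<long long>(elapsed.count())};
}
```

Calling time_pathological_match with increasing n and watching how the
time grows gives a rough indication of whether an implementation
backtracks; absolute numbers depend on compiler, flags and machine.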

Here are the patterns that I have benchmarked for my base64 test:

1. [A-Za-z0-9/+=]{0,8192}
2. [A-Za-z0-9/+=]*
3. (?:[A-Za-z0-9/+]{4}){0,2048}(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
4. (?:[A-Za-z0-9/+]{4})*(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))

Recall that base64 has chunks of four printable characters
with the final chunk using = to pad. Variants 3 & 4 strictly
check the padding. Variants 1 and 3 check for excessive length
while 2 & 4 require a separate check to do that.

Note that I'm using the "non capturing" syntax (?: ) rather
than ( ) because I only need the boolean match result.
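
As an illustration of variant 4 combined with a separate length check,
here is a sketch using std::regex, chosen only because it is in the
standard library and not because it performs well. The function name
is_valid_base64 and the default limit of 8192 (taken from the bound in
variant 1) are assumptions; the same pattern string could be handed to
any of the other libraries that accept it.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Variant 4: any number of four-character groups, followed by
// nothing, or by one padded group of the form "xx==" or "xxx=".
const char* const kBase64Pattern =
    "(?:[A-Za-z0-9/+]{4})*"
    "(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))";

// Returns true if input is no longer than max_length and consists
// entirely of well-formed base64 groups with correct padding.
// The length is checked first, so over-long input never reaches
// the regex engine.
bool is_valid_base64(const std::string& input,
                     std::size_t max_length = 8192) {
    if (input.size() > max_length) return false;
    const std::regex re(kBase64Pattern);
    // regex_match requires the whole string to match, so trailing
    // junk after valid base64 is rejected.
    return std::regex_match(input, re);
}
```

With this sketch, "QUJD", "QUI=" and "QQ==" are accepted, while "QQ="
and anything containing characters outside the base64 alphabet are
rejected.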

First a note on compatibility. I noted before that expressions
like [A-Za-z0-9-_/] were accepted by some libraries but not others.
I found two other issues: only libstdc++ would accept [A-Z]{4}*,
while the others all required ([A-Z]{4})*. Then RE2 rejected the
{0,8192} and {0,2048} repeats - it limits them to some smaller
value.
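
Whether a given library accepts a given pattern can be probed by
attempting to construct it. A sketch for std::regex follows; the
helper name compiles is hypothetical, and the other libraries report
syntax errors through their own mechanisms, which are not shown here.

```cpp
#include <regex>
#include <string>

// Returns true if std::regex accepts the pattern under its default
// ECMAScript grammar, and false if construction throws
// std::regex_error (for example on an unterminated bracket
// expression).
bool compiles(const std::string& pattern) {
    try {
        const std::regex re(pattern);
        (void)re;  // only construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Running compiles over the patterns in question, such as
"[A-Za-z0-9-_/]" or "[A-Z]{4}*", shows what one particular standard
library does with them; the result may differ between implementations
and versions.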

A note on compile times (g++ 8.3 -O3): there was a substantial
variation, with RE2 and CTRE being the fastest, Boost.Regex and
libstdc++ in the middle, and Boost.Xpressive slowest. The
difference from fastest to slowest was about 10X. It was interesting
that the "Compile Time Regular Expression" library CTRE was one of
the fastest to compile!

Regarding run-time performance, testing with about 3 kbytes of
input data: CTRE was fastest. RE2 was second in the two expressions
that it did not reject. Boost.Regex was last.

My conclusion is that CTRE is the best choice, and I would recommend
it unless (a) you need to specify the regular expression at runtime,
or (b) you need some of the "irregular" Perl extension syntax.

I hope that is of interest.

Regards, Phil.

P.S. I am subscribed to the list digest, which used to flush whatever
had been posted at least once per day; now it doesn't seem to send
anything until it has reached its threshold. Do others see this or is
it just me? Can it be fixed? I would have replied earlier, and
separately to the other replies, if I had received a digest at some
point.

