|
Boost : |
From: Soronel Haetir (soronel.haetir_at_[hidden])
Date: 2021-07-23 15:23:03
Phil Endecott,
Match time linear to the length of the string works fine if the regex
doesn't have alternations but breaks down completely if they are
present.
Imagine you have the regex
"auto|bool|const|continue|...|([A-Z_][A-Za-z0-9_]*)" set up to match
any C token (the ... in this case are missing keywords, integers, etc)
but your input is "continuation". This regex will try for each
alternate, get almost all the way with "continue" but then have to
backtrack and try the other alternates. It's only when you get to the
very last that the string will match. And it would be possible to
contrive even worse examples (regex match over a dictionary for
example)
John Maddock was describing regex in general as not being able to
specify the complexity rather than your specific example.
On 7/23/21, Phil Endecott via Boost <boost_at_[hidden]> wrote:
> Thanks all for your replies.
>
> John Maddock wrote:
>> Complexity for regular expression is really really hard to specify
>
> I disagree. Pretty much by definition, Regular Expressions can
> be matched in time linear in the length of the string and that's
> what I'd expect a std::c++ spec to require, just as it requires
> sorting to be N log N etc. etc. The fact that Perl >bleurgh< chose
> to provide what I'd call "irregular expressions" 25 years ago
> doesn't mean that we had to copy that. I reject the idea that we
> have to support back-references in patterns because "people expect
> that" - I've never used them.
>
>
> Dominique Devienne wrote:
>> Have you tried Russ Cox's RE2 from https://github.com/google/re2
>
> Thanks for the pointer. He has a series of three really good
> documents explaining the history of regular expression implementations
> and how we've slipped into this hole where the most common
> implementations have poor complexity guarantees. See:
>
> https://swtch.com/~rsc/regexp/
>
> This also prompted me to look at CTRE, which compiles regular
> expressions from strings at compile time and even works pre-C++20.
> See: https://github.com/hanickadot/compile-time-regular-expressions
>
>
> Gavin Lambert wrote:
>> In pretty much all regexp languages, if you want to match '-'
>> inside a character set then you must specify it as the first
>> character
>
> I was looking at the cppreference docs here:
>
> https://en.cppreference.com/w/cpp/regex/ecmascript
>
> The grammar they give includes:
>
> CharacterClass ::
>
> [ [ lookahead â {^}] ClassRanges ]
> [ ^ ClassRanges ]
>
> ClassRanges ::
>
> [empty]
> NonemptyClassRanges
>
> NonemptyClassRanges ::
>
> ClassAtom
> ClassAtom NonemptyClassRangesNoDash
> ClassAtom - ClassAtom ClassRanges
>
> ClassAtom ::
>
> -
> ClassAtomNoDash
> ClassAtomExClass(C++ only)
> ClassAtomCollatingElement(C++ only)
> ClassAtomEquivalence(C++ only)
>
> ClassAtomNoDash ::
>
> SourceCharacter but not one of \ or ] or -
> \ ClassEscape
>
>
> I believe that allows [A-Z0-9-_/], doesn't it?
>
>
> Anyway, all this prompted me to do some more investigation and
> some benchmarking. The libraries that I have tried are libstdc++
> (as supplied with g++ 8.3, so rather old), Boost.Regex,
> Boost.Xpressive (with run-time expression strings, not the
> Spirit-like compile time mode) (both Boost version 1.75),
> RE2, and CTRE.
>
> What I'm trying to do is to sanitise the input to an internet-
> exposed process, to reject malicious input'); drop table users;
> As an example I'll look at input that is supposed to be base-64
> encoded and no more than a couple of kilobytes long.
>
> Typical-case performance doesn't matter much as this runs once
> per process invocation (and hence also caching the compiled
> regex doesn't help), but I do want to be sure that it doesn't
> have bad worst-case complexity in the face of pathological
> input. So my first test is a quick check with a regular
> expression that should might trigger worst-case behaviour in
> a non-linear implementation:
>
> a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa
>
> matched against: aaaaaaaaaaaaaaaaaaaa
>
> The execution times were:
>
> CTRE: 1.1 us
> RE2: 148 us
> Xpressive: 27286 us
> Boost.Regex: 31 us
> libstdc++: 88632 us
>
> Based on that, Xpressive and libstdc++ can be rejected immediately.
> (Of course this doesn't prove that the others use exclusively-linear
> algorithms; they may have heuristics that handle that case or just
> got lucky; this is why I believe there should be complexity guarantees.)
>
> Here are the patterns that I have benchmarked for my base64 test:
>
> 1. [A-Za-z0-9/+=]{0,8192}
> 2. [A-Za-z0-9/+=]*
> 3.
> (?:[A-Za-z0-9/+]{4}){0,2048}(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
> 4. (?:[A-Za-z0-9/+]{4})*(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
>
> Recall that base64 has chunks of four printable characters
> with the final chunk using = to pad. Variants 3 & 4 strictly
> check the padding. Variants 1 and 3 check for excessive length
> while 2 & 4 require a separate check to do that.
>
> Note that I'm using the "non capturing" syntax (?: ) rather
> than ( ) because I only need the boolean match result.
>
> First a note on compatibility. I noted before that expressions
> like [A-Za-z0-9-_/] were accepted by some libraries but not others.
> I found two other issues: only libstdc++ would accept [A-Z]{4}*,
> while the others all required ([A-Z{4})*. Then RE2 rejected the
> {0,8192} and {0,2048} repeats - it limits them to some smaller
> value.
>
> A note on compile times (g++ 8.3 -O3): there was a substantial
> variation, with RE2 and CTRE being the fastest, Boost.Regex and
> libstdc++ in the middle, and Boost.Xpressive slowest. The
> difference from fastest to slowest was about 10X. It was interesting
> that the "Compile Time Regular Expression" library CTRE was one of
> the fastest to compile!
>
> Regarding run-time performance, testing with about 3 kbytes of
> input data: CTRE was fastest. RE2 was second in the two expressions
> that it did not reject. Boost.Regex was last.
>
> My conclusion is that CTRE is the best choice, and I would recommend
> it unless (a) you need to specify the regular expression at runtime,
> or (b) you need some of the "irregular" Perl extension syntax.
>
> I hope that is of interest.
>
>
> Regards, Phil.
>
>
> P.S. I am subscribed to the list digest, which used to flush whatever
> had been posted at least once per day; now it doesn't seem to send
> anything until it has reached its threshold. Do others see this or is
> it just me? Can it be fixed? I would have replied earlier, and
> separately to the other replies, if I had received a digest at some
> point.
>
>
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>
-- Soronel Haetir soronel.haetir_at_[hidden]
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk