From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2021-10-14 12:03:40
Vinnie Falco wrote:
> On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost
> <boost_at_[hidden]> wrote:
>> There's also the question of worst-case complexity; does your parser
>> backtrack? If so, is there a pathological input that causes it to
>> take exponential time?
>
> I have done no profiling or optimization to speak of except some minor
> design work to facilitate the use of character-based SIMD algorithms
> later. I don't have a monolithic "parser" per-se, there are classes
> called "bnf elements" which can be used in the generic bnf::parse
> function. These elements call parse on child elements recursively. I
> do backtrack. For example, URI-reference is defined as:
>
> URI-reference = URI / relative-ref
>
> The class uri_reference_bnf implements this by first attempting to
> parse the scheme and colon, and if that fails then it backtracks and
> attempts parsing a "relative-ref." You can see this here:
>
> https://github.com/CPPAlliance/url/blob/680f6db547ffc3367e1ba93bae29e4c9278fd0e7/include/boost/url/rfc/impl/uri_reference_bnf.ipp#L35
>
> I'm not really sure how this can be improved, you need to know if
> there's a scheme and that can only be determined by scanning forward
> until reaching a colon or an invalid character for scheme. In theory
> this could scan the entire input. Thus I have not put any effort into
> making this better.
>
>> Can you add a complexity guarantee to the
>> docs? (I posted about this in the context of regular expression
>> libraries a few weeks ago.)
>
> Maybe... I did it for JSON because performance is critical and the
> difference between a poorly performing library and a greatly
> performing library is massive. I'm not sure that URL parsing is the
> same. URLs tend to be short and it is sort of difficult to really get
> bad performance out of parsing them even if you go a character at a
> time.
data: URLs are a good example of long URLs that actually exist in
practice. But what I'm really thinking about is guarding against
maliciously malformed input. You don't want your parser to have N^2,
let alone exp(N), complexity when the input could legitimately be a
data: URL of a few kbytes.
And there's no need for it to have worse than linear complexity,
because I'm pretty sure that the URL syntax is a Regular Language (in
the computer science sense, not in the "regular expression is a
synonym for search pattern" I'm-a-perl-programmer sense), and every
regular language can be parsed in linear time.
Regarding your example, i.e. "http://foo.com/" vs the relative URL
"http.html", the point is that when the parser has reached the 'p'
it will be in a state meaning "either this is a scheme or the first
segment of the path of a relative URL". The transformation from
BNF to a no-backtracking state machine of that sort is what tools
like flex and libraries like RE2 do.
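To make the scheme-vs-relative-ref ambiguity concrete: the forward scan
for the colon only ever moves right and visits each character once, so
even the "backtracking" case is a single bounded lookahead. Here is a
minimal sketch (my own illustration, not Boost.URL's actual code) of
that one-pass decision, using the RFC 3986 scheme grammar
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ):

```cpp
#include <cctype>
#include <string_view>

// Single forward scan: decide whether the input starts with "scheme:".
// Each character either keeps us in the "could still be a scheme" state
// or resolves the ambiguity; we never re-read earlier input.
bool starts_with_scheme(std::string_view s)
{
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
        return false;                  // scheme must begin with ALPHA
    for (std::size_t i = 1; i < s.size(); ++i)
    {
        char c = s[i];
        if (c == ':')
            return true;               // "scheme:" confirmed
        bool scheme_char =
            std::isalnum(static_cast<unsigned char>(c)) ||
            c == '+' || c == '-' || c == '.';
        if (!scheme_char)
            return false;              // e.g. '/' => relative-ref
    }
    return false;                      // input ended before any colon
}
```

A flex/RE2-style DFA would go one step further and fuse this lookahead
with the rest of the parse, so the characters consumed while looking
for the colon are simultaneously consumed as path-segment characters.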
Now having said all that, I do suspect that the particular grammar of
URLs can be parsed even by a backtracking parser in linear time (with
some constant factor) - but it will depend on the rules.
I think you aspire to make the library "secure", and if that includes
not having pathological performance in the presence of malicious
input, you need to understand this stuff!
>> auto url = parse_url(str);
>>
>> The problem is that url is a view of the string parameter, but
>> nothing warns the user about that and...
>
> Well, it is documented profusely. I agree there is some danger here
> but with great danger comes great power. Or something about great
> responsibility?
I don't think this is acceptable. What do others think?
>> I would hope that the path would be similar to a std::filesystem::path
>> i.e. it should use the same vocabulary where possible (or could
>> even actually be a std::filesystem::path, on C++17).
>
> OOF I hope never to see this. std::filesystem is kind of a mess.
Well it's not my top favourite library either. But it's familiar,
and the authors have considered many of the issues that have been
mentioned in other threads, e.g. whether path components include
directory separators, distinguishing absolute from relative paths,
etc.
Having mentioned data: URLs above, I have tried to understand how
these would be handled by your library. (mailto: also.) Note that a
data: URL will generally contain '/'s, e.g. in the content type
(image/jpeg) and also in the base64 data, but these are not
directory separators.
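To sketch what I mean by a body accessor for such schemes (this is my
own hypothetical helper, not a proposal for the library's actual API):
for data: or mailto:, the interesting part is everything after
"scheme:", and splitting it on '/' as if those were path segments would
be wrong.

```cpp
#include <optional>
#include <string_view>

// Hypothetical illustration: return the raw opaque body after the
// scheme and colon, leaving any embedded '/' characters untouched.
std::optional<std::string_view> opaque_body(std::string_view url)
{
    auto colon = url.find(':');
    if (colon == std::string_view::npos)
        return std::nullopt;          // no scheme => no opaque body
    return url.substr(colon + 1);     // '/'s here are content, not paths
}
```

So for "data:image/jpeg;base64,..." this would yield
"image/jpeg;base64,..." as one opaque string, rather than a path with
"image" and "jpeg;base64,..." as segments.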
I think what's really needed is a specific getter/setter for the
body of an "unconventional" URL scheme like these. It may be
possible to use your set_encoded_path(), but it's not obvious.
Please fix the docs for set[_encoded]_query(), which currently
seem to be confused with the operations for fragments.
Regards, Phil.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk