
From: Vinnie Falco (vinnie.falco_at_[hidden])
Date: 2021-10-13 16:02:45


On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost
<boost_at_[hidden]> wrote:
> Have you looked at the RFC 3986 errata at
> https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0

Ruh roh... I have not gotten to that yet, but yes, I see problems :)
Not difficult problems, but there are relevant errata that must be
fixed before any review. Thanks!

> ? For example, I note that your rule for relative-part on the
> "Parsing" page doesn't seem to include the path-abempty alternative
> added in erratum 5428.

Thanks, I need to update the javadocs and exposition. That said, I
think the parsing is still correct despite not showing path-abempty. I
will add more tests.

> I would be interested to see how compile time, run time, and
> lines-of-code compare between your implementation, a Spirit parser,
> and a regexp.

Someone volunteered to do that, so I might explore it at some point.

> You also have the choice of (a) parsing and storing
> pointers to each component as I believe you do,

Right, the library keeps a small table of offsets to the various
parts. For the modifiable containers I plan to also keep a table for
the segments and query parameters, to allow O(1) access to indexed
elements (right now it's linear). I think this is a good balance
between accessing parts of the URL and speeding up mutation
operations.
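To illustrate the idea (this is a toy sketch, not the library's actual layout or API): storing the starting offset of each path segment alongside the string makes indexed segment access O(1), at the cost of maintaining the table during mutation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch of the offset-table idea: alongside the path
// string, keep the starting offset of each segment so segment(i) is
// O(1) instead of rescanning the string each time.
struct path_index {
    std::string s;                  // e.g. "/a/bb/ccc"
    std::vector<std::size_t> off;   // start offset of each segment
    explicit path_index(std::string p) : s(std::move(p)) {
        for (std::size_t i = 0; i < s.size(); ++i)
            if (s[i] == '/')
                off.push_back(i + 1);   // a segment begins after '/'
    }
    std::size_t size() const { return off.size(); }
    std::string_view segment(std::size_t i) const {   // O(1) lookup
        std::size_t b = off[i];
        std::size_t e = (i + 1 < off.size()) ? off[i + 1] - 1 : s.size();
        return std::string_view(s).substr(b, e - b);
    }
};
```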

> (b) could be better if
> you're mostly just passing around complete URLs, storing them in
> containers, etc.

Well, in url the table adds overhead. If you don't want that overhead,
then just store the URL string :) and parse it again when you need it.
This is equivalent to your (b), with less impact on the library.

> There's also the question of worst-case complexity; does your parser
> backtrack? If so, is there a pathological input that causes it to
> take exponential time?

I have done no profiling or optimization to speak of, except some
minor design work to facilitate the use of character-based SIMD
algorithms later. I don't have a monolithic "parser" per se; there are
classes called "bnf elements" which can be used with the generic
bnf::parse function. These elements call parse on child elements
recursively. I do backtrack. For example, URI-reference is defined as:

    URI-reference = URI / relative-ref

The class uri_reference_bnf implements this by first attempting to
parse the scheme and colon, and if that fails then it backtracks and
attempts parsing a "relative-ref." You can see this here:

https://github.com/CPPAlliance/url/blob/680f6db547ffc3367e1ba93bae29e4c9278fd0e7/include/boost/url/rfc/impl/uri_reference_bnf.ipp#L35

I'm not really sure how this can be improved: you need to know whether
there's a scheme, and that can only be determined by scanning forward
until reaching a colon or a character that is invalid in a scheme. In
theory this could scan the entire input, so I have not put any effort
into making this better.
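The backtracking described above can be sketched in a few lines (a toy model, not the library's code): scan for "scheme:", and if that fails, reset to the start and treat the input as a relative-ref.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// URI-reference = URI / relative-ref (toy sketch, not uri_reference_bnf)
enum class kind { uri, relative_ref };

// Scan a scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) then ':'
static bool starts_with_scheme(std::string_view s) {
    if (s.empty() || !std::isalpha((unsigned char)s[0]))
        return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        char c = s[i];
        if (c == ':')
            return true;           // found "scheme:" -- it's a URI
        if (!std::isalnum((unsigned char)c) && c != '+' && c != '-' && c != '.')
            return false;          // invalid scheme char: backtrack
    }
    return false;                  // ran off the end without a colon
}

kind classify(std::string_view s) {
    // Worst case, starts_with_scheme scans the whole input before
    // failing -- the linear cost mentioned above.
    return starts_with_scheme(s) ? kind::uri : kind::relative_ref;
}
```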

> Can you add a complexity guarantee to the
> docs? (I posted about this in the context of regular expression
> libraries a few weeks ago.)

Maybe... I did it for JSON because performance is critical there and
the difference between a poorly performing library and a great one is
massive. I'm not sure URL parsing is the same. URLs tend to be short,
and it is difficult to get really bad performance out of parsing them
even one character at a time. Unlike JSON, you don't need to convert
numbers between base 2 and base 10 (which is a headache). Mutations
always give the strong exception safety guarantee and perform only one
memmove; I will probably document that.

I can give you a better answer in the future when it is at the point
where benchmarks are explored.

> auto url = parse_url(str);
>
> The problem is that url is a view of the string parameter, but
> nothing warns the user about that and...

Well, it is documented profusely. I agree there is some danger here,
but with great danger comes great power. Or something about great
responsibility?

> ..now url is dangling. Am I right, or is there something that
> prevents this?

No, you are right; in the statement below `uv` references the string
and does not take ownership:

    url_view uv = parse_uri( "https://example.com/" ).value();

However, url is constructible from url_view, and url DOES make a copy
with ownership, so this statement also works and doesn't have the
dangling problem:

    url u = parse_uri( "https://example.com/" ).value();
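The ownership distinction is the same one std::string_view and std::string have, so it can be illustrated with standard types alone (an analogy, not Boost.URL code):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Analogy for url_view vs url: the view references the caller's
// buffer; the owning type makes a copy.
std::string make_input() { return "https://example.com/"; }

// Viewing a named object is fine; viewing a temporary dangles:
//   std::string_view bad = make_input();   // WRONG: temporary destroyed
std::string_view view_of(const std::string& s) { return s; }   // non-owning
std::string copy_of(std::string_view v) { return std::string(v); } // owning
```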

> I believe that functions that returns views should be named so
> that that is obvious, and in cases like this the function with the
> simplest name should be the safer version that returns a value-type.

These algorithms only return views, because there are a few different
owning URL containers and they can all be constructed from the views.
Overloading on return type doesn't really work, and some containers
may have template parameters or allocators, which makes APIs that
parse directly into them awkward. If we're debating the name, though,
it means we have accepted the general design of the library :)

> Also I note that parse_uri() returns a result<> and doesn't throw.
> My preference would be to name that e.g. try_parse_uri() and for the
> simpler-named function to be a wrapper that returns a plain uri
> or throws on error. My rationale is to make the simplest-named
> functions provide the easiest-to-use functionality, so that
> *teaching C++ to a Javascript programmer is not so embarrassing*,
> while retaining the more performant API variants for those who
> need them.

Well, the whole point of boost::system::result (the new type that
Boost.URL uses) is to cut down on those extra names and overloads.
Your suggestion would keep the additional complexity of result and
also increase the number of overloads, which doesn't sound like a
clear-cut win to me... although this "result" return type is still
fairly new.

> (Is it parse_uri() or parse_url()? I thought you had decided on
> URL.)

I use the term URL generally, but I use the precise wording when
referring to grammar. "parse_uri" uses the URI grammar specified in
RFC 3986, modulo any errata fixes which I have yet to apply.

> "pct" doesn't immediately mean "percent" to me.

The grammar says "pct-encoded", and I followed that. I think there is
sufficient use of the abbreviation that it is justified, but I could
be convinced otherwise by new data.

> It doesn't look like your url type is comparable. I think it
> should be. The comparisons need to be careful about case-sensitivity.

> And it should be hashable.

You can track those issues here:

<https://github.com/CPPAlliance/url/issues/64>
<https://github.com/CPPAlliance/url/issues/65>

> I recently had to implement request signing for AWS. This requires
> canonicalisation of URLs. There are some subtleties in percent
> encoding: you need to be able to control exactly which characters
> are escaped (I think you allow this); you need to be able to
> control whether + is used to escape space (do you?)

Yes and yes.

> and you need
> to be able to control whether the hex characters are upper or
> lower case (do you?).

But no to this... we could add it, though; there's a "pct_encode_opts" structure:

<https://master.url.cpp.al/url/ref/boost__urls__pct_encode_opts.html>
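For illustration, here is a sketch of what such an option could look like. Everything here is hypothetical: "encode_opts" and "lower_case_hex" are invented names, not fields of pct_encode_opts.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical option controlling the case of escape-sequence hex
// digits; pct_encode_opts has no such field today.
struct encode_opts {
    bool lower_case_hex = false;   // emit %2f instead of %2F
};

// Minimal encoder: escapes everything except unreserved characters.
std::string pct_encode(std::string_view s, encode_opts opt = {}) {
    const char* hex = opt.lower_case_hex ? "0123456789abcdef"
                                         : "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            out += (char)c;
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0xF];
        }
    }
    return out;
}
```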

> This code also needed to sort the query parameters by key; that
> raises the question of whether the query is a sequence or an
> associative container of key-value pairs, and if an associative
> container, what the comparison function is.

The library treats the query parameters as a ForwardRange (url_view)
or RandomAccessRange (url) of key/value pairs, allowing duplicate
keys. The functions find/find_next can be used to visit all values
with the same key. If you are using the "encoded" params container
(obtained by calling encoded_params()), comparisons are bitwise. If
you are using the "plain" params container (obtained by calling
params()), which decodes everything for you, the comparison uses
notional percent-decoding: your decoded key is compared against each
container key as if that key had percent-decoding applied.
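The notional-decoding lookup can be sketched like this (illustrative only; find_param and this pct_decode are not the library's implementation): the query is a sequence of pairs, duplicates allowed, and a plain key matches a stored key as if that key were decoded.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Minimal percent-decoder used only for key comparison.
static std::string pct_decode(std::string_view s) {
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size()) {
            out += (char)(hex(s[i + 1]) * 16 + hex(s[i + 2]));
            i += 2;
        } else {
            out += s[i];
        }
    }
    return out;
}

using param = std::pair<std::string, std::string>;   // encoded key, value

// Find the first param whose *decoded* key equals `key`.
const param* find_param(const std::vector<param>& q, std::string_view key) {
    for (const auto& p : q)
        if (pct_decode(p.first) == key)
            return &p;
    return nullptr;
}
```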

> I would hope that the path would be similar to a std::filesystem::path
> i.e. it should use the same vocabulary where possible (or could
> even actually be a std::filesystem::path, on C++17).

OOF, I hope never to see this. std::filesystem is kind of a mess.
These are the path containers:

https://master.url.cpp.al/url/ref/boost__urls__segments.html
https://master.url.cpp.al/url/ref/boost__urls__segments_encoded.html
https://master.url.cpp.al/url/ref/boost__urls__segments_view.html
https://master.url.cpp.al/url/ref/boost__urls__segments_encoded_view.html

"view" containers are read-only and reference the character buffer
without ownership, these containers are obtained by calling functions
on url_view. The other containers allow modification (the iterators
are still read-only, because a proxy-assignment implementation proved
to be clumsy) and these containers are obtained by calling functions
on url.

> Conversion between std::filesystem::path and file: URLs would be
> nice to have.

Yeah, something like that is possible, falling back to
boost::filesystem on C++11.
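A rough sketch of the POSIX-path direction of that conversion (illustrative; to_file_url is an invented name, not a library API): percent-encode each byte that isn't valid in a path segment and prepend the file scheme.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical POSIX path -> file: URL conversion sketch.
std::string to_file_url(std::string_view path) {
    const char* hex = "0123456789ABCDEF";
    std::string out = "file://";
    for (unsigned char c : path) {
        if (std::isalnum(c) || c == '/' || c == '-' || c == '.' ||
            c == '_' || c == '~') {
            out += (char)c;        // unreserved or segment separator
        } else {
            out += '%';            // percent-encode everything else
            out += hex[c >> 4];
            out += hex[c & 0xF];
        }
    }
    return out;
}
```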

> In your table 1.7, "Return the type of host specified, if any."
> should I think say "Return the host, if any."

Hmm, yeah: the function "host_type()" is missing from that list (it
returns the type of host).

> Your percent-decoding methods return strings. I'm a bit surprised
> that you've not tried to return lazy decoding views, at least as an
> option.

Now that is a very interesting idea! I didn't think of it. But it's
hard to see the use case; what would you do with a lazy view other
than unroll it into a linear buffer?
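One answer might be allocation-free comparison: a lazy decoder can compare a percent-encoded component against a plain string on the fly, without materializing the decoded text. A sketch (not library code; decoded_equals is an invented name):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Compare a pct-encoded string against a plain string, decoding
// lazily as we go -- no buffer is ever allocated.
bool decoded_equals(std::string_view encoded, std::string_view plain) {
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    std::size_t j = 0;   // position in plain
    for (std::size_t i = 0; i < encoded.size(); ++i, ++j) {
        char c = encoded[i];
        if (c == '%' && i + 2 < encoded.size()) {
            c = (char)(hex(encoded[i + 1]) * 16 + hex(encoded[i + 2]));
            i += 2;      // consumed the two hex digits
        }
        if (j >= plain.size() || c != plain[j])
            return false;
    }
    return j == plain.size();   // both strings fully consumed
}
```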

Thanks


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk