Boost logo

Boost :

From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2021-10-13 14:09:48


Hi Vinnie,

Vinnie Falco wrote:
> Well, I was on a pause but I picked Boost.URL back up and whipped it
> into shape.

Thanks for posting that. Assorted comments follow.

Back in 2005-ish I decided to write a URL parser based on the RFC BNF
using Boost.Spirit. At that time the most recent RFCs were RFC2616
(1999) and RFC2396 (1998). I did a fairly direct translation of the
BNF to Spirit and ... it didn't work. It turned out that the BNF in
the RFCs was wrong, and had been for 6 years at that point. So I
advise that you are careful about directly using the RFC BNF.

Have you looked at the RFC 3986 errata at
https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0
? For example, I note that your rule for relative-part on the
"Parsing" page doesn't seem to include the path-abempty alternative
added in erratum 5428.

I would be interested to see how compile time, run time, and
lines-of-code compare between your implementation, a Spirit parser,
and a regexp. You also have the choice of (a) parsing and storing
pointers to each component as I believe you do, or (b) parsing to
validate but storing only the complete string, and doing a partial
parse when necessary to extract components. (b) could be better if
you're mostly just passing around complete URLs, storing them in
containers, etc.

There's also the question of worst-case complexity; does your parser
backtrack? If so, is there a pathological input that causes it to
take exponential time? Can you add a complexity guarantee to the
docs? (I posted about this in the context of regular expression
libraries a few weeks ago.)

This worries me:

auto url = parse_url(str);

The problem is that url is a view of the string parameter, but
nothing warns the user about that and...

str = something_else;

..now url is dangling. Am I right, or is there something that
prevents this?

I believe that functions that returns views should be named so
that that is obvious, and in cases like this the function with the
simplest name should be the safer version that returns a value-type.

(There was a similar discussion for fixed_string, which originally
proposed to return a view from its substr() method.)

Also I note that parse_uri() returns a result<> and doesn't throw.
My preference would be to name that e.g. try_parse_uri() and for the
simpler-named function to be a wrapper that returns a plain uri
or throws on error. My rationale is to make the simplest-named
functions provide the easiest-to-use functionality, so that
*teaching C++ to a Javascript programmer is not so embarrassing*,
while retaining the more performant API variants for those who
need them.

(Is it parse_uri() or parse_url()? I thought you had decided on
URL.)

"pct" doesn't immediately mean "percent" to me. I'd prefer
percent_encode() to pct_encode(). If that's too long for you,
truncate "encode" to "enc". Or maybe I would call this "escaping"
rather than "encoding". Actually I think the RFC terminology
might be "URL encoding"; I don't know if I prefer that but if
you want "strict adherence to the RFCs"...

It doesn't look like your url type is comparable. I think it
should be. The comparisons need to be careful about case-sensitivity.
And it should be hashable.

I recently had to implement request signing for AWS. This requires
canonicalisation of URLs. There are some subtleties in percent
encoding: you need to be able to control exactly which characters
are escaped (I think you allow this); you need to be able to
control whether + is used to escape space (do you?) and you need
to be able to control whether the hex characters are upper or
lower case (do you?).

This code also needed to sort the query parameters by key; that
raises the question of whether the query is a sequence or an
associative container of key-value pairs, and if an associative
container, what the comparison function is.

Actually you might like to have a look at the AWS SDK URI class:
http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_http_1_1_u_r_i.html
One thing that I believe it does is to set port to a default
based on the scheme, if it's not specified in the url string.
I'd say that's a helpful thing to do. Are there any other URI
classes that you could compare/contrast with?

I would hope that the path would be similar to a std::filesystem::path
i.e. it should use the same vocabulary where possible (or could
even actually be a std::filesystem::path, on C++17).

Conversion between std::filesystem::path and file: URLs would be
nice to have.

In your table 1.7, "Return the type of host specified, if any."
should I think say "Return the host, if any."

Your percent-decoding methods return strings. I'm a bit surprised
that you've not tried to return lazy decoding views, at least as an
option.

Anyway that's enough ramblings for now....

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk