From: Vinnie Falco (vinnie.falco_at_[hidden])
Date: 2022-08-23 16:42:34


On Tue, Aug 23, 2022 at 8:44 AM Zach Laine <whatwasthataddress_at_[hidden]> wrote:
> Ok, I'm convinced.

The URL classes are kind of weird in the sense that they have three parts:

1. the getter/setter functions for the singular pieces of the URL
2. the container-like interface for the segments
3. the container-like interface for the params

Because each URL effectively exposes two different containers/ranges,
they are provided as separate types. It wouldn't make sense to have
url::begin() and url::end().
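
In code, the three parts look roughly like this (a sketch; exact
type and member names aside):

    url u( "https://www.example.com/path/to/file.txt?id=42" );

    u.set_host( "example.org" );  // 1. getters/setters for single pieces
    auto us = u.segments();       // 2. container-like range over the path
    auto ps = u.params();         // 3. container-like range over the query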

Just so that we are on the same page, and anyone who is reading this
now or in the future has clarity, when you write

    url u( "http://www.example.com/path/to/file.txt" );

    segments us = u.segments();

The value `us` models a lazy, modifiable BidirectionalRange which
references the underlying `url`. That is to say, when you invoke
modifiers on `us`, such as:

    us.pop_back();

it is the underlying `url` (or `static_url`) which changes. `segments`
is a lightweight type which has the semantics of a reference. If you
were to, say, make a copy of `us`, you are just getting another
reference to the same underlying url. A `segments` cannot be
constructed by itself.
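
For instance (a sketch; printing the url relies on its stream
insertion operator):

    url u( "http://www.example.com/path/to/file.txt" );

    segments us = u.segments();
    us.pop_back();            // modifies u itself, not a copy

    std::cout << u << "\n";   // "http://www.example.com/path/to"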

When we say that the range is lazy, this means that the increment and
decrement operations of its iterators execute in linear time rather
than constant time. And of course there is no random access. The
laziness refers to the fact that incrementing a path iterator requires
finding the next slash ('/') character in the underlying URL.
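
For example, a simple traversal (continuing the sketch above) walks
the serialized string:

    for( auto seg : u.segments() )
        std::cout << seg << "\n"; // each ++ scans ahead to the next '/'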

> I am still not convinced that the containers that maintain these invariants
> should be lazy. That still seems weird to me. If they own the data, and
> are regular types, they should probably be eager.

Here is where I am a little lost. When you say "the containers that
maintain these invariants" are you referring to `segments` or `url`?
Because `segments` does not actually implement any of the business
logic required to modify the path. All of that is delegated to private
implementation details of `url` (or more correctly: `url_base`).

Perhaps when you say "if they own the data", the term "they" refers to
the `url`? Even in that case, laziness is required, because this
library only stores URLs in their serialized form. There were earlier
designs which did it differently but it became apparent very quickly
that the tradeoffs were not favorable.

I'm not quite sure what "eager" means in this context.

> The larger issue to me is that they have a subset of the expected STL API.

Right, well we implemented as much STL-conforming API as possible
under the constraint that the URL is stored in its serialized form.
Really I consider myself more of an explorer than a designer:
once we made the design choice that the URL would be stored
serialized, the rest of the API design and implementation was an
exercise in discovering the consequences of that choice and how
close to the standard containers we could make the interfaces.

Matching the STL API would require giving up one or more things that
we currently have. This is possible, but we leave it up to the user to
make this decision (by copying the data into a new std container).
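
For example, a user who wants an eager, owning container can copy the
segments out (a sketch, assuming each segment value converts to
std::string):

    std::vector< std::string > v;
    for( auto seg : u.segments() )
        v.push_back( std::string( seg ) ); // eager copy, owns its data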

> That's kind of my point. You have a parsing minilib that is useful
> for URL parsing, but not *general use*. If that's the case, I think
> you should present it as that, and not a general use parsing lib.

Yes, you are right about this; it is not for general use. It is
specifically designed for implementing the ABNF grammars found in
protocol-related RFCs such as rfc3986 (which defines the URL grammar
used in Boost.URL), non-well-known schemes, HTTP messages, HTTP
fields, and WebSocket fields. For example, consider this grammar
(from rfc7230):

    Transfer-Encoding  = 1#transfer-coding
    transfer-coding    = "chunked" ; Section 4.1
                       / "compress" ; Section 4.2.1
                       / "deflate" ; Section 4.2.2
                       / "gzip" ; Section 4.2.3
                       / transfer-extension
    transfer-extension = token *( OWS ";" OWS transfer-parameter )
    transfer-parameter = token BWS "=" BWS ( token / quoted-string )

A downstream library like not-yet-proposed-for-boost.HTTP.Proto could
use this minilib thusly:

    constexpr auto transfer_encoding_rule = list_rule( transfer_coding_rule, 1 );

<https://github.com/CPPAlliance/http_proto/blob/f2382d8eab8be2e9d6e6e14c5502d90ccf55e95f/include/boost/http_proto/rfc/transfer_encoding_rule.hpp#L117>

There's a lot going on here behind the scenes. HTTP defines the
list-rule, which is a comma-separated sequence of elements where, for
legacy reasons, you can have extra unnecessary commas and whitespace
anywhere between the elements. In ABNF the list-rule is denoted by the
hash character in the Transfer-Encoding grammar above ( "one or more
of transfer-coding" ).

Boost.URL provides the lazy range which allows the downstream library
to express the list-rule as a ForwardRange of transfer_coding. This
allows the caller to iterate the list elements in the
Transfer-Encoding value without allocating memory for each element.
There is a recurring theme here - I use lazy ranges to defer memory
allocation :) This goes back to Beast which offers a crude and quite
frankly inelegant set of lazy parsing primitives. I took that concept
and formalized it in Boost.URL and used the principle to let users
opt-in to interpreting the path and query as ranges of segments and
params respectively.
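
To make that concrete, downstream usage might look something like this
(a sketch only: I'm assuming grammar::parse here, that the parsed
value is or exposes a ForwardRange of transfer_coding, and handle() is
just a placeholder):

    auto rv = grammar::parse( "chunked, gzip", transfer_encoding_rule );
    if( rv )
        for( auto const& tc : *rv )
            handle( tc ); // each increment parses only as far as the
                          // next element; no per-element allocation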

Now this is a downstream library so you might wonder what this has to
do with URLs. Well, Boost.URL is designed to handle ALL URLs. This
includes the well-known hierarchical schemes like http and file but
also opaque schemes, of which there are uncountably many as often
these schemes are private or unpublished. However, the library can't
possibly know how to decompose URLs into the parts defined by these
schemes. In order to do this, a user has to write a parsing component
which understands the scheme.

We will use the mailto scheme as an example. First let me point out
that ALL URLs which use the mailto scheme are still URLs. They follow
the generic syntax, and Boost.URL is capable of parsing them with no
problem, since it can parse all URLs no matter the scheme.

But users who want to do a deep-dive into the mailto scheme can't be
satisfied merely with parsing a mailto URL. They want it decomposed,
and obviously Boost.URL can't do that in the general case because
every scheme is different. Here is the syntax of a mailto URI:

      mailtoURI = "mailto:" [ to ] [ hfields ]
      to = addr-spec *("," addr-spec )
      hfields = "?" hfield *( "&" hfield )
      hfield = hfname "=" hfvalue
      hfname = *qchar
      hfvalue = *qchar
      addr-spec = local-part "@" domain
      local-part = dot-atom-text / quoted-string
      domain = dot-atom-text / "[" *dtext-no-obs "]"
      dtext-no-obs = %d33-90 / ; Printable US-ASCII
                     %d94-126 ; characters not including
                               ; "[", "]", or "\"
      qchar = unreserved / pct-encoded / some-delims
      some-delims = "!" / "$" / "'" / "(" / ")" / "*"
                   / "+" / "," / ";" / ":" / "@"

To begin, a user might write this function:

    result< url_view > parse_mailto_uri( string_view s );

This is easy to implement at first because all mailto URIs are URLs.
We might start with this:

    result< url_view >
    parse_mailto_uri( string_view s )
    {
        auto rv = parse_uri( s );
        if( ! rv )
            return rv.error();
        if( ! grammar::ci_is_equal( rv->scheme(), "mailto" ) )
            return error::scheme_mismatch;
        ...
        return *rv;
    }
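
Usage would then look something like this (hypothetical input):

    auto r = parse_mailto_uri( "mailto:someone@example.com" );
    if( r )
        assert( r->scheme() == "mailto" );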

This is a good start but it is unsatisfying, because we are getting
the "to" fields in the path part of the URL, and Boost.URL doesn't
know how to split up the recipients of the mailto since they are comma
separated and not slash separated. Remember though, that this is still
a valid URL and that Boost.URL can represent it.

So now we want to implement this grammar:

    to = addr-spec *("," addr-spec )

If you expand the addr-spec ABNF rule you will see that it has
unreserved character sets, percent-encoding possibilities, and quoted
strings. I can't get into all this here (perhaps it would make a good
example for a contributor) but you might start like this:

    constexpr auto addr_spec_rule = grammar::tuple_rule(
        local_part_rule, grammar::squelch( grammar::delim_rule( '@' ) ), domain_rule );

then you would continue to define each of those rules, and eventually
you would be able to 1. validate that a particular mailto URL matches
the scheme, and 2. decompose the elements of the mailto URL based on
the requirements of the scheme itself.
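
For instance, the "to" rule itself might be expressed as a lazy range
(a sketch, assuming the addr_spec_rule above; grammar::range_rule
takes a rule for the first element and a rule for each subsequent
element, plus a minimum count):

    constexpr auto to_rule = grammar::range_rule(
        addr_spec_rule,
        grammar::tuple_rule(
            grammar::squelch( grammar::delim_rule( ',' ) ),
            addr_spec_rule ),
        1 );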

The idea here is to incubate grammar/ in URL, as more downstream
libraries get field experience with it, and then propose it as its own
library. I'm hoping to see people implement custom schemes, but maybe
that's wishful thinking.

Phew...

Thanks

