Subject: Re: [boost] Push/pull parsers & coroutines
From: Vinnie Falco (vinnie.falco_at_[hidden])
Date: 2017-10-14 19:54:14


On Sat, Oct 14, 2017 at 12:03 PM, Phil Endecott via Boost
<boost_at_[hidden]> wrote:
> The
> issue of generator<T> providing only input iterators is the most
> significant issue I've spotted so far. This is in some way related
> to the whole ASIO "buffer sequence" thing; the code I posted before
> read into contiguous buffers, but that was lost before the downstream
> code saw it, so it couldn't hope to optimise with e.g. word-sized
> copies or compares.

Buffer sequences are not the problem. The problem is that parsed HTTP
data types are heterogeneous. For example, the series of types generated
when parsing a request looks like this:

1. std::pair<verb, string>: verb enum (if known) and method string
2. string: request-target string
3. integer: HTTP-version
4. vector<tuple<field, string, string>>: field name enum (if known), name, value
5. vector<string>: body data
OR
5. vector<pair<string, string>>: body data plus chunk-extension

An interface which presents parsed data through a function return
value (for example, an iterator's operator*) is only capable of
yielding one type. The only way to use the same control flow and
produce different types is to do two things: inform the caller of the
type of the next incoming object, and then provide a set of functions
from which the caller chooses the correct one with the proper matching
return type for receiving the next value.

You can see this in the Boost.Http parser calling code:

        do {
            request_reader.next();
            switch (request_reader.code()) {
            case code::skip:
                // do nothing
                break;
            case code::method:
                method = request_reader.value<token::method>();
                break;
            case code::request_target:
                request_target = request_reader.value<token::request_target>();
                break;
            case code::version:
                version = request_reader.value<token::version>();
                break;
            case code::field_name:
                last_header = request_reader.value<token::field_name>();
                break;
            }
        } while(request_reader.code() != code::end_of_message);
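
To make the mechanism concrete, here is a minimal sketch of how a reader
can give value<Tag>() a different return type per tag. Everything in it
(toy_reader, current_code, the token tags, request_line) is a hypothetical
name invented for illustration. It is not how Boost.Http is implemented,
and it reads pre-split tokens rather than parsing anything.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical tags: each one names the type its token is read as.
namespace token {
struct method         { using type = std::string_view; };
struct request_target { using type = std::string_view; };
struct version        { using type = int; };
}

enum class code { method, request_target, version, end_of_message };

// Toy reader over three pre-split tokens. Not a real HTTP parser.
class toy_reader {
    std::string_view method_, target_;
    int version_;
    int step_ = -1;
public:
    toy_reader(std::string_view m, std::string_view t, int v)
        : method_(m), target_(t), version_(v) {}

    void next() { if (step_ < 3) ++step_; }

    // Step one: tell the caller what kind of token is current.
    code current_code() const {
        switch (step_) {
        case 0:  return code::method;
        case 1:  return code::request_target;
        case 2:  return code::version;
        default: return code::end_of_message;
        }
    }

    // Step two: the caller picks the accessor whose return type matches.
    template <class Tag>
    typename Tag::type value() const {
        if constexpr (std::is_same_v<Tag, token::method>)
            return method_;
        else if constexpr (std::is_same_v<Tag, token::request_target>)
            return target_;
        else
            return version_;
    }
};

struct request_line { std::string method, target; int version = 0; };

// Same calling structure as the loop above, one case per token kind.
inline request_line read_request_line(toy_reader r) {
    request_line out;
    do {
        r.next();
        switch (r.current_code()) {
        case code::method:
            out.method = std::string(r.value<token::method>());
            break;
        case code::request_target:
            out.target = std::string(r.value<token::request_target>());
            break;
        case code::version:
            out.version = r.value<token::version>();
            break;
        case code::end_of_message:
            break;
        }
    } while (r.current_code() != code::end_of_message);
    return out;
}
```

Nothing in this sketch stops the caller from asking for the wrong tag at
a given step, so keeping code and tag in agreement is left to the caller.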

A viable alternative, which does not preserve the same structure of
calling code, is to use a kind of "visitor". The parser calls a
user-defined function specific to the token it has just parsed, and the
argument list of that function has the correct high level types. This
is the approach used in Beast: the parser calls a particular member
function of the derived class depending on which structured element
was parsed.

For example, when Beast parses the request-line it invokes a member
function with this signature in the derived class:

    /// Called after receiving the request-line (isRequest == true).
    void
    on_request_impl(
        verb method, // The method verb, verb::unknown if no match
        string_view method_str, // The method as a string
        string_view target, // The request-target
        int version, // The HTTP-version
        error_code& ec); // The error returned to the caller, if any

Note the rich variety of types: `verb` is an enumeration of known HTTP methods:

<http://www.boost.org/doc/libs/master/libs/beast/doc/html/beast/ref/boost__beast__http__verb.html>

`method_str` is the exact method string extracted by the parser. This
is needed when the method does not match one of the method strings
known to the library, indicated by the enumeration value
`verb::unknown`.

`target` is a straightforward string, while `version` is conveyed as an integer.

Since the parser owns the control flow at the time the member function
is called, the `ec` output parameter allows the callee to indicate
that it wishes to break out of the parser's loop and return control to
the calling function.
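
A minimal sketch of this arrangement follows, under loud assumptions:
toy_parser, put, got_request and only_get are hypothetical names, the
verb enum is cut down to three values, and std::error_code and
std::string_view stand in for the Boost types to keep it self-contained.
It only handles a single request-line and is not Beast's basic_parser.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <system_error>

enum class verb { unknown, get, post };  // cut-down stand-in

inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Toy CRTP base: splits "METHOD SP target SP HTTP/x.y" and hands the
// typed pieces to the derived class.
template <class Derived>
class toy_parser {
    bool got_request_ = false;
public:
    bool got_request() const { return got_request_; }

    void put(std::string_view line, std::error_code& ec) {
        ec.clear();
        auto const sp1 = line.find(' ');
        auto const sp2 = line.rfind(' ');
        if (sp1 == std::string_view::npos || sp1 == sp2) {
            ec = std::make_error_code(std::errc::protocol_error);
            return;
        }
        auto const method_str = line.substr(0, sp1);
        auto const target = line.substr(sp1 + 1, sp2 - sp1 - 1);
        auto const ver = line.substr(sp2 + 1);
        if (ver.size() != 8 || ver.substr(0, 5) != "HTTP/" ||
            ver[6] != '.' || !is_digit(ver[5]) || !is_digit(ver[7])) {
            ec = std::make_error_code(std::errc::protocol_error);
            return;
        }
        int const version = (ver[5] - '0') * 10 + (ver[7] - '0');
        verb const method = method_str == "GET"  ? verb::get
                          : method_str == "POST" ? verb::post
                                                 : verb::unknown;
        // The parser owns the control flow here; the callee can only
        // influence it through ec.
        static_cast<Derived&>(*this).on_request_impl(
            method, method_str, target, version, ec);
        if (ec)
            return;  // callee asked to stop: hand control back
        got_request_ = true;
    }
};

// Hypothetical derived class that refuses anything but GET.
struct only_get : toy_parser<only_get> {
    std::string target;

    void on_request_impl(verb method, std::string_view /*method_str*/,
                         std::string_view tgt, int /*version*/,
                         std::error_code& ec) {
        if (method != verb::get) {
            ec = std::make_error_code(std::errc::operation_not_supported);
            return;
        }
        target = std::string(tgt);
    }
};
```

Setting ec inside on_request_impl is the only thing the callee does to
stop the parse; put() sees it and returns without marking the request
as received.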

After the request-line comes zero or more calls to a member function
with field name/value pairs. That member function signature looks like
this:

    /// Called after receiving a header field.
    void
    on_field_impl(
        field f, // The known-field enumeration constant
        string_view name, // The field name string.
        string_view value, // The field value
        error_code& ec); // The error returned to the caller, if any

Note how the collection of types presented for a header field is
different from the request-line. Expressing this irregular stream of
different types through an iterator interface is going to be very
clumsy. Furthermore, there is metadata generated during the parse
which is not easily reflected in an iterator interface.
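
To illustrate the clumsiness, here is one way an iterator-style
interface could carry these elements: a single value type that is a
variant over every kind of event. This is purely my own illustration
for this message, not something Beast or the code posted earlier does,
and all the type names in it are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical event types, one per kind of parsed element.
struct request_line_event { std::string method, target; int version; };
struct field_event        { std::string name, value; };
struct body_event         { std::string data; };

// The one type an iterator's operator* would have to yield.
using parse_event =
    std::variant<request_line_event, field_event, body_event>;

// Every consumer has to dispatch on the alternative for every element,
// even when it only cares about one kind.
inline int count_fields(std::vector<parse_event> const& events) {
    int n = 0;
    for (auto const& ev : events) {
        std::visit([&](auto const& e) {
            using T = std::decay_t<decltype(e)>;
            if constexpr (std::is_same_v<T, field_event>)
                ++n;
        }, ev);
    }
    return n;
}
```

Even in this form there is no obvious slot for per-message metadata
that is computed once the headers are complete.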

For example, after the HTTP headers have been parsed, Beast calculates
the "keep-alive" semantic as well as the disposition of the
Content-Length, which may be in three states: body-to-eof, chunked, or
known. The keep-alive semantics are communicated to the caller of the
parser through a member function `basic_parser::is_keep_alive`:

<http://www.boost.org/doc/libs/master/libs/beast/doc/html/beast/ref/boost__beast__http__basic_parser/is_keep_alive.html>

I described in a previous post how Beast's parser exposes two
interfaces. The public interface is consumed by stream algorithms
(e.g. read_some, async_read_some) while the derived class interface is
used to store structured HTTP elements. The function `is_keep_alive` is
exposed through the public interface of the parser because it is
primarily of interest to the stream algorithm, since the stream
algorithm concerns itself with the connection and whether or not it
should be closed afterwards.

Meanwhile, the Content-Length disposition is exposed to the derived
class since it is a piece of metadata of interest to the algorithm
which stores the body in the message container. It is communicated by
the parser through a call to this derived class member:

    /// Called just before processing the body, if a body exists.
    void
    on_body_init_impl(
        boost::optional<
            std::uint64_t> const&
                content_length, // Content length if known, else `boost::none`
        error_code& ec); // The error returned to the caller, if any
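
As a rough sketch of why the derived class wants this value, a body
store might use it to reserve space up front, or to refuse an oversized
body through `ec`. The class toy_body_store and its limit member are
hypothetical, and std::optional and std::error_code stand in for the
Boost types so the fragment compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical body store, not a Beast type.
struct toy_body_store {
    std::string body;
    std::uint64_t limit = 1024;  // hypothetical cap on accepted bodies

    void on_body_init_impl(
        std::optional<std::uint64_t> const& content_length,
        std::error_code& ec)
    {
        if (!content_length)
            return;  // chunked or body-to-eof: size not known up front
        if (*content_length > limit) {
            // Report through ec so the parser stops and returns.
            ec = std::make_error_code(std::errc::message_size);
            return;
        }
        body.reserve(static_cast<std::size_t>(*content_length));
    }
};
```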

There is so much type irregularity in the information presented during
the parse that I feel an iterator based approach would be, to use
informal terms, "quite ugly."

Thanks

