Subject: [boost] Push/pull parsers & coroutines (Was: Boost.HTTPKit, a new library from the makers of Beast!)
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2017-10-13 18:59:35


Dear All,

This is related to the ongoing discussion of the Beast HTTP parser.
I have been thinking in general about how best to implement parser
APIs in modern and future C++. Specifically, I've been wondering
whether the imminent arrival of low-overhead coroutines ought to
change best practice for this sort of interface.

In the past, I have found that there is a trade-off between parser
implementation complexity and client code complexity. A "push" parser,
which invokes client callbacks as tokens are processed, is easier to
implement but harder to use as the client has to track its state
between callbacks with e.g. an explicit FSM. On the other hand, a
"pull" parser (possibly using an iterator interface) is easier for
the client, but now the parser may need the explicit state tracking
instead.
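
To make the trade-off concrete, here is a minimal sketch of both styles
for a toy "key:value\n" grammar. Everything in it (token, push_parse,
pull_parser, collect_push, collect_pull) is a hypothetical name invented
for illustration, not taken from Beast or any other library, and the
input is an in-memory string to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical token type shared by both sketches.
enum class token_kind { key, value };
struct token { token_kind kind; std::string text; };

// Push style: the parser is a plain loop, and its position in the grammar
// is implicit in the control flow. It calls the client for each token.
void push_parse(const std::string& in,
                const std::function<void(const token&)>& on_token)
{
  std::size_t pos = 0;
  while (pos < in.size()) {
    std::size_t colon = in.find(':', pos);
    if (colon == std::string::npos) return;       // no further "key:" to report
    on_token(token{token_kind::key, in.substr(pos, colon - pos)});
    std::size_t nl = in.find('\n', colon + 1);
    if (nl == std::string::npos) nl = in.size();  // last line without '\n'
    on_token(token{token_kind::value, in.substr(colon + 1, nl - colon - 1)});
    pos = nl + 1;
  }
}

// Client of the push parser: it must carry state between callbacks
// (here, the most recent key), i.e. a very small explicit FSM.
std::map<std::string, std::string> collect_push(const std::string& in)
{
  std::map<std::string, std::string> headers;
  std::string pending_key;
  push_parse(in, [&](const token& t) {
    if (t.kind == token_kind::key) pending_key = t.text;
    else headers.emplace(pending_key, t.text);
  });
  return headers;
}

// Pull style: the client asks for the next token, so the parser has to
// remember between calls where it was and what it expects next.
class pull_parser {
public:
  explicit pull_parser(std::string in) : in_(std::move(in)) {}

  std::optional<token> next()
  {
    if (expect_ == token_kind::key) {
      if (pos_ >= in_.size()) return std::nullopt;
      std::size_t colon = in_.find(':', pos_);
      if (colon == std::string::npos) { pos_ = in_.size(); return std::nullopt; }
      token t{token_kind::key, in_.substr(pos_, colon - pos_)};
      pos_ = colon + 1;
      expect_ = token_kind::value;
      return t;
    }
    std::size_t nl = in_.find('\n', pos_);
    if (nl == std::string::npos) nl = in_.size();
    token t{token_kind::value, in_.substr(pos_, nl - pos_)};
    pos_ = nl + 1;
    expect_ = token_kind::key;
    return t;
  }

private:
  std::string in_;
  std::size_t pos_ = 0;                      // explicit parser-side state
  token_kind expect_ = token_kind::key;      // explicit parser-side state
};

// Client of the pull parser: straight-line code, no state to carry.
std::map<std::string, std::string> collect_pull(const std::string& in)
{
  std::map<std::string, std::string> headers;
  pull_parser p(in);
  while (auto k = p.next()) {
    auto v = p.next();
    assert(v.has_value());  // in this sketch a key is always followed by a value
    headers.emplace(k->text, v->text);
  }
  return headers;
}
```

Both collect_push and collect_pull should build the same map from the
same input. The difference is only where the bookkeeping lives:
pending_key in the push client, pos_ and expect_ in the pull parser.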

Now, with stackless coroutines due "real soon now", we can avoid
needing explicit state on either side. In the parser we can
co_yield tokens as they are processed and in the client we can
consume them using input iterators. The use of coroutines doesn't
need to be explicit in the API; the parser can be documented as
returning a range<T>, while actually returning a generator<T>.

Here's a very very rough sketch of what I have in mind, for the case
of HTTP header parsing; note that I don't even have a compiler that
supports coroutines yet so this is far from real code:

generator<char> read_input(int fd)
{
  char buf[4096];
  while (1) {
    ssize_t r = ::read(fd,buf,sizeof buf);
    if (r <= 0) co_return;  // EOF or error; a coroutine must use co_return
    for (int i = 0; i < r; ++i) {
      co_yield buf[i];
    }
  }
}

template <typename INPUT_RANGE>
generator< pair<string,string> > parse_header_lines(INPUT_RANGE input)
{
  auto i = input.begin();
  auto e = input.end();
  while (i != e) {
    // The iterators are single-pass input iterators, so std::find
    // followed by string(i,j) would not work. Instead, copy each
    // character into the string while checking for the delimiter.
    string k, v;
    for (; i != e && *i != ':'; ++i) k += *i;
    if (i != e) ++i;  // skip ':'
    for (; i != e && *i != '\n'; ++i) v += *i;
    if (i != e) ++i;  // skip '\n'
    co_yield pair(k,v);
  }
}

void parse_http_headers(int fd)
{
  map<string,string> headers;
  auto g = parse_header_lines( read_input(fd) );
  for (auto h: g) {
    headers.insert(h);
  }
}

An "exercise for the reader" is to extend that to something that will
parse headers followed by a body.

Questions: how efficient is this in practice? Is this really simpler to
write than a non-coroutine version? Will all of our code use this style
in the (near?) future? How should we be writing code now so that it is
compatible with this style in the future?

Thanks for reading,

Phil.

