
From: Christian Mazakas (christian.mazakas_at_[hidden])
Date: 2024-02-22 22:35:41


plaintext dump as requested:

I've been a huge fan of Spirit for many years now.

One of my favorite Spirit memories was during two separate job interviews.

The first time, I was interviewing with a game company, and on my resume I
noted that I had written a simple HTTP server that performed geospatial
querying, building a list of cities within a given radius of a provided
zip code. Here's the link:
<https://github.com/cmazakas/location-service>.

I remember that I had used Spirit for parsing a CSV and the interviewer was
genuinely
concerned about me using such a library for such a seemingly small task. It
was the main
focus of the interview and I think represented a cultural divide. I did not
get the job.

The second time, I used it as a litmus test for what writing C++ at the
company would be like. They saw me use Spirit and still hired me anyway,
which I took to be a great sign. This time, I got the job.

## Boost.Parser

Cutting to the chase, I vote we ACCEPT Parser.

To test the library, I've been using a small test project here:
https://github.com/cmazakas/parser-review

My criterion for acceptance was that I'd be able to write an HTTP/1 parser
with feature parity to what is currently being done in the proposed
Boost.Http.Proto library. I feel that I was able to achieve this goal, and
I stopped there because I no longer felt like the library itself would be
a blocker.

Now, even though I'm voting ACCEPT, I'm going to spend most of the review
criticizing the library. To me, the value-add of a type-safe parser
combinator library is obvious, so I won't spend too much time here.
Basically, testable, reusable parsers are good, and even better when they
compose elegantly.

### The Feedback

The differences between Spirit.X3 and Parser need to be made much more
clear by the
documentation, by which I mean:
> * Spirit X3 has rules that do not compose well — the attributes produced
>   by a rule can change depending on the context in which you use the rule.
> * Spirit X3 is missing many of the convenient interfaces to parsers that
>   Spirit 2 had. For instance, you cannot add parameters to a parser.
> * All versions of Spirit have Unicode support, but it is quite difficult
>   to get working.

These need code samples showing the limitations of X3 and how Parser
actually solves them. I think many, myself included, aren't entirely clear
on what these differences are, how Parser addresses them, and _why_ that
is a good thing.

I have attempted X3's Unicode support before, though, and I will say that
Parser's unified interface is much better. For example, I wrote an entire
RFC-compliant URL parser in X3, including Unicode support, for my library
Foxy
<https://github.com/cmazakas/foxy/blob/7f4ac0495ad2ed9cd0eca5994743d677ac1d2636/include/foxy/uri.hpp#L353-L356>.
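To illustrate what I mean by a unified interface, a sketch, assuming the
`as_utf32` range adaptor works as documented (the grammar here is just a
placeholder):

    // Sketch: the same parser handles Unicode input just by adapting the
    // input range; matching then happens on code points.
    namespace parser = boost::parser;
    auto const ident = +parser::char_('a', 'z');
    auto r1 = parser::parse("abc", ident);                    // chars
    auto r2 = parser::parse("abc" | parser::as_utf32, ident); // code points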

---
Not a criticism, but Parser's directives are a dramatic improvement over
X3's. By these, I'm thinking of `boost::parser::string_view[]`,
`boost::parser::separate[]` and `boost::parser::merge[]`. These should be
highlighted in the comparison vs X3 as I think they add a good bit of
substantive value.
More praise for the library includes its parse API, which is a dramatic
improvement over X3's in terms of ergonomics. Personally, I can see myself
mainly using `prefix_parse`, similarly to X3's original parse(), but I
think the value of `parse()` itself is obvious and doesn't warrant much
explanation.
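For reference, a sketch of the usage I have in mind, assuming the
documented `prefix_parse` overload that advances the iterator on a
partial match:

    // Sketch: prefix_parse consumes a prefix of the input and advances
    // `first`, much like X3's parse(), rather than requiring a full match.
    auto input = std::string_view("42 rest");
    auto first = input.begin();
    auto result =
        parser::prefix_parse(first, input.end(), parser::uint_, parser::ws);
    // result should hold 42, with `first` left pointing at "rest".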
---
The library generates a dramatic number of warnings when compiled with
`-Wall -Wextra -pedantic`. I had originally forgotten to
`include_directories()` Parser as a `SYSTEM` directory and was met with
172 warnings. I think being clean under `-Wall -Wextra -Werror` is a
reasonable criterion for acceptance.
---
When writing the parser for the HTTP version line, I ran into a lot of
issues separating the major version
from the minor version when using:
    parser::digit >> "." >> parser::digit
It was only when lazily browsing the docs that I found the `separate[]`
directive, which I think indicates that the library's documentation has
poor discoverability.
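For anyone else who misses it in the docs: my understanding is that
`separate[]` exists precisely to keep the two `digit` attributes from
merging into a single `std::string`, roughly like so:

    // Sketch, assuming separate[] works as documented: the digits stay
    // as two distinct attributes instead of merging into one string.
    auto const version_digits =
        parser::separate[parser::digit >> "." >> parser::digit];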
Ultimately, it was cleaner for me to just do:
    // HTTP-version = HTTP-name "/" DIGIT "." DIGIT
    // HTTP-name = %x48.54.54.50 ; HTTP
    auto const on_version = [](auto &ctx) {
      auto const &tup = _attr(ctx);
      auto const major = tup[0_c];
      auto const minor = tup[1_c];
      if (major > 9 || minor > 9) {
        _pass(ctx) = false;
        return;
      }
      response::metadata &md = _globals(ctx).md_;
      md.version_.major_ = major;
      md.version_.minor_ = minor;
    };
    auto const http_name = parser::lit("HTTP");
    auto const http_version =
        (http_name >> "/" >> parser::uint_ >> "." >> parser::uint_)[on_version];
---
The documentation doesn't exactly give users any guidelines about when
they should formally define rules with attributes vs when they should use
semantic actions.
This is one of those things where if users don't already have the intuition
built up, using the
library is an exercise in frustration.
Namely, attributes don't compose well when you start nesting rules and want
to have different out
params at different levels of the parsers. I suppose this should be obvious
because the rules themselves
form a hierarchy so the attributes _must_ mirror that.
I found that the best thing to do for my use-case was to abuse
`_globals(ctx)` and semantic actions.
I think a good rule of thumb for users can be expressed as something like:
use rule attributes for simple grammars where all you really want is the
`std::vector<double>`. For something complex with domain-defined
validation rules, prefer `_globals(ctx)` and semantic actions for
validating everything.
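As a concrete sketch of the simple end of that spectrum (the grammar is a
toy, but the attribute-filling `parse()` overload is the documented one):

    // Sketch: for a simple grammar, let the synthesized attribute do all
    // the work; no rules, no semantic actions, no _globals(ctx).
    std::vector<double> doubles;
    bool const ok = parser::parse(
        "1.5, 2.5, 3.5", parser::double_ % ',', parser::ws, doubles);
    // on success, doubles == {1.5, 2.5, 3.5}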
In general, Parser's semantic actions are a huge improvement over Spirit's.
---
I felt like I came up with working but unsatisfactory solutions to parsing
characters in a specific range.
Namely, I just wrote a simple semantic action around `+parser::char_[f]`
but this made me feel like I was
writing unidiomatic code. I'd appreciate some feedback here:
    // reason-phrase = 1*( HTAB / SP / VCHAR / obs-text )
    // VCHAR         = %x21-7E
    auto const on_reason_phrase = [](auto &ctx) {
      auto ch = _attr(ctx);
      if (ch != '\t' && ch != ' ' && (ch < '\x21' || ch > '\x7e')) {
        _pass(ctx) = false;
        return;
      }
      response::metadata &md = _globals(ctx).md_;
      std::ranges::subrange<char const *> sub = _where(ctx);
      if (!md.reason_phrase_begin_) {
        md.reason_phrase_begin_ = sub.begin();
      }
      md.reason_phrase_end_ = sub.end();
      md.status_line_end_ = sub.end();
    };
    auto const reason_phrase =
        parser::omit[+parser::char_[on_reason_phrase]];
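Writing this up, I suspect the more idiomatic spelling is a
character-range parser, assuming `char_(lo, hi)` matches an inclusive
range, though this form drops the `_where(ctx)` bookkeeping from above:

    // Sketch: express VCHAR as a range instead of filtering inside a
    // semantic action (note: no position tracking here).
    auto const vchar = parser::char_('\x21', '\x7e');
    auto const reason_phrase_alt =
        parser::omit[+(parser::char_('\t') | parser::char_(' ') | vchar)];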
---
There should be a debugging utility for discovering what the synthesized
attribute is for a rule. Perhaps
this could exist as a semantic action that attempts to instantiate an
incomplete template.
    template <class> struct diagnostic;
    auto const debug = [](auto &ctx) { diagnostic<decltype(_attr(ctx))> d; };
It took me a long time to realize what the attribute type for `digit >> "."
>> digit` was (it was `std::string`).
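Attaching the action to the parser in question then surfaces the type the
moment the parser is instantiated by a parse() call, e.g.:

    // Sketch: any parse() call using p now fails to compile, and the
    // error names the synthesized attribute type, something like
    //   'implicit instantiation of undefined template
    //    diagnostic<std::string &>' (exact wording varies by compiler).
    auto const p = (parser::digit >> "." >> parser::digit)[debug];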
---
> Dependencies are still a nightmare in C++, so Boost.Parser can be used
> as a purely standalone library, independent of Boost.
This was funny for me to read because I had a vcpkg port working for my
test project in about two minutes.
---
The library seems to really emphasize Boost.Hana's `tuple` type, but I
think in practice this winds up being a mistake unless there are
substantive compile-time savings from using Hana here.
Namely, Hana's `tuple` doesn't support structured bindings. Because Parser
is a C++17 library, I consider this a somewhat large defect of the
interface. What's worse, Hana's literals aren't compatible with Parser's.
A huge boon for Hana and its `tuple` is that it supports indexing via
`tup[0_c]`, but this doesn't work when Parser's literals are involved, as
Parser also defines a `_c`, so an ambiguity is introduced and compilation
fails.
Between this and the lack of structured-binding support, I think relying
on Hana here is a strict downgrade unless there are compelling reasons I'm
simply unaware of.
https://github.com/boostorg/hana/milestone/4
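To make the structured-bindings point concrete (a sketch; the milestone
linked above tracks this on the Hana side):

    // Sketch: hana::tuple lacks the std::tuple_size/std::get machinery
    // that structured bindings require; std::tuple works fine.
    auto ht = boost::hana::make_tuple(1, 2.0);
    // auto [a, b] = ht;  // does not compile today
    auto st = std::tuple(1, 2.0);
    auto [a, b] = st;     // fine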
---
Stuff like `_globals` in the docs would 404 for me. I also think the docs
are kind of just incomplete in general. Because there's not much
cross-referencing, discoverability suffers, so a user essentially has to
read the entire docs front-to-back to understand the library, which is a
large ask for a parsing library.

### Conclusion
Overall, I think that even with all of the problems I found in the library,
it's still quite good and worthy
of being brought into Boost as the successor for Spirit. I don't think
anything I found was a deal-breaker and
I think in a few iterations of Boost, this library can be something quite
great.
- Christian
