Boost logo

Boost :

From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2024-02-22 08:17:11


On Wed, Feb 21, 2024 at 12:27 PM Phil Endecott via Boost
<boost_at_[hidden]> wrote:
>
> Boost.Parser Review
> ===================

[snip]

> - There's an ugly MACRO to define rules! That doesn't look very
> 2024 to me.

Addressed below.

> Test Case
> ---------

[snip]

Thanks for sharing the code, and marking it up as you have. It was
really helpful to have this.

> //const auto degrees_symbol = bp::no_case[ bp::lit("degrees") | bp::lit("deg") | bp::lit('d') ];
> //const auto minutes_symbol = bp::no_case[ bp::lit('\'') | bp::lit("minutes") | bp::lit("min") | bp::lit('m') ];
> //const auto seconds_symbol = bp::no_case[ bp::lit('"') | bp::lit("seconds") | bp::lit("sec") | bp::lit('s') ];
> // FIXME those rules don't work and I don't know why. They work if the symbol is
> // the first alternative; it it's any of the other alternatives the double returned
> // by e.g. decimal_degrees below is always zero.
> // Use these instead:
> const auto degrees_symbol = bp::lit('d');
> const auto minutes_symbol = bp::lit('m');
> const auto seconds_symbol = bp::lit('s'); // Beware, 's' can also mean south.

Yes, this is indeed a bug! The sequence parser was wiping out the
value of a previously-successful parser's attribute, when the next
parser had an internal failure (like not matching "degrees", but then
matching 'd'). This is now fixed on the boost_review_changes branch.
This is code I was messing about with right before the review; sorry
about that.

> const auto uint_0_60_def = bp::uint_ [( [](auto ctx) { _pass(ctx) = _attr(ctx) < 60U; _val(ctx) = _attr(ctx); } )];
> const auto double_0_60_def = bp::double_ [( [](auto ctx) { _pass(ctx) = _attr(ctx) < 60; _val(ctx) = _attr(ctx); } )];
>
> const auto decimal_degrees = bp::double_ >> -degrees_symbol;
>
> const auto degrees_decimal_minutes_def = (bp::uint_ >> -degrees_symbol
> >> double_0_60 >> -minutes_symbol) [( [](auto ctx) {
> auto d = _attr(ctx)[0_c];
> auto m = _attr(ctx)[1_c];
> _val(ctx) = d + m/60.0;
> } )];
> // FIXME this rule matches e.g. 150.5 as 150 .5; we want that input to
> // be matched by the decimal degrees rule as 150.5. In the qi code, which
> // has the same fundamental problem, it is avoided because the double_
> // parser is configured to reject input with a leading decimal point.
> // It would be more correct to require that either a degrees symbol or
> // whitepace or both between the degrees and minutes numbers. That's
> // a bit complicated because the whitespace is handled by the skipper.
> // In a bottom-up parser I think the ambiguity is resolved by putting
> // decimal_degrees first in degrees_def below. That doesn't work for a
> // top-down parser.
> // FIXME for similar reasons, "150d 0.5m n" fails to parse, because
> // the degrees_minutes_seconds rule matches 150,0,.5 and then the next
> // rule up fails to parse the 'm'.

This is really interesting! I did not know that Spirit X3 had made
this change, but the real/float parser I use is just the one from
Spirit X3. And sure enough, it's policies class starts with this:

    template <typename T>
    struct ureal_policies
    {
        // trailing dot policy suggested by Gustavo Guerra
        static bool const allow_leading_dot = true;
        static bool const allow_trailing_dot = true;
        static bool const expect_dot = false;

So, not only does it allow a leading dot, but it's a constant; you'd
have to inherit from this policies type and hide the value with a new
one in the derived class.

Again, I don't know why this was done, but I can understand the desire
to match .6, 0.6, and 6. with the same parser. I also see how in your
case the 0.6-only match has more natural semantics. I was able to
achieve the same goal by changing your parser not to match on double
if the input starts with '.'. So this:

> const auto degrees_minutes_seconds_def = (bp::uint_ >> -degrees_symbol
> >> uint_0_60 >> -minutes_symbol
> >> double_0_60 >> -seconds_symbol) [( [](auto ctx) {
> auto d = _attr(ctx)[0_c];
> auto m = _attr(ctx)[1_c];
> auto s = _attr(ctx)[2_c];
> _val(ctx) = d + m/60.0 + s/3600.0;
> } )];

Becomes this:

> const auto degrees_minutes_seconds_def = (bp::uint_ >> -degrees_symbol
> >> uint_0_60 >> -minutes_symbol
> >> (double_0_60 - '.') >> -seconds_symbol) [( [](auto ctx) {
> auto d = _attr(ctx)[0_c];
> auto m = _attr(ctx)[1_c];
> auto s = _attr(ctx)[2_c];
> _val(ctx) = d + m/60.0 + s/3600.0;
> } )];

That fixed the parse.

[snip]

> const auto signed_latitude_def = signed_degrees [( [](auto ctx) { auto d = _attr(ctx); _pass(ctx) = -90 <= d && d <= 90; } )];
> const auto signed_longitude_def = signed_degrees [( [](auto ctx) { auto d = _attr(ctx); _pass(ctx) = -180 <= d && d <= 180; } )];

One thing I noticed here -- you forgot the "_val(ctx) = _attr(ctx);"
statement, and so this path of the parse would match, but produce 0.

With those three changes (the fix, the " - '.'", and this missing
assignment), everything seems to be working as expected. I've turned
your example into a test:

https://github.com/tzlaine/parser/blob/boost_review_changes/test/parse_coords_new.cpp

> BOOST_PARSER_DEFINE_RULES(uint_0_60);
> BOOST_PARSER_DEFINE_RULES(double_0_60);
> BOOST_PARSER_DEFINE_RULES(degrees_decimal_minutes);
> BOOST_PARSER_DEFINE_RULES(degrees_minutes_seconds);
> BOOST_PARSER_DEFINE_RULES(degrees);
> BOOST_PARSER_DEFINE_RULES(latitude);
> BOOST_PARSER_DEFINE_RULES(longitude);
> BOOST_PARSER_DEFINE_RULES(signed_latitude);
> BOOST_PARSER_DEFINE_RULES(signed_longitude);
> BOOST_PARSER_DEFINE_RULES(latlon);

This could be replaced with a single use of BOOST_PARSER_DEFINE_RULES, FWIW.

[snip]

> That Ugly Macro
> ---------------
>
> bp::rule<struct latitude, double> latitude = "latitude";
> const auto latitude_def = (degrees >> northsouth) .... ;
> BOOST_PARSER_DEFINE_RULES(latitude);
>
> Is a macro really the best way to do this? Spirit seems to
> manage without.

About that. This revision to the way that Spirit 2 did rules is not
actually my invention. It comes from Spirit X3:

https://live.boost.org/doc/libs/1_68_0/libs/spirit/doc/x3/html/spirit_x3/tutorials/minimal.html#spirit_x3.tutorials.minimal.parser_definitions

I liked the way it was done there better than the Spirit 2 way, so I
opted for the Spirit 3 way. You can't get recursive rules, or even
mutually recursive rules, without a way to forward declare parsers
somehow. Rules are what does that. In Spirit 2, they did this by
including their attribute type as part of their type, and then using
type erasure to avoid requiring the type of the parser to be part of
the rule's type as well. In Spirit X3, a rule is instead a function
template. I like this better as we don't need to build memory
allocation into the rule to get decoupling. I dislike it because the
function template is only subtly different from rule to rule, and so
that leads naturally to the macro we have. I think the macro is the
lesser of these evils.

> Despite having a macro this is still very verbose. I've had to
> write the name of the rule FIVE times in three lines. Once there
> is a macro you might as well put everything inside it:
>
> BOOST_PARSER_DEFINE_RULE(latitude, double, (degrees >> northsouth) ....);
>
> Of course if you need to declare the rules before defining them
> due to recursion or other forward references you will need separate
> macros for declaration and definition.

Yeah, that's what Spriit X3 does. There are DECLARE and DEFINE
macros. I was trying to simplify. You're not the first one to ask
for a macro that reduces the repetition needed for a single
rule/parser pair. If enough others agree, I'll add this. I
personally tend not to use macros like that much, or when I do, to
make ad hoc ones that I locally #undef after use. So the lack of the
DECLARE/DEFINE pair is just my personal taste.

> These rule declarations seem to need to be at global scope
> (maybe namespace scope?). Is that fundamental? Can't I put them
> in a class or a function?

The macro defines a pair of function templates for each rule you pass
to it; this is mentioned in the docs. The implication is that the
macro should be used at namespace scope. As long as the tag type you
use is in the same namespace where you write the macro, ADL will do
the rest.

> Semantic Actions
> ----------------
>
> In Bison I can write semantic actions like this:
>
> { $$ = $1 + $2/60.0; }
>
> After expanding the $ variables that gets copied verbatim into
> the bison output, so it can include any code.
>
> Spirit clearly wanted to replicate that sort of functionality,
> but it pre-dates lambdas, so it includes a lot of amazing
> functionality that allows me to write semantic actions like:
>
> [ _val = _1 + _2/60.0 ]
>
> This is nice and concise but its limitation is of course that I
> cannot write arbitrary code in there.
>
> (BTW, are we strictly allowed to use identifiers starting with _
> like _val?)

Are you asking about the Spirit 2 use of Phoenix? There is a naming
convention there and in Spirit X3, and in Boost.Parser. In the latter
case though, it's just a lambda, so you can name it whatever you like.

> What should this look like now that we have lambdas in the
> language? Of course it's going to be a bit more verbose because
> of the syntax overhead, but I was hoping that it would be
> something like this:
>
> [ [](auto deg, auto min) { return deg + min/60.0; } ]

This is the biggest thing I miss from Spirit 2 code vs. Spirit X3
code. However, I don't really miss dealing with the verbosity of
Phoenix in the non-trivial cases. Lambdas are a win on balance I
find.

> (Aside: That doesn't actually work because [[ can only appear in
> attributes even if there is a space, so it needs to be written
> [( [] ... )]. I did wonder if changing from operator[] to
> operator() might make sense?)

Sure. C++. I don't like the idea of changing the operator to
operator(), because 1) everyone used to Spirit is used to operator[],
and 2) operator() is already valid on many parsers and directives.
Having to deal with a new, anything-goes operator() would be fraught
at best.

> That's longer than before, but I have given meaningful names to
> the parameters (which I can also do in Bison BTW) and I can
> include any code.
>
> But that's not the syntax that the current proposal needs. It
> needs this:
>
> [( [](auto ctx) { _val(ctx) = _attr(ctx)[0_c] + _attr(ctx)[1_c]; } )]

Yep. You could also write:

auto [a, b] = _attr(ctx);
_val(ctx) = a + b;

... if you were using std::tuple. Neither of those is as pithy as
_val = _1 + _2. However, as soon as you do anything significantly
more complicated than that, lambdas are better.

> Frankly I think that combines the worst of all the other options
> and adds some extra new verbosity of its own.
>
> One other advantage of having the lambda return a result could
> be improving how the semantic type (aka attribute type) of a
> rule is set. For example, if I have the rule
>
> degrees >> minutes
>
> where degrees and minutes return doubles, the semantic type of
> the combined rule is something like tuple<double,double>. But
> I'm going to add a semantic action that adds the two values as
> shown above, returning one double. If the return value of the
> semantic action's lambda were used as the semantic type of the
> rule then we could write:
>
> auto rule = (degrees >> minutes)
> [( [](auto d, auto m) { return d + m/60.0; } )];
>
> and rule would return a double to its parent rule. As it is, the
> semantic action doesn't change the type and we have to
> explicitly and verbosely declare the rule as returning a double.

I had not thought of this before, but I really like it. I'll have to
think pretty deeply about whether there is some reason this might not
work, but I'm definitely going to try to implement this. I'm thinking
that the rule would be: if std::apply(f, tup) is well-formed (where f
is your invocable, and tup is your tuple of values returned by
_attr()), and decltype(std::apply(f, tup)) is assignable to _val(),
then do the semantic action as _val(ctx) = std::apply(f, _attr(ctx));
otherwise, the semantic action gets called with the current context..
There would be a special case when _attr() is not a tuple, but you get
the idea. Ticket to track it:
https://github.com/tzlaine/parser/issues/106

> Of course the context includes more than just the child rule
> semantic values and a slot for the result; for example it
> includes the _pass boolean that I use to clamp the range of
> values in my example:
>
> [( [](auto ctx) { _pass(ctx) = _attr(ctx)[0_c] <= 90; } )];
>
> I suggest that either a special lambda signature is used to
> indicate that it is a predicate, or an additional "directive"
> e.g. predicate[ ... ] is added.

This idea I don't like as much. I think the previous idea makes
things simpler in simpler cases, and something like _pass(ctx) should
just use the ctx version of the semantic action.

> Compilation Times and Errors
> ----------------------------
>
> Compilation times for my example were:
>
> Boost.Parser: 28.6 s
> Boost.Spirit: 30.1 s

I get something pretty similar, except that it's about 5 seconds for each.

> but the grammar is not quite identical, so the small difference
> is not significant. Code size is somewhat larger for Parser than
> for Spirit.
>
> (g++ 12.2.0, -O3, Linux, aarch64.)
>
> I did not particularly notice improved error messages compared
> to Spirit, though they weren't worse. Consider for example this
> code from my example:
>
> return parse(s, input, bp::ws|bp::lit(','));
>
> Originally I had this:
>
> auto skipper = bp::ws | bp::lit(',');
> return parse(s, input, skipper);

Oof. Yeah, that's a constraint problem. It's pretty easily fixed;
see https://github.com/tzlaine/parser/issues/105 . If you make
skipper const, the problem goes away. The problem is that the
overload getting selected is the one that takes an out-param, so it's
trying to assign your result to skipper.

> But that failed to compile, and I got 15 screenfuls of
> inpenetrable error messages. I fixed it more-or-less by randomly
> changing things until it worked. If I now change it back I get
> only two screenfuls that accurately say that I can't return a
> bool from a function that's supposed to return an optional, but
> I still don't know why the return type of parse() changes when I
> move the declaration of the skipper.
>
> So I fear that the idea that compile times and error verbosity
> are improved compared to Spirit are not borne out.

Well, its a mixed bag. The kind of error you ran into is the worst
kind. When you try to use an out-param for your parser's result that
does not work, the errors are very messy. It sort of has to be this
way though, because the loose-matching of parser attribute to
out-param works becasue we defer any kind of checking until the last
possible moment -- which is always dozens of instantion stack frames
in.

Where the library improves over Spiriit is when you write many kinds
of errors in your own semantic action lambdas, especially when the
error is some kind of call to ctx, like _attr(ctx). That can often be
turned into a runtime error that is very quickly diagnosed.

> Bugs
> ----
>
> I found a couple of bugs while writing this review which Zach
> has resolved. I suspect that this code has not had enough varied
> use to flush out all the issues.

This well might be. There are a lot of tests, but the combinatorics
make it hard to catch all the corners.

> Other Issues
> ------------
>
> The double_ parser is not configurable in the way that the
> Spirit double_ parser is. This is problematic; in my example
> 3e6n means "3 east 6 north", not "3000000 north", so I need a
> double_ parser that doesn't recognise exponents. I'd also like
> to ignore values with leading .s, i.e. 10.1 should not match
> int_ >> double_ as "10 0.1", and I really don't want a hacker to
> sneak a nan or inf into my code if it recognises those.

Addressed above.

> The docs are not bad and are reminiscent of the Spirit docs.
> Personally I would prefer to see an example early on that takes
> a chunk of real BNF and translates it into a parser.
>
> I had issues with Apple Clang (15.0.0); it would not compile
> "#include <boost/parser.hpp>" until I turned off concepts. Has
> anyone tried with other Clangs? If this is not fixable maybe the
> library should detect this and use non-concepts mode
> automatically.

This appears to be a Clang bug, or very nearly related to one; see
https://github.com/tzlaine/parser/issues/89 . Andrzej has since said
that some change I made seems to have fixed/side-stepped the problem.
I don't have great access to Clang, so I did not verify this myself.

> Compiling with -Wall resulted in lots of unused parameter
> warnings. These can be fixed by commenting-out the parameter
> name, leaving just the type. Boost libraries should compile
> without warnings, some people use -Werror.

I usually do compile with -Wall. Perhaps it got turned off at some
point. Will fix: https://github.com/tzlaine/parser/issues/107

> Conclusion
> ----------
>
> I was hoping to find a library that was an improvement on
> Spirit, but I regret to say that that's not what I've found
> here.
>
> The only notable improvement compared to Spirit.Qi that I've
> found is that no_case[] is propogated to nested rules, which
> isn't the case with Qi.
>
> Rules themselves are essentially the same as Spirit. The method
> for defining rules is more verbose and now needs an ugly macro.
> Semantic actions are less readable.
>
> It is possible that there are significant advantages compared to
> Spirit that I have missed, as a result of having used working
> Spirit code as my starting point, but I have read the doc's
> comparison with Spirit page.
>
> My conclusion is that this library should not be accepted.
>
> If others believe it should be accepted then I feel that should
> be conditional on many of the issues I've mentioned above being
> resolved, e.g. unused parameter warnings, configuration of the
> double_ parser, etc.

[snip]

Thanks for spending the time, and for your thoughtful review, Phil.

Zach


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk