
From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2024-02-29 04:37:29

On Wed, Feb 28, 2024 at 9:18 PM Alan de Freitas via Boost
<boost_at_[hidden]> wrote:
> Boost.Parser Review


> > Does the API match with current best practices?
> Using the same conventions as Spirit to combine parsers seems like a good
> design here.
> I was a little confused about the convention to assume unicode for anything
> other than char.
> I'm not sure that's the convention that minimizes surprise.

The rationale is this: all the other character types in C++ -- every
one of them -- either advertise that they hold a certain Unicode
encoding (char{8,16,32}_t) or are specifically designed to hold
Unicode code points (wchar_t). char is the only exception.

> I would expect the same convention (unicode or not) to apply to any data
> type because:
> - It's one more thing to remember.
> - I often don't want to interpret things as code points regardless of
> whether it's char or unsigned char.
> - I noticed many people also don't care about unicode most of the time in
> their use cases.
> - It also seems to be the case that the relevant keywords in most grammars
> are ASCII.
> - The documentation says "the Unicode parsing API will not actually cost
> you anything" but two sentences later it says "one caveat is that there may
> be an extra branch on each char, if the input is UTF-8"
> At least, the library documentation could make it very clear right at the
> start how the user can explicitly specify if they want to enable/disable
> unicode support for whatever data type they wish.

For the performance part, I would be very surprised to see a
measurable difference between chars and chars + as_utf32. The reason
is that parsing is *super* branchy, and the branch on each char here
is "if (c < 128)". The branch predictor will guess right after seeing
that a few times.

As for turning off Unicode for charN_t and wchar_t, do you have a
compelling use case for that? It seems very unlikely to me.

> I find the macros to define rules a little unsettling.
> - Isn't it possible to at least document what things would look like
> without a macro so that readers can have a good idea of what the macro is
> doing whenever they use it?

I've already added this to the mid-review branch, but each rule you
give to the macro expands to this:

#define BOOST_PARSER_DEFINE_IMPL(_, rule_name_) \
    template< \
        bool UseCallbacks, \
        typename Iter, \
        typename Sentinel, \
        typename Context, \
        typename SkipParser> \
    auto parse_rule( \
        decltype(rule_name_)::parser_type::tag_type *, \
        std::bool_constant<UseCallbacks> use_cbs, \
        Iter & first, \
        Sentinel last, \
        Context const & context, \
        SkipParser const & skip, \
        boost::parser::detail::flags flags, \
        bool & success) \
    { \
        auto const & parser = BOOST_PARSER_PP_CAT(rule_name_, _def); \
        return parser(use_cbs, first, last, context, skip, flags, success); \
    } \
    template< \
        bool UseCallbacks, \
        typename Iter, \
        typename Sentinel, \
        typename Context, \
        typename SkipParser, \
        typename Attribute> \
    void parse_rule( \
        decltype(rule_name_)::parser_type::tag_type *, \
        std::bool_constant<UseCallbacks> use_cbs, \
        Iter & first, \
        Sentinel last, \
        Context const & context, \
        SkipParser const & skip, \
        boost::parser::detail::flags flags, \
        bool & success, \
        Attribute & retval) \
    { \
        auto const & parser = BOOST_PARSER_PP_CAT(rule_name_, _def); \
        using attr_t = decltype(parser( \
            use_cbs, first, last, context, skip, flags, success)); \
        if constexpr (boost::parser::detail::is_nope_v<attr_t>) { \
            parser(use_cbs, first, last, context, skip, flags, success); \
        } else { \
            parser( \
                use_cbs, first, last, context, skip, flags, success, retval); \
        } \
    }

This will now be in the docs, and you can just not use the macro and
do... that. I think most people will prefer the macro, but I'm also
open to alternatives.

> - A single macro to define the rules could also be useful.
> - It would also be useful if the cases where macros are not as useful
> (non-recursive rules?) could be specified in the documentation.
> About Hana:
> - I don't like optional dependencies in general, so I also don't like the
> idea of supporting std::tuple and hana tuples separately.
> - Since the library is boost/standalone and tuple/hana, then that's already
> four combinations that are unlikely to be tested for all relevant compilers
> and environments.
> - That's more possibilities of conflicts for users compiling different
> versions of the library.
> - I especially don't like the idea of using another type for tuples when we
> already have a vocabulary type from the standard.
> - I feel like whatever Hana provides can be achieved with std::tuple and
> maybe some other library like mp11 in extreme cases.
> - It's probably possible to support Hana as a peer-dependency where the
> rule would fill any tuple-like object according to some concept (the rules
> can already do that for containers, variants and optionals if I'm not
> wrong).
> - Including all of <boost/hana.hpp> can increase compile times
> significantly.
> Maybe I'm missing something and Hana enables something great but I don't
> see it. If it were that great and important, it wouldn't be an _optional_
> dependency.

hana::tuple has one great feature that std::tuple does not:
operator[]. It is *so* much nicer to use than std::get<>() that it
makes it worth all the trouble. I have been using Parser to write
parsers for a few years now, and I'm telling you, don't sleep on op[].

> Maybe the parsers for numbers could use Boost.Charconv under the hood.
> I don't know if that's feasible for all cases.

It is feasible. It's already done on my branch.

> In our applications, we often have charsets defined by functions that
> return whether a char is reserved in a grammar.
> I couldn't find a good way to express that with Boost.Parser (although
> operator|() combining charsets are often enough).
> If that's correct, it would be good to have some alternatives here.
> It would also be good to have rules for very common charsets if that
> doesn't already exist.

I don't really know what you mean by "reserved" here, but the parsers
are all pretty compact, and are all copyable and assignable. If you
want to return one from a function, you can do so. You just need to
make sure that either: 1) you construct it with one or two characters,
like char_('a') or char_('a', 'g'); or 2) if you construct it with a
range, like char_("abcdef"), the range you use outlives the parser.

> > Is the documentation helpful and clear?
> Yes. The documentation is clear. It's driven by goals and examples, which
> is great.
> Knowing Boost.Spirit helped me, so it might be useful to give more weight
> to feedback from people who don't know Boost.Spirit yet.
> I found the section on "Writing Your Own Parsers" very limited considering
> that we know "some people are obsessed with writing everything for
> themselves".
> Very often, we have ad hoc rules that are hard to express as a combination
> of other rules.
> At other times, as a library evolves, custom rules can make things less
> generic and more efficient even if they could be expressed with the more
> generic rules.
> For instance, many examples in the documentation would be more easily
> represented as functions in my opinion.
> So I find this insufficient: "It is a basic orientation to get you familiar
> enough with all the moving parts of writing a parser that you can then
> learn by reading the Boost.Parser code" because the documentation should be
> exposing all the concepts and documenting everything.
> Telling such a usual user to read the code to reverse engineer how it
> should be done is no good. Especially because the number of concepts and
> requirements involved is probably not much larger than what's already
> documented.

I don't really expect many users to do this, and it was never a
design goal to make it easy. I only added that section, somewhat
begrudgingly, because Andrzej was hounding me about it. :) If I
were to do it properly, as a first-order design element, I'd probably
redesign it a bit. Right now, I don't intend to. Maybe this could be
future work, if people start writing a lot of custom parsers.

> Are there small examples of binary parsers somewhere?

No, there are no binary parsers.

> The section "The Parse Context" is a little confusing.
> Unlike other sections, it's not goal-oriented with lots of examples.

This is an excellent point; I'll revise that section. Ticket:

> One thing I'm also missing is some cheat sheets with all the symbols
> provided by the library related to what a given section is discussing.

This is also available on my branch.

> > Did you try to use it? What problems or surprises did you encounter?
> Yes. It worked fine with MSVC over here but CI needs to improve because
> people have reported failures with other compiler versions.
> It's not a reason to reject the library but it's something that needs to be
> fixed and causes a bad first impression.
> CI should also include coverage, more warnings, sanitizers, and probably
> fuzz tests.

I intend to add coverage; I do not intend to turn on warnings past
-Wall (which I always use); and I always build with ASan enabled. I'm
intrigued by the idea of fuzzing, but I haven't figured out exactly
what to fuzz. I could fuzz the input to the parse() overloads, but
I'd really like to also fuzz combinations of parsers. I'm all ears if
anyone can figure out how to do that. In the earliest versions of the
library, I actually randomly generated parsers to catch problems.

> I believe Sam can also help with Drone if that's needed.

I'm very interested in covering more platforms. I'll reach out to Sam
about setting up Drone.

Thanks for your review!

