From: Vinícius dos Santos Oliveira (vini.ipsmaker_at_[hidden])
Date: 2020-09-17 19:30:00


I'm gonna use a perhaps unusual text structure for this review. Consuming
text is a tiring activity that demands energy. Having that in mind, I
ignored the order in which the library is presented through the docs and
moved all trivial concepts discussion to the end. That's my attempt to make
sure the message gets through while the reader still has it -- the energy.

## A short introduction to pull parsers vs push parsers

The act of parsing an incoming stream will generate events from a
predefined set. Users match the generated events against rules to trigger
actions. For the following JSON sample:

```json
{ "foo" : 42 , "bar" : 33 }
```

One such event stream could be:

```
begin_object -> string_key -> number -> string_key -> number -> end_object
```

The essential difference between pull parsers and push parsers is how this
event stream is delivered. A push parser (a.k.a. a SAX parser) uses the
callback-handler model. A pull parser (a.k.a. a StAX parser) stores the
current parsing state and allows the user to query this state.

Essentially every push parser has an implicit pull parser that doesn't get
exposed to the user. This is to say that a push parser would be implemented
along the lines of:

```cpp
// parsing loop
while (it != end) {
    match_token();
    dispatch_token(user_handlers);
}
```

A pull parser, on the other hand, doesn't have the loop construct above,
just its body; the loop is under user control. That's why pull parsers are
the more fundamental model. A push parser can be built on top of a pull
parser, but not the inverse.
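
To make the relationship concrete in miniature, here is a self-contained toy
sketch (the `event`, `pull_parser` and `push_parse` names are invented for
illustration, not any particular library's API) of a push interface layered
on top of a pull interface:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy event set matching the stream shown above.
enum class event { begin_object, string_key, number, end_object };

// Toy pull parser: the consumer owns the loop and queries the current event.
class pull_parser {
public:
    explicit pull_parser(std::vector<event> events)
        : events_(std::move(events)) {}

    bool done() const { return pos_ == events_.size(); }
    event current() const { return events_[pos_]; }
    void next() { ++pos_; }

private:
    std::vector<event> events_;
    std::size_t pos_ = 0;
};

// Push parser layered on top: the loop moves into the library and each
// event is dispatched to a user-supplied callback.
template<class Handler>
void push_parse(pull_parser& parser, Handler& handler)
{
    for (; !parser.done(); parser.next()) {
        switch (parser.current()) {
        case event::begin_object: handler.on_object_begin(); break;
        case event::string_key:   handler.on_key();          break;
        case event::number:       handler.on_number();       break;
        case event::end_object:   handler.on_object_end();   break;
        }
    }
}
```

Going the other way around would require the push parser to return control
to the caller between callbacks, which the callback model doesn't provide.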

Concrete examples will be shown later in this text.

## Performance: a common misconception

Some push parsing proponents believe pull parsers are slower. This belief
is unsubstantiated. As we have seen in the previous section, a push parser
is always built on top of an implicit pull parser. The only thing that
differentiates the two is that, under the pulling model, a small part of
the parser logic moves out of the interface to become user code. That's the
only difference.

One of the arguments put forward by push parser proponents is the ability
to merge the matching and decoding steps to perform less work. We can use
the previous sample to illustrate this idea:

```
{ "foo" : 42 , "bar" : 33 }
  ^^^^^
```

If we are to decode the `"foo"` token above, the parser first needs to
discover where the token begins and where it ends. This is the required
matching step that every parser needs to perform. JSON strings can contain
escape sequences, and one such example would be:

```json
"na\u00EFve"
```

The decoding step takes care of converting escaped string slices into
unescaped ones, so user code would receive 3 string runs while this token
is being decoded:

- "na"
- "ï"
- "ve"

The parser does not allocate any memory to decode any string. The slices
are fed to the user, who in turn is responsible for putting them together
(possibly allocating heap memory).

The idea here is... having separate match and decode steps forces the
parser to walk over the tokens twice. For Boost.JSON, the merged match and
decode steps materialize in the `on_string_part()` (and `on_string()`)
callback:

```cpp
bool on_string_part(string_view v, std::size_t, error_code& ec)
{
    current_str.append(v.data(), v.size());
    return true;
}
```

For a pull parser with _separate_ match and decode steps, the interface
would be along the lines of:

```cpp
reader.next();
if (reader.kind() == token::string)
    reader.string(current_str);
```

But as we've seen before, the only difference between pull and push parsers
lies elsewhere. Any features lateral to this classification originate
from different taxonomies. Just as a push parser can merge the matching and
decoding steps, so can a pull parser:

```cpp
// A new overload.
//
// `my_match_options` here would provide
// the string collector upfront.
reader.next(my_match_options);
```

Another small note here: just as merging the matching and decoding steps
can be beneficial performance-wise, so can skipping the whole decoding
step. I'll be touching on this very idea again in the section below.

Appendix A will present how another feature from Boost.JSON can be
implemented in pull parsers: strings split among different buffers.

## The case for pull parsers

So far we've only been exposed to the difference between push parsers and
pull parsers, but no case has been made for which choice is ideal (aside
from the naturally understood point that pull parsers are more fundamental
and push parsers can be built on top when desired).

The defining trait present in pull parsers that is impossible to replicate
in push parsers is their ability to compose algorithms. Composition is
enabled by temporarily handing over the token stream to a different
consumer. Not every consumer needs to know how to process every part of the
stream. That's a trait impossible to find in push parsers.

I have created a [gawk](https://www.gnu.org/software/gawk/) plug-in to
illustrate this property. The same plug-in was implemented several times
using different libraries/approaches. I incrementally added more features
under different branches (there are 8 branches in total). You can find a
rationale for the AWK choice in Appendix B at the end of this review.

The plug-in's code can be found at
<https://gitlab.com/vinipsmaker/gawk-jsonstream-archive>. Please forgive my
beginner code; I had no prior experience writing gawk plug-ins. This is my
first.

AWK is a C-like language (in syntax) where you define pattern-action
pairs. You can find a quick'n'dirty introduction to the language on
Wikipedia. This plug-in adds JSON capabilities to the tool. Given the
following data set:

```
{"name": "Beth", "pay_rate": 4.00, "hours_worked": 0}
{"name": "Dan", "pay_rate": 3.75, "hours_worked": 0}
{"name": "Kathy", "pay_rate": 4.00, "hours_worked": 10}
{"name": "Mark", "pay_rate": 5.00, "hours_worked": 20}
{"name": "Mary", "pay_rate": 5.50, "hours_worked": 22}
{"name": "Susie", "pay_rate": 4.25, "hours_worked": 18}
```

And the following AWK program (with this plug-in loaded):

```awk
BEGIN {
    JPAT[1] = "/name"
    JPAT[2] = "/pay_rate"
    JPAT[3] = "/hours_worked"
}
J[3] > 0 { print J[1], J[2] * J[3] }
```

The output will be:

```
Kathy 40
Mark 100
Mary 121
Susie 76.5
```

The paths for the JSON tree are given in [standard JSON
Pointer](https://tools.ietf.org/html/rfc6901) syntax. That's the plugin's
core and it's the code that doesn't change among branches, so it's the
appropriate place to begin the explanation.

In the file `jpat.ypp` you'll find a `read_jpat()` function that gets
called for each record. There's some boilerplate at the beginning to bail
out early if there were no changes to `JPAT`. Then there is this snippet
that gets processed by [re2c](https://re2c.org/) to generate valid C++
code:

```cpp
* {
    if (*begin != '\0')
        return bad_path();
    jpat[i].components.emplace_back(std::move(current_component));
    goto end;
}
{prefix} {
    jpat[i].components.emplace_back(std::move(current_component));
    current_component = std::string{};
    goto loop;
}
{unescaped} {
    current_component.append(begin, YYCURSOR - begin);
    goto loop;
}
{escaped_tilde} {
    current_component.push_back('~');
    goto loop;
}
{escaped_slash} {
    current_component.push_back('/');
    goto loop;
}
```

The definitions for the data structures follow:

```cpp
struct Path
{
    std::vector<std::string> components;
    awk_value_cookie_t cookie = nullptr;
    awk_string_t orig;
};

struct Tree: std::map<std::string, std::variant<Tree, size_t>>
{};

static std::vector<Path> jpat;
Tree tree;
```

By the end of the re2c-generated loop, the variable `jpat` will be
filled. Then there are some validation/sanitization steps, and we start to
fill the `tree` variable out of the extracted paths.

`tree` is a decision tree auxiliary to the process of parsing the JSON
documents. Leaf nodes represent which index in the `J` AWK variable will
receive that value. Given the following `JPAT`:

```awk
JPAT[1] = "/foo/bar"
JPAT[2] = "/foobar"
```

`tree` will have the following properties (pseudocode):

```cpp
// AWK arrays are 1-indexed, so the code
// translation has a 1-element offset
assert(tree["foo"]["bar"] == 0);
assert(tree["foobar"] == 1);
```
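
Building that decision tree amounts to splitting each JSON Pointer into its
components and walking/creating intermediate nodes. Here is a hypothetical
sketch of such an insertion (the real logic lives in `read_jpat()`, not
shown here; it assumes a non-empty, non-conflicting path):

```cpp
// Insert one parsed JPAT path into the decision tree defined above.
// `components` are the JSON Pointer reference tokens, `jpat_idx` is the
// 0-based index that the matching value will receive in `J`.
void insert_path(Tree& root, const std::vector<std::string>& components,
                 std::size_t jpat_idx)
{
    Tree* node = &root;
    for (std::size_t i = 0; i + 1 < components.size(); ++i)
        node = &std::get<Tree>((*node)[components[i]]); // create/descend
    (*node)[components.back()] = jpat_idx;              // leaf: J index
}
```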

The user may find other uses for a JSON decision tree in Appendix C. And
here is where the code branches diverge. So let's present an overview and
then walk them one-by-one (or two-by-two):

```
+------------+---------------+--------------+
|            | Boost.JSON    | pull parser  |
+------------+---------------+--------------+
| basic      | boostjson     | master       |
| dom        | boostjson-dom |              |
| transform  | boostjson2-b  | master2      |
| transform2 | boostjson3    | master3      |
+------------+---------------+--------------+
```

The `master` branch is just 506 lines of C++ code in total and `boostjson`
is 613 lines. The descriptions for the other branches follow:

- `dom`: built after `basic`. This branch is useful to highlight
  performance differences between the DOM parser and the rest.
- `transform`: `boostjson2-b` is built on top of the `boostjson-dom`
  branch. `master2` is built on top of the `master` branch. It adds the
  power to update the JSON document. Here we start to notice how the
  ability to mark certain points in the token stream is important for a
  JSON library.
- `transform2`: built on top of the previous branch, it adds the `JUNWIND`
  operation in the likes of MongoDB's `$unwind` operator.

The first thing I'll do is to compare the DOM parsing layer cost. [There's
this 161MiB JSON file that'll be
used](https://drive.google.com/file/d/1-lVxGYWvvmBsjrDVgju_lOTBaK03MdQc/view?usp=sharing).
Just as `jq` can be used to extract a field from this file:

```bash
jq -cM '.features[10000].properties.LOT_NUM' citylots2.json
```

So can AWK:

```bash
gawk -l jsonstream -e '
BEGIN { JPAT[1] = "/features/10000/properties/LOT_NUM" }
{ print J[1] }' citylots2.json
```

The execution time for the boostjson branch will be:

```
real 0m2,112s
user 0m1,476s
sys 0m0,618s
```

And the execution time for the boostjson-dom branch will be:

```
real 0m5,639s
user 0m4,170s
sys 0m1,418s
```

We should always strive to do more than just throw random numbers
around. Anyone can do that. The timings show that the DOM layer is
obviously much slower, but there is no surprise here. It's kind of
ridiculous trying to compare both branches. They do different things
(although the plug-in feature set is the same). The DOM approach will
always decode and save every matched token. And that's one of the
principles of delivering faster code: do less.

The reason why I've chosen a 161MiB JSON file was to make sure most of the
program's time would be spent not in the gawk VM but in the parsing
plug-in.

If we wanted to expose every value from the DOM tree, `json::value` would
also be useless, because the consumer here expects `awk_value_t` objects
(and the DOM interface would force us to pay the cost of unnecessarily
storing decoded results twice). We don't always control the
interface/protocol -- that's part of the C++ programmer's job. Would it be
reasonable to deliver code twice as slow to my client because "hey, I have
a nice DOM container... gotta use it"?

As Marcel Weiher has noted in [his call for "less lethargic JSON
support"](https://blog.metaobject.com/2020/04/somewhat-less-lethargic-json-support.html)
[sic]:

> Wait a second! 312MB/s is almost 10x slower than Daniel Lemire's parser,
> the one that actually parses JSON, and we are only creating the objects
> that would result if we were parsing, without doing any actual parsing.
>
> This is one of the many unintuitive aspects of parsing performance: the
> actual low-level, character-level parsing is generally the least
> important part for overall performance. Unless you do something crazy
> like use `NSScanner`. Don't use `NSScanner`. Please.

Let's dive into the code to see how they differ.

### `basic`

Every branch reads one JSON document per line as a record. That's the
`get_record()` function. For the `boostjson-dom` branch, the code goes
along the lines of:

```cpp
try {
    parsed_value.emplace_null();
    json_parser.reset(parsed_value.storage());
    std::error_code ec;
    json_parser.write(line.data(), line.size(), ec);
    if (ec)
        throw std::system_error{ec};

    json_parser.finish(ec);
    if (ec)
        throw std::system_error{ec};

    parsed_value = json_parser.release(ec);
    assert(!ec);
    switch (parsed_value.kind()) {
    case json::kind::array:
        fill_j(parsed_value.get_array(), tree);
        break;
    case json::kind::object:
        fill_j(parsed_value.get_object(), tree);
        break;
    // ...
    }
} catch (const std::exception& e) {
    warning(ext_id, "Error while processing record: %s", e.what());
}
```

The code parses the whole JSON document at once and then proceeds to
consume it by calling the `fill_j()` function:

```cpp
// Several overloads:
static void fill_j(const json::array& arr, Tree& node);
static void fill_j(const json::object& obj, Tree& node);
static void fill_j(const json::value& v, Tree& node);

// Here is the object overload. You can find the rest in the repository.
static void fill_j(const json::object& obj, Tree& node)
{
    for (auto& [key, value]: node) {
        auto it = obj.find(key);
        if (it == obj.end())
            continue;

        std::visit(
            hana::overload(
                [&](Tree& tree) { fill_j(it->value(), tree); },
                [&](size_t idx) { consume_value(it->value(), idx); }),
            value);
    }
}
```

The code is pretty straightforward. We check whether the fields the user
asked for are present in the document and recurse on them. When we reach a
leaf node, the `J` AWK variable is set (that'd be the `consume_value()`
function).

Do notice that there is no uncontrolled recursion here. Recursion is always
JPAT-delimited. And JPAT is part of the source code, not user-input. There
is no risk of stack overflow here.

The pull parser approach (branch master) somewhat follows the same trend,
but the code is bigger. Here's a selected highlight:

```cpp
void fill_j(json::reader& reader, Tree& node)
{
    // ...

    switch (reader.symbol()) {
    case json::token::symbol::error:
        throw json::error{reader.error()};
    case json::token::symbol::begin_object: {
        if (!reader.next()) throw json::error{reader.error()};

        std::string current_key;
        for (;;) {
            if (reader.symbol() == json::token::symbol::end_object) {
                reader.next();
                break;
            }

            // Key
            assert(reader.symbol() == json::token::symbol::string);
            current_key.clear();
            auto ec = reader.string(current_key);
            assert(!ec); (void)ec;

            if (!reader.next()) throw json::error{reader.error()};

            // Value
            auto it = node.find(current_key);
            if (it == node.end()) {
                json::partial::skip(reader);
                continue;
            }
            std::visit(
                hana::overload(
                    [&](Tree& tree) {
                        switch (reader.symbol()) {
                        case json::token::symbol::begin_array:
                        case json::token::symbol::begin_object:
                            fill_j(reader, tree);
                            break;
                        default:
                            json::partial::skip(reader);
                        }
                    },
                    consume_value),
                it->second);
        }

        if (reader.symbol() == json::token::symbol::error)
            throw json::error{reader.error()};
    }
    // ...
    }
}
```

And here is the first example of what I meant by composability. Our
"callback" only gets called to handle trees that are part of the decision
tree one way or another (and the root node always belongs here). When it
reaches a subtree that doesn't belong to the JPAT-defined decision tree,
`json::partial::skip()` is called. `json::partial::skip()` is defined
elsewhere and has no knowledge of the internals of our algorithm. This is a
different "callback" being called to handle a different part of the
tree. The two of them know how to handle different groups of tokens.

A nice side effect is that these "phantom" trees will be fully skipped
over. We don't allocate strings to collect values over them. This is an
optimization we get for free. [Designing JSON parsers to exploit this idea
is not unknown](https://github.com/zserge/jsmn).

The last one will be the code for the `boostjson` branch, but I feel like
it'll be easier to understand if I first introduce the tasks it has to
perform in the token stream.

```
{ "foo" : 42 , "bar" : { "foo" : 33 } }
          ^^
```

When the field's value is parsed, we must match the key against our
rule-set to possibly trigger an action. But it is not enough to store the
current key. The whole path must be stored to support JSON's nestable
structure. We start with the structure that saves the current context
information:

```cpp
struct State
{
    bool is_object; //< false = ARRAY
    std::string current_key;
    size_t current_idx;
    Tree* node;
};
std::vector<State> state;
```

So the beginning of the code for `on_object_begin()` must be:

```cpp
Tree* node = nullptr;
if (state.size() == 0) {
    node = &tree;
} else if (state.back().node) {
    auto it = state.back().node->find(current_key());
    if (it != state.back().node->end())
        node = std::get_if<Tree>(&it->second);
}

state.emplace_back();
state.back().is_object = true;
state.back().node = node;
```

And a matching action in the code for `on_object_end()`:

```cpp
state.pop_back();
```

But it's not really that simple. Paths such as
`/features/10000/properties/LOT_NUM` won't work, because `current_key`
won't be provided by any `on_key()`-like callback when the current node is
`10000`. `current_key` must be computed for arrays in the value handler
itself. That's why the following snippet must be present at the beginning
of every value callback (and `on_{object,array}_begin()` too):

```cpp
if (state.size() > 0 && /*is_array=*/!state.back().is_object)
    update_current_array_key();
```

And the matching helper functions:

```cpp
std::string_view current_key()
{
    assert(state.size() > 0);
    return state.back().current_key;
}

void update_current_array_key()
{
    assert(state.size() > 0);
    assert(!state.back().is_object);

    state.back().current_key.resize(max_digits);
    auto s_size = std::to_chars(
        state.back().current_key.data(),
        state.back().current_key.data() + state.back().current_key.size(),
        state.back().current_idx++
    ).ptr - state.back().current_key.data();
    state.back().current_key.resize(s_size);
}
```

The final touch is to realize that the event stream will be something like:

```
string_key -> value -> string_key -> value -> string_key -> value -> ...
```

And we need to clear the `string_key` at the end of every value consumed:

```cpp
if (state.size() > 0) state.back().current_key.clear();
```

So the callback implementation for a simple atom value such as a string is:

```cpp
bool on_string_part(std::string_view v, std::error_code&)
{
    if (state.size() > 0) {
        if (state.back().node == nullptr)
            return true;

        auto it = state.back().node->find(current_key());
        if (it == state.back().node->end())
            return true;

        if (!std::holds_alternative<size_t>(it->second))
            return true;
    }
    current_str.append(v.data(), v.size());
    return true;
}

bool on_string(std::string_view v, std::error_code&)
{
    awk_value_t idx_as_val;
    if (state.size() == 0) {
        current_str.append(v.data(), v.size());
        make_number(0, &idx_as_val);
    } else {
        if (/*is_array=*/!state.back().is_object)
            update_current_array_key();

        if (state.back().node == nullptr)
            goto end;

        auto it = state.back().node->find(current_key());
        if (it == state.back().node->end())
            goto end;

        auto idx = std::get_if<size_t>(&it->second);
        if (!idx)
            goto end;

        current_str.append(v.data(), v.size());
        make_number(*idx + 1, &idx_as_val);
    }
    {
        awk_value_t value_as_val;
        char* mem = (char*)gawk_malloc(current_str.size() + 1);
        memcpy(mem, current_str.c_str(), current_str.size() + 1);
        set_array_element(
            var_j,
            &idx_as_val,
            make_malloced_string(mem, current_str.size(), &value_as_val));
    }

end:
    if (state.size() > 0) state.back().current_key.clear();
    current_str.clear();
    return true;
}
```

But the code still performs differently. There's that optimization we got
for free with the pull parser choice. We must implement a mechanism to skip
over unwanted trees to make the comparison fair. And here's where a
`skipped_levels` variable is introduced. It changes every callback yet
again, but it finally lets us have the following code for the `string_key`
callback:

```cpp
bool on_key_part(std::string_view v, std::error_code&)
{
    if (skipped_levels)
        return true;

    // not called on phantom trees
    state.back().current_key.append(v.data(), v.size());
    return true;
}

bool on_key(std::string_view v, std::error_code&)
{
    if (skipped_levels)
        return true;

    // not called on phantom trees
    state.back().current_key.append(v.data(), v.size());
    return true;
}
```

The full code can be found in the `include/jsonstream/parser.hpp` file.

This explanation should suffice to understand (to quote myself):

> The defining trait present in pull parsers that is impossible to
> replicate in push parsers is their ability to compose
> algorithms. Composition is enabled by temporarily handing over the token
> stream to a different consumer. Not every consumer needs to know how to
> process every part of the stream. That's a trait impossible to find in
> push parsers.

Every callback depends on implementation knowledge from every other
callback. Spaghetti code, as they say. I also implemented another change to
both branches -- `master` and `boostjson` -- to further demonstrate this
property. What if we want to store the string representation of array
indexes in static arrays instead of `std::string` objects?

There's this changeset for `master` branch:

```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -184,20 +184,19 @@ void fill_j(json::reader& reader, Tree& node)
     case json::token::symbol::begin_array: {
         if (!reader.next()) throw json::error{reader.error()};

-        std::string current_key;
+        std::array<char, max_digits> current_key;
         for (size_t i = 0 ;; ++i) {
             if (reader.symbol() == json::token::symbol::end_array) {
                 reader.next();
                 break;
             }

-            current_key.resize(max_digits);
             auto s_size = std::to_chars(current_key.data(),
                                         current_key.data() + current_key.size(),
                                         i).ptr - current_key.data();
-            current_key.resize(s_size);

-            if (node.count(current_key) == 0) {
+            auto it = node.find(std::string_view(current_key.data(), s_size));
+            if (it == node.end()) {
                 json::partial::skip(reader);
                 continue;
             }
@@ -214,7 +213,7 @@ void fill_j(json::reader& reader, Tree& node)
                         }
                     },
                     consume_value),
-                node[current_key]);
+                it->second);
         }

         if (reader.symbol() == json::token::symbol::error)
```

The changes are local to the code that parses this structure (only the
`case json::token::symbol::begin_array` "callback" changed). It doesn't
affect the rest. As for the `boostjson` branch, the code for every callback
must now execute new code (in case you were curious why we'd need a
`current_key()` helper function):
<https://gitlab.com/vinipsmaker/gawk-jsonstream-archive/-/commit/f4c81abf7e766241cb3729084cf2628eb706267f>.

As an exercise for the reader, try to imagine what changes would be
required in the push parser approach if we wanted to expose a DOM handle in
the `J` AWK variable. As for the pull parser approach, we have a call to
`partial::skip()` within the `consume_value()` function; this call would be
replaced with `partial::parse()`, and that's the whole change.
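
In code, the change would look roughly like this (assuming, as
hypotheticals, that `json::partial::parse()` returns a DOM value for the
current subtree and that an `expose_dom_handle()` helper hands it over to
AWK):

```cpp
// Inside consume_value(), before: skip subtrees that aren't exposed.
json::partial::skip(reader);

// After: materialize the subtree as a DOM value and expose it through `J`.
auto subtree = json::partial::parse(reader);
expose_dom_handle(var_j, idx, std::move(subtree)); // hypothetical helper
```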

### `transform`

By now I have already made my main point: push parsers don't compose. I'll
keep the explanation for the remaining branches brief and only give an
outline of them.

We've made a plug-in to read JSON values. Naturally the next step is to
have the ability to update/write them.

It's not possible to do this with Boost.JSON's parser, so I had to give up
on the `boostjson2` branch and start a new one -- `boostjson2-b`.
`boostjson2-b` is based on `boostjson-dom`, so you can expect similarly
prohibitive costs from the DOM layer.

The strategy here is the same as before. We recursively traverse the stored
DOM using the decision tree to guide the whole process. The previous step
had a `consume_value()` function and the new step will have a
`produce_value()` one:

```cpp
static void produce_value(json::value& dom, size_t idx)
{
    awk_value_t idx_as_val; make_number(idx + 1, &idx_as_val);
    awk_value_t value;

    if (!get_array_element(var_j, &idx_as_val, AWK_UNDEFINED, &value)) {
        dom.emplace_null();
        return;
    }
    switch (value.val_type) {
    case AWK_UNDEFINED:
    case AWK_ARRAY:
        dom.emplace_null();
        break;
    case AWK_NUMBER:
        if (value.num_value == 1.0 && dom.is_bool()) {
            dom.emplace_bool() = true;
        } else if (value.num_value == 0.0 && dom.is_bool()) {
            dom.emplace_bool() = false;
        } else {
            // TODO: respect AWK's OFMT
            dom.emplace_double() = value.num_value;
        }
        break;
    case AWK_STRING:
    case AWK_REGEX:
    case AWK_STRNUM: {
        dom.emplace_string() = std::string_view{
            value.str_value.str, value.str_value.len};
        break;
    }
    case AWK_SCALAR:
    case AWK_VALUE_COOKIE:
        assert(false);
    }
}
```

Then we just serialize the whole DOM tree when the AWK function
`json::output()` is called:

```cpp
static awk_value_t* do_output(int /*num_actual_args*/, awk_value_t* result,
                              awk_ext_func_t* /*finfo*/)
{
    query_j(parsed_value, tree);
    auto serialized = json::to_string(parsed_value);
    char* mem = (char*)gawk_malloc(serialized.size() + 1);
    memcpy(mem, serialized.c_str(), serialized.size() + 1);
    return make_malloced_string(mem, serialized.size(), result);
}
```

For the pull parser we modify the read loop to store the document offsets
(this small diff is the part impossible to achieve with Boost.JSON's
parser):

```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -96,6 +96,11 @@ void fill_j(json::reader& reader, Tree& node)
     // atom values at root level were moved out of this recursive function.

     auto consume_value = [&](size_t idx) {
+        Slice slice;
+        slice.rel_off = reader.literal().data() - cursor;
+        slice.jidx = idx;
+        slice.bool_src = false;
+
         switch (reader.symbol()) {
         case json::token::symbol::end:
         case json::token::symbol::error:
@@ -109,6 +114,8 @@ void fill_j(json::reader& reader, Tree& node)
                 var_j,
                 make_number(idx + 1, &idx_as_val),
                 make_number(reader.value<bool>() ? 1 : 0, &value));
+
+            slice.bool_src = true;
         }
             if (false)
         case json::token::symbol::integer:
@@ -133,12 +140,15 @@ void fill_j(json::reader& reader, Tree& node)
                 make_malloced_string(mem, value.size(), &value_as_val));
         }
         case json::token::symbol::null:
+            slice.size = reader.literal().size();
             if (!reader.next()) throw json::error{reader.error()};
             break;
         case json::token::symbol::begin_array:
         case json::token::symbol::begin_object:
-            json::partial::skip(reader);
+            slice.size = json::partial::skip(reader).size();
         }
+        cursor += slice.rel_off + slice.size;
+        slices.push_back(slice);
     };

     switch (reader.symbol()) {
```

And implement AWK `json::output()` to perform simple string concatenation:

```cpp
static awk_value_t* do_output(int /*num_actual_args*/, awk_value_t* result,
                              awk_ext_func_t* /*finfo*/)
{
    size_t output_size = 0;
    for (auto& slice: slices) {
        output_size += slice.rel_off;
        slice.buffer.clear();

        awk_value_t idx_as_val; make_number(slice.jidx + 1, &idx_as_val);
        awk_value_t value;

        if (!get_array_element(var_j, &idx_as_val, AWK_UNDEFINED, &value)) {
            slice.buffer = "null";
            output_size += slice.buffer.size();
            continue;
        }
        switch (value.val_type) {
        case AWK_UNDEFINED:
        case AWK_ARRAY:
            slice.buffer = "null";
            break;
        case AWK_NUMBER:
            if (value.num_value == 1.0 && slice.bool_src) {
                slice.buffer = "true";
            } else if (value.num_value == 0.0 && slice.bool_src) {
                slice.buffer = "false";
            } else {
                // TODO: respect AWK's OFMT
                json::writer writer{slice.buffer};
                writer.value(value.num_value);
            }
            break;
        case AWK_STRING:
        case AWK_REGEX:
        case AWK_STRNUM: {
            json::writer writer{slice.buffer};
            writer.value(json::writer::view_type{
                value.str_value.str, value.str_value.len});
            break;
        }
        case AWK_SCALAR:
        case AWK_VALUE_COOKIE:
            assert(false);
        }
        output_size += slice.buffer.size();
    }
    auto buffer_end = buffer.data() + record_size - record_rt_size;
    output_size += buffer_end - cursor;
    char*const mem = (char*)gawk_malloc(output_size + 1);
    auto out_it = mem;
    auto in_it = buffer.data();
    for (auto& slice: slices) {
        memcpy(out_it, in_it, slice.rel_off);
        out_it += slice.rel_off;
        in_it += slice.rel_off + slice.size;
        memcpy(out_it, slice.buffer.data(), slice.buffer.size());
        out_it += slice.buffer.size();
    }
    memcpy(out_it, cursor, buffer_end - cursor);
    out_it += buffer_end - cursor;
    *out_it = '\0';
    return make_malloced_string(mem, output_size, result);
}
```

It's clear which solution will perform better here (string concatenation of
a few string slices vs serializing the whole DOM tree again), but there is
another interesting difference to highlight. This difference has nothing
to do with pull parsers vs push parsers and everything to do with the
serialization framework. As we can see from the alternative library here,
we're able to serialize the document piece by piece. Boost.JSON has one
serialization facility -- `json::serializer` and its by-products such as
`json::to_string()` -- and that's it. If you have a domain object along the
lines of:

```cpp
struct Reply
{
    int req_id;
    std::vector<std::string> result;
};
```

The only answer for Boost.JSON would be to first convert the whole object
into `json::value` and serialize the resulting DOM tree. That's an
unacceptable answer.
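
For illustration, here is a rough sketch of that route for the `Reply`
object above (the exact `json::value` construction and assignment overloads
may differ; treat this as approximate Boost.JSON usage):

```cpp
// Build an intermediate DOM from the domain object...
json::value to_json(const Reply& reply)
{
    json::object obj;
    obj["req_id"] = reply.req_id;
    json::array result;
    for (const auto& item: reply.result)
        result.emplace_back(item.c_str());
    obj["result"] = std::move(result);
    return obj;
}

// ...and only then serialize the whole tree in one shot.
std::string payload = json::to_string(to_json(reply));
```

Every field crosses an allocating intermediate representation before a
single byte of output is produced -- exactly the cost the slice-based
`do_output()` above avoids.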

### `transform2`

For the last pair of branches, I demonstrate how partial DOM trees fit in
the design of pull parsers. MongoDB has this `$unwind` operator that will
output a new document for each element of a certain array field of the
input document. So, for the following document:

```
{ "foo" 33 , "bar" : [ 3, 4, "abc" ] }
```

MongoDB's `$unwind` operator can be used to output:

```
{ "foo" 33 , "bar" : 3 }
{ "foo" 33 , "bar" : 4 }
{ "foo" 33 , "bar" : "abc" }
```

For our plug-in, you just fill in the index you'd like to unwind:

```awk
BEGIN {
    JPAT[1] = "/bar"
    JUNWIND = 1 #< index in JPAT array
}
```

For the `boostjson3` branch, the idea is to swap the desired array into a
different place so we can later call `json::output()` cleanly:

```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -74,13 +74,32 @@ static const awk_fieldwidth_info_t zero_fields = {
 static json::parser json_parser;
 static json::value parsed_value;

-static void consume_value(const json::value& v, size_t idx)
+int unwind_idx; //< 1-indexed ; 0 is root ; -1 is unwind unset
+json::array unwinded_values;
+static bool unwind_in_progress;
+static size_t nunwinded;
+
+static void consume_value(json::value& v, size_t idx)
 {
     switch (v.kind()) {
     case json::kind::array:
+        if ((int)idx + 1 == unwind_idx) {
+            v.get_array().swap(unwinded_values);
+            {
+                awk_value_t val;
+                auto ok = sym_update_scalar(
+                    var_julength,
+                    make_number(unwinded_values.size(), &val));
+                assert(ok); (void)ok;
+            }
+            unwind_in_progress = true;
+            nunwinded = 0;
+            break;
+        }
     case json::kind::object:
         break;
     case json::kind::string: {
```

For the `master3` branch, only one change happened to the read loop (again
demonstrating the composability of pull parsers). The idea is to store the
unwind target directly into a domain object (so almost a partial DOM, but
even cheaper). Do notice that only the `json::token::symbol::begin_array`
"callback" changed. Every other "callback" stays the same:

```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -146,6 +151,63 @@ void fill_j(json::reader& reader, Tree& node)
             if (!reader.next()) throw json::error{reader.error()};
             break;
         case json::token::symbol::begin_array:
+            if ((int)idx + 1 == unwind_idx) {
+                const char* head = reader.literal().data();
+                if (!reader.next()) throw json::error{reader.error()};
+                for (;;) {
+                    if (reader.symbol() == json::token::symbol::end_array)
+                        break;
+
+                    awk_value_t value;
+
+                    switch (reader.symbol()) {
+                    case json::token::symbol::end:
+                    case json::token::symbol::error:
+                    case json::token::symbol::end_object:
+                    case json::token::symbol::end_array:
+                        assert(false);
+                    case json::token::symbol::boolean:
+                        make_number(reader.value<bool>() ? 1 : 0, &value);
+                        break;
+                    case json::token::symbol::integer:
+                    case json::token::symbol::real:
+                        make_number(reader.value<double>(), &value);
+                        break;
+                    case json::token::symbol::string: {
+                        auto data = reader.value<std::string>();
+                        char* mem = (char*)gawk_malloc(data.size() + 1);
+                        memcpy(mem, data.c_str(), data.size() + 1);
+                        make_malloced_string(mem, data.size(), &value);
+                    }
+                        break;
+                    case json::token::symbol::null:
+                        value.val_type = AWK_UNDEFINED;
+                        break;
+                    case json::token::symbol::begin_array:
+                    case json::token::symbol::begin_object:
+                        value.val_type = AWK_UNDEFINED;
+                        unwinded_values.push_back(value);
+                        json::partial::skip(reader);
+                        continue;
+                    }
+                    unwinded_values.push_back(value);
+                    if (!reader.next()) throw json::error{reader.error()};
+                }
+                assert(reader.symbol() == json::token::symbol::end_array);
+                const char* tail = reader.literal().end();
+                if (!reader.next()) throw json::error{reader.error()};
+                slice.size = std::distance(head, tail);
+                {
+                    awk_value_t val;
+                    auto ok = sym_update_scalar(
+                        var_julength,
+                        make_number(unwinded_values.size(), &val));
+                    assert(ok); (void)ok;
+                }
+                unwind_in_progress = true;
+                nunwinded = 0;
+                break;
+            }
```

I have explored a few concepts, but there are many more that were left out
(e.g. backtracking, wrapping parsers, chunking). I'm also familiar with
these topics, and my assessment would be the same: either there is no loss
in the pull parser approach and both perform well, or the push parser will
perform significantly worse.

## A faster DOM tree

There is another class of parsers that I haven't talked about. If
parsing-to-DOM speed is the only relevant pursuit for this library, then
it's failing at that too. It should be using destructive parsing
techniques. Decoding strings can be performed in place to put less pressure
on allocators (and to skip several memory copies). I can expand on this
topic if there is interest.
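
To illustrate what I mean by destructive parsing, here is a minimal sketch
(not Boost.JSON API) of in-place string decoding; it assumes a mutable
input buffer whose escape sequences were already validated by the matching
step, and it omits `\uXXXX` handling for brevity:

```cpp
#include <cstddef>

// Decodes a JSON string's escape sequences in place, reusing the input
// buffer instead of allocating a separate decoded string.
// Returns the length of the decoded text now stored at `p`.
std::size_t unescape_in_place(char* p, std::size_t len)
{
    char* out = p;
    for (std::size_t i = 0; i != len; ++i) {
        if (p[i] != '\\') {
            *out++ = p[i];
            continue;
        }
        switch (p[++i]) { // safe: matcher already validated the escapes
        case 'n': *out++ = '\n'; break;
        case 't': *out++ = '\t'; break;
        case '"': *out++ = '"'; break;
        case '\\': *out++ = '\\'; break;
        // ... \/, \b, \f, \r and \uXXXX elided ...
        }
    }
    return static_cast<std::size_t>(out - p); // decoded text is p[0..length)
}
```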

## Misc

Some assorted issues and observations:

- Boost.JSON's serialization framework is practically non-existent. There's
  only one feature. The functions don't help to generate any serialized
  JSON value. Any value to be serialized must first be converted into the
  allocating variant-like class `json::value`.
- The `json::value::get_*()` functions break the current convention for
  bounds-checked accessor functions in a variant class.
- There is no `std::monostate` overload to initialize `json::value`. On a
  personal note, I'd prefer to see an explicit `json::null_t`. I don't like
  to conflate the meanings of terms created for different purposes.
- Documentation for the parser interface is pretty much lacking. Parameters
  are poorly explained and there are plenty of other questions that go
  unanswered. Can the handler pause the parsing loop by returning `false`
  without setting any error code (and how would the stream be resumed)? Can
  the handler callbacks access parser methods such as `depth()`? Can the
  handler callbacks copy the parser object (this would be useful for
  backtracking)?
- The code compiles with warnings.
- The library offers far too little. The soundness of its design can't be
  proven. All it does is convert to/from a variant-like class; every other
  JSON-related use case is left out. In the future, another library in the
  likes of QtXmlPatterns (but for JSON) will be built, but this core --
  Boost.JSON -- will prove unfit for building anything higher-level, and a
  new core will have to be developed from the ground up because of the
  current bad design choices.

## Appendix A: Strings through multiple buffers in pull parsers

There are always compromises when it comes to designing programming
abstractions. For parsers, one such compromise is a parser that never
"fails" to consume input: every input given is fully consumed.

The idea here is to avoid "reparsing" old elements from the stream. For
instance, one document may be received as the following two separate
buffers:

```
"na\u00
EFve"
```

The parser may fail to consume the escape sequence due to the missing
characters and return `nread=3` for the first buffer. The performance
argument in favour of a parser that doesn't fail is irrelevant. There is no
performance gain in not reparsing 4 bytes. If your buffer is so small that
you find yourself constantly reparsing elements, then you have a problem
way before the parsing level; the read syscall overhead would
dominate. That's a pathological case. You must never design abstractions
for a pathological case.

I'll refrain from initiating any debate here on what would constitute a
pathological case for JSON parsers. I just want to make sure the reader
keeps this in mind and is careful not to ask for every feature under the
sun. One must carry a grain of doubt over whether a feature is really
desired. In fact, the parser may not buffer, but [now the consumer does
so](https://github.com/CPPAlliance/json/blob/fc7b1c6fd229e884df09fd45c64c6102d2aa129b/include/boost/json/impl/parser.ipp#L112)
(so much for a chunking parser).

As for the topic of this section: can a pull parser support this case? The
answer is the same as before (to quote myself):

> But as we've seen before, the only difference between pull and push
> parsers lies elsewhere. Any features lateral to this classification
> originate from different taxonomies.

The enabling feature here is not the push interface. Rather, it is the
chosen set of generated events for the JSON token stream. If you change the
events notified for the stream, you enable the same feature. For a pull
parser, these are the events it'd have to provide (among other valid
choices); a sketch follows the list:

- `string` (if non-split)
- `string_run_begin`
- `string_run`
- `string_run_end`
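
A minimal sketch of a consumer handling that event set (the `string_run*`
symbols are hypothetical additions to the reader interface shown earlier in
this review; I'm also assuming `reader.string()` appends the current slice):

```cpp
// Inside the consumer's processing loop:
switch (reader.symbol()) {
case json::token::symbol::string:             // string delivered whole
    current_str.clear();
    reader.string(current_str);
    on_string_value(current_str);             // hypothetical user action
    break;
case json::token::symbol::string_run_begin:   // string split across buffers
    current_str.clear();
    break;
case json::token::symbol::string_run:         // one slice of the split string
    reader.string(current_str);               // append this slice
    break;
case json::token::symbol::string_run_end:
    on_string_value(current_str);
    break;
// ... remaining symbols ...
}
```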

## Appendix B: Why gawk?

There are a few reasons why I've chosen a (GNU) AWK plug-in to illustrate
the concepts presented in this review.

First, I don't want to invent a new tool. Artificial examples can always
be created to fabricate an argument. AWK is one of the classical UNIX
tools. It's established and old. From the [gawk
manual](https://www.gnu.org/software/gawk/manual/html_node/History.html):

> The original version of `awk` was written in 1977 at AT&T Bell
> Laboratories. In 1985, a new version made the programming language more
> powerful [...] This new version became widely available with Unix System
> V Release 3.1 (1987).

We're talking 40 years of history here.

So the tool being established is one reason. Another reason why I've
chosen AWK is that AWK has no JSON processing facilities, so the plug-in
fills a gap in an existing tool.

The third reason why I've chosen GNU AWK to create this plug-in is that
gawk already has an extension mechanism. Therefore I don't have the freedom
to modify the AWK processor to accommodate the JSON library's
idiosyncrasies. On the contrary, it's the JSON library that has to prove
its flexibility.

Now there's another question to answer here: why take MongoDB as
inspiration when designing the gap to be filled in AWK (such as the
`$unwind` operator)? The answer follows the same logic: MongoDB is one of
the JSON-oriented DBMSs that got popular back in the NoSQL boom days.

The end result tries to fit into AWK's mindset and not the other way
around. AWK is a tool designed for tabular data processing. JSON is in no
way tabular, but its structure allows us to build a tabular view of the
underlying data.

There are two main variables in gawk to control how the fields are
extracted:

- `FS`: defines how to find the field _separators_.
- `FPAT`: defines how to find the field _contents_.

The plug-in just follows this trend and introduces a `JPAT` variable that
also instructs how to find the field _contents_. `JPAT` is an array and
each element has a [standard JSON
Pointer](https://tools.ietf.org/html/rfc6901) expression that was designed
exactly for this purpose.

There's one very interesting fact about this outcome. The usage pattern
here removes from the library the separate step that processes the stream
(the step that would happen with the DOM approach) and embeds much of the
processing logic in the parsing phase itself. That's a continuation of the
trend we saw earlier about merging the matching and decoding steps. It's
also the same approach we'd use to build serious JSON processing tools such
as [JMESPath](https://jmespath.org/) (some JMESPath expressions require a
partial DOM, but it's doable to use a hybrid approach anyway, as has
already been demonstrated in the `JUNWIND` code).

## Appendix C: Static decision trees with Boost.Hana

The reader may be interested to know that the decision tree auxiliary to
the parsing process developed for our (GNU) AWK plug-in is useful not only
for AWK but for C++ programs as well. The decision tree strategy may even
use Boost.Hana compile-time structures to fully unroll the code. It can be
used to implement facilities such as:

```cpp
using hana::literals::operator""_s;
int foo;
std::string bar;
json::scanf(input, "{ foo: { bar: ? }, bar: ? }"_s, foo, bar);
```

That's also possible with Boost.JSON's current parser. But the following is
not:

```cpp
using hana::literals::operator""_s;
Foo foo;
std::string bar;
json::scanf(
    input,
    "{ foo: { bar: ? }, bar: ? }"_s,
    [&foo](json::reader& reader) {
        // ...
    },
    bar);
```

As has been explained before, push parsers don't compose. And you aren't
limited to root-level scanning. You should have `json::partial::scanf()` to
act on subtrees too. A prototype for this idea can be found at
<https://github.com/breese/trial.protocol/pull/43>.

## Review questions

> Please be explicit about your decision (ACCEPT or REJECT).

REJECT.

> How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?

I've stalled a project of mine for a few weeks just to participate in this
review process. I think I've spent close to 50 hours.

I wrote a JSON tool in Boost.JSON and the same tool under a different
parser model to highlight important traits that were being dismissed in
previous debates. I've gone through the documentation and source code to
better understand some Boost.JSON details.

I've also been involved in previous Boost.JSON debates in this very mailing
list.

> Are you knowledgeable about the problem domain?

I've tested many JSON parser libraries in C++ up to this point. In my last
job I leveraged a JSON (pull) parser to merge multiple validation layers
into a one-pass operation and later leveraged Boost.Hana to automate the
parser generation. Our API is JSON-based and we apply plenty of
transformations (including DOM-based ones when unavoidable), some similar
to the GNU AWK plug-in demonstrated here (e.g. synthesizing a new field on
routed messages while avoiding DOM-tree-related costs).

I've contributed a few algorithms and subtree parsing ideas back to the
leveraged JSON (pull) parser. I offered the same composability ideas to
Boost.JSON a few months back when the topic appeared here in this very
mailing list.

In the past, I've designed parsers in C++ -- pull and push parsers. I've
played with some tools for parser generation such as re2c, gperf (gperf
might not really be parsing, but it is related), and Boost.Xpressive. I've
also experienced the pain of parsers that try to be too lightweight but
ultimately backfire, not only moving performance costs back to the consumer
but also greatly increasing complexity.

--
Vinícius dos Santos Oliveira
https://vinipsmaker.github.io/
