
From: Gavin Lambert (boost_at_[hidden])
Date: 2020-09-22 23:47:07


On 23/09/2020 05:55, Vinnie Falco wrote:
> On Tue, Sep 22, 2020 at 1:13 AM Gavin Lambert wrote:
>> Regarding the SAX parser specifically, I dislike the *_part methods in
>> the callback interface. I feel that these should be removed and only
>> the final form retained, called only when the complete text of the value
>> has been determined.
>
> The parser (and serializer) support incremental / streaming operation.
> To provide the "complete" text would require allocating memory and
> saving the partial result until enough input has been supplied to
> finish the text. basic_parser does not currently allocate memory,
> which was a design goal, and this would change that.

Not necessarily. A small buffer can be statically allocated for a
single value-in-progress. That is trivial for all numeric values (except
perhaps pathological precision, but that can be ignored). It's harder
for string values, which might have to support dynamic allocation, it's true.

I just don't see the point in having an interface that can return half
of a value -- this just moves the problem of allocation and keeping
track of the full value to the parser handler, which seems like an SRP
violation.
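
To illustrate the point, here is roughly what every consumer of
basic_parser ends up writing for itself (a sketch only; the callback
signatures follow the handler exemplar, but the accumulation strategy
and names are mine):

    #include <boost/json.hpp>
    #include <string>

    namespace json = boost::json;

    // Rough sketch, not library code: the bookkeeping a handler must do
    // itself to recover a complete string from the *_part callbacks.
    struct accumulating_handler
    {
        std::string partial_;   // fragments received so far

        bool on_string_part(json::string_view s, std::size_t, json::error_code&)
        {
            partial_.append(s.data(), s.size());   // keep this fragment
            return true;
        }

        bool on_string(json::string_view s, std::size_t, json::error_code&)
        {
            partial_.append(s.data(), s.size());   // final fragment
            // partial_ now holds the complete string; consume it, then reset
            partial_.clear();
            return true;
        }

        // ... plus the max_*_size members and remaining on_* callbacks
        // that basic_parser requires (as in the null_parser example) ...
    };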

>> I don't feel it is ever sensible to call back with
>> a partially-parsed value (unless I am misunderstanding what these are
>> for, but there is little explanation in the docs, and I wasn't able to
>> get them called at all in my test).
>
> In order to see partial results, you have to invoke the parser with
> partial JSON. In other words, break up the input JSON into two or more
> consecutive buffers of characters, and supply them in order.

I tried that, but the parse simply failed at the end of the first
partial buffer and completely ignored the second buffer. Hence my
statement that basic_parser did not operate incrementally. Though see
below; this appears to be a failure of the examples rather than of the
implementation.

>> I also don't see the point of supplying a size argument to on_key* and
>> on_string, as this can be read from the string_view argument instead.
>
> Right, so this size argument is actually the total number of
> characters, while the string_view::size is the number of characters in
> the current piece.

Which is a poor design choice, as I said.

>> parser.write() is incremental but basic_parser.write() is not, which
>> seems odd. I assume this means that parser is storing an incrementally
>> populated buffer of the full document rather than _really_ incrementally
>> parsing.
>
> Everything is incremental. parser::write() has the condition that if
> there are any leftover characters after receiving a complete JSON, an
> error is returned. Contrast this with parser::write_some which does
> not produce an error in this case. basic_parser::write is more similar
> to parser::write_some. We will be renaming it to
> basic_parser::write_some, as another reviewer suggested, to make this
> more clear. There is also a synergy with asio's naming with that
> change.

I wrote a print_parser (which is just like the null_parser example but
prints the method calls and parameters) and invoked it like so:

        json::error_code ec;
        print_parser parser;
        std::string data1 = R"j({"foo":14,"bar":1)j";
        std::string data2 = R"j(8,"baz":"bo)j";
        std::string data3 = R"j(b"})j";
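        // note: the number 18 and the string "bob" are each split across two buffers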
        parser.write(&data1[0], data1.size(), ec);
        parser.write(&data2[0], data2.size(), ec);
        parser.write(&data3[0], data3.size(), ec);

It stopped after calling on_int64(1, "1") and did not continue parsing,
completely ignoring the second buffer, never calling any _part methods,
and reporting the wrong integer value. It parses correctly only when the
complete text is supplied in a single call. Hence, as I said, it was not
behaving incrementally.

However, I have since noticed that the missing link is that 'true' must
be passed as the 'more' parameter to basic_parser.write, which is not
demonstrated in any example; it may be helpful to show that (see the
sketch below). (In fact this resolves my other concern about
multi-document parsing as well, which could also use an explicit example.)
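
For the archives, the incremental invocation that finally worked looks
roughly like this (a sketch: I am assuming print_parser is extended to
forward a 'more' flag, and the parameter order shown follows the planned
write_some naming rather than necessarily the exact reviewed signature):

        // Sketch only: 'true' means further buffers will follow, 'false'
        // marks the final one; print_parser forwards the flag to
        // basic_parser::write (to become write_some).
        parser.write(true,  &data1[0], data1.size(), ec);
        parser.write(true,  &data2[0], data2.size(), ec);
        parser.write(false, &data3[0], data3.size(), ec);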

Once this is done, then (as the docs describe) it reports these calls
(in part):

     on_number_part("1")
     on_int64(18, "8")

     on_string_part("bo", 2)
     on_string("b", 3)

So I reiterate that the number and string handling is inconsistent and
these part methods are more confusing than helpful -- numbers are parsed
into the complete value but strings are not, and neither gives you the
full text of the value (which may be important if the goal is to parse
into an arbitrary-precision value type).

I do regard this as a very poor design choice, but not sufficiently so
to reject the library.

(I still also regard SAX-style parsing itself as a poor design choice
[vs pull parsing], due to poor composability, but that is not a failure
of the library itself, except perhaps in offering it at all. But I can
understand why.)

> Well, yeah, that's exactly what it is. Consider a network server that
> delivers JSON to clients. We don't want a completion handler to be
> stuck serializing 100MB of JSON while another client is waiting who
> only needs 1MB serialized. The way to ensure fairness here is to
> serialize incrementally with some reasonable buffer size, say 32kb.
> The light client will finish in ~33 i/o cycles while the heavy client
> still makes progress.

I was assuming that a "deep" JSON (many nested objects) may have a
different cost per output character than a wide but shallow one. But
it's not a big deal.
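
To check my understanding of the intended usage, the incremental
serialization loop would look something like this (pieced together from
the serializer reference, so treat the details as approximate; 'jv' and
'send_to_client' are placeholders of mine):

    json::serializer sr;
    sr.reset(&jv);                     // jv: the json::value to deliver
    while(!sr.done())
    {
        char buf[32768];               // the "reasonable buffer size"
        json::string_view out = sr.read(buf, sizeof(buf));
        send_to_client(out);           // hypothetical: hand this chunk to the socket
    }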

> To store the comments would mean storing them in the DOM, which would
> have a significant impact on performance as the space for everything
> is tightly controlled.

I don't see why. It's just an extra string associated with any value.

FWIW, I don't think it needs to *precisely* round-trip the comments --
it's ok to convert comments to /**/ style (that's mandatory if the
output isn't pretty-printed) and to consolidate all comments associated
with one value into a single comment that is always output after that
value (or perhaps to have 'before' and 'after' slots). This should be
much more manageable.
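
To make that concrete, I am imagining something along these lines
(entirely hypothetical; none of these names exist in the library):

    // Hypothetical sketch of the 'before'/'after' comment slots idea.
    struct commented_value
    {
        json::value value;
        std::string comment_before;   // emitted as /* ... */ before the value
        std::string comment_after;    // emitted as /* ... */ after the value
    };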

As it stands, the library is completely useless for the case of writing
a JSON config file intended for human use. It's ok if you don't want
that to be a goal, but this should be made clear up front so that people
don't waste time considering the library for use cases it's not intended
for. (Though it would be preferable if you did support it.)

>> The basic_parser examples do not compile out of the box -- with the
>> suggested json.hpp and src.hpp includes there are link errors with
>> basic_parser symbols. Additionally including basic_parser.hpp resolves
>> these but starts complaining about missing
>> max_{key,string,array,object}_size members of the handler type, which
>> are not documented. After these are supplied then it works as expected.
>> I don't see why these should be needed, however.
>
> Those max sizes should be documented in the handler exemplar:
>
> <https://master.json.cpp.al/json/ref/boost__json__basic_parser.html#json.ref.boost__json__basic_parser.handler0>

The docs I am referring to are the ones referenced in the original
pre-announcement post (although now I notice that the main announcement
post has a different link):

 
http://vinniefalco.github.io/doc/json/json/ref/boost__json__basic_parser.html

This is also where the "using numbers" section is entirely blank.

> The reason they are required is for optimization. Handling the limit
> in the basic_parser leads to performance gains.

I don't really understand why. basic_parser doesn't actually store any
of the keys/values (especially as it doesn't merge value parts), so it
should have no reason to impose any kind of limit. This just seems like
it's doing an extra count/comparison that could be removed entirely, and
removing it would naturally improve performance.

Unless you're saying that moving the check to basic_parser improves the
performance of parser, in which case: you're doing it wrong. Imposing
restrictions on all users of basic_parser for something that only parser
cares about is the wrong design choice.
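
Coming back to the undocumented max_* members: for reference, what I
ended up adding to the handler type looks roughly like this (a sketch;
std::size_t(-1) is just an effectively-unlimited placeholder, not a
recommendation):

    // Sketch: the members basic_parser complains about when they are missing.
    struct handler
    {
        static constexpr std::size_t max_object_size = std::size_t(-1);
        static constexpr std::size_t max_array_size  = std::size_t(-1);
        static constexpr std::size_t max_key_size    = std::size_t(-1);
        static constexpr std::size_t max_string_size = std::size_t(-1);

        // ... on_* callbacks as in the null_parser example ...
    };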

