Boost logo

Boost :

From: Vinnie Falco (vinnie.falco_at_[hidden])
Date: 2020-09-23 00:52:39


On Tue, Sep 22, 2020 at 4:47 PM Gavin Lambert via Boost
<boost_at_[hidden]> wrote:
> I just don't see the point in having an interface that can return half
> of a value -- this just moves the problem of allocation and keeping
> track of the full value to the parser handler, which seems like an SRP
> violation.

It is the opposite of SRP violation. basic_parser is concerned only
with parsing and not keeping track of the full value. The
responsibility of tracking full values is moved to the caller. That
way, basic_parser does not impose its opinions on how the full value
should be stored. This was litigated in earlier posts if I recall
(Peter?).

> >> I don't feel it is ever sensible to call back with
> >> a partially-parsed value (unless I am misunderstanding what these are
> >> for, but there is little explanation in the docs, and I wasn't able to
> >> get them called at all in my test).

The interfaces are built up in layers. At the lowest layer you have
basic_parser which concerns itself only with processing the input
buffer and turning it into semantic components. It does _not_ have the
responsibility of buffering input to present it as a larger component,
that's the job of the handler. basic_parser is exposed to users so
they can use their own schemes for dealing with the semantic
components.

Above basic_parser, you have parser which uses value_stack. The
value_stack does take responsibility for buffering input, and
producing DOM values. As with basic_parser, both of these components
are exposed to users so they can write their own layers on top.

On top of parser we have the parse() free functions, which address the
most popular use-case of parsing. These functions will satisfy most
needs. And for the needs that are not satisfied, users can easily
implement their own parse functions with different behaviors, as the
library exposes all the intermediate lower levels.

> > Right, so this size argument is actually the total number of
> > characters, while the string_view::size is the number of characters in
> > the current piece.
>
> Which is a poor design choice, as I said.

No, the information is already there so we are just passing it to the
handler which can use it, or not. If the handler doesn't use it, then
it is optimized away.

> parser.write(&data1[0], data1.size(), ec);
> parser.write(&data2[0], data2.size(), ec);
> parser.write(&data3[0], data3.size(), ec);

You have to call basic_parser::reset in order to start parsing a new "document."

> So I reiterate that the number and string handling is inconsistent and
> these part methods are more confusing than helpful -- numbers are parsed
> into the complete value but strings are not, and neither give you the
> full text of the value (which may be important if the goal is to parse
> into an arbitrary-precision value type).
>
> I do regard this as a very poor design choice, but not sufficiently so
> to reject the library.

No it is not a poor design choice, it is the correct design choice. If
basic_parser were to buffer the input, then users who have no need of
buffered input (such as value_stack) would be paying for it for
nothing. This interface is driven first and foremost for performance
and the incremental/streaming functionality. If callers want to see
full text, they can append the partial strings into a data member
until the final on_string or on_int64 / on_uint64 / on_double and then
interact with it that way. But they are not obliged to do so.

Again I repeat, this is not a poor design choice. json::string for
example is perfectly capable of being constructed from two separate
buffers of characters (an implementation detail):

<https://github.com/CPPAlliance/json/blob/6ddddfb16f9b54407d153abdda11a29f74642854/include/boost/json/impl/value_stack.ipp#L428>

If we were to follow your "correct design" to make basic_parser buffer
the input, the implementation would be needlessly allocating memory
and copying characters.

> I was assuming that a "deep" JSON (many nested objects) may have a
> different cost per output character than a wide but shallow one. But
> it's not a big deal.

Yes, it isn't perfect of course but over time and with a reasonable
distribution of possible JSON inputs it averages out to be good
enough. Certainly better than not having streaming at all (which would
be terrible).

> > To store the comments would mean storing them in the DOM, which would
> > have a significant impact on performance as the space for everything
> > is tightly controlled.
>
> I don't see why. It's just an extra string associated with any value.

Yes exactly. sizeof(value)==16 on 32bit platforms, and
sizeof(value)==24 on 64-bit platforms. When arrays and objects are
constructed during parsing, the implementation performs a memcpy after
allocation. Every additional byte in the sizeof(value) slows down the
creation of arrays and objects. Adding a pointer to value would cost
25% on 32-bit and 33% on 64-bit, and make Boost.JSON unable to ever
beat RapidJSON in performance.

> This just seems like
> it's doing an extra count/comparison that could be removed entirely, and
> removing it would naturally improve performance.

If you don't want a limit then just set the max to `size_t(-1)` and
the compiler will elide the check completely.

> Unless you're saying that moving the check to basic_parser improves the
> performance of parser, in which case: you're doing it wrong. Imposing
> restrictions on all users of basic_parser for something that only parser
> cares about is the wrong design choice.

The design of these components balances usability with performance. If
you don't care about limits you can just set the max to size_t(-1).
And I think this is the right choice. The library makes common things
easy, e.g. call parse(). If you need more complexity, you instantiate
a `parser` whose API requires a few more steps. If you really want to
go low level, you use `basic_parser` which being low level can seem
esoteric.

And I think this is okay, because the percentage of users who will use
basic_parser is incredibly tiny. The percentage of `parser` users much
greater. While the percentage of users of the parse() free function
will be closer to 100% than anything else. In other words the library
favors performance and utility at the lowest levels over theoretical
interface correctness, where the number of users is naturally smaller
(and their domain-expertise naturally higher).

Regards


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk