
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2020-09-18 08:57:03


On 17/09/2020 20:30, Vinícius dos Santos Oliveira via Boost wrote:

> As it has been explained before, push parsers don't compose. And you aren't
> limited to root-level scanning. You should have `json::partial::scanf()` to
> act on subtrees too. A prototype for this idea can be found at
> <https://github.com/breese/trial.protocol/pull/43>.

Firstly, that was a great essay on the backing theory, Vinícius. I only
wish more people who write parsers would read it first. I would urge you
to convert it into a blog post or something similar and put it online
where people can find it, so that all that great explanation of the
theory doesn't get lost forever.

> ## Review questions
>
>> Please be explicit about your decision (ACCEPT or REJECT).
>
> REJECT.

I understand your motivation here, and given that nobody else on this
list will say it, you are absolutely right that in the general case
pull-based parsers are the right choice. I chose a pull design for pcpp
(the pure Python C preprocessor), even though, strictly speaking, it
isn't necessary for parsing content whose length is always fully known
in advance. Pull-based designs are simply better in general: more
flexible and more extensible.
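
To make the composability point concrete, here is a minimal sketch of a
pull-style cursor in C++. The names (pull_cursor, token, skip_value) are
purely illustrative, not the API of Boost.JSON, trial.protocol, or pcpp;
the point is only that the caller drives the token stream, so a routine
for one subtree can borrow the cursor and return at the matching close:

// Illustrative pull-style tokenizer; strings, numbers and literals are
// lumped into `scalar` and escapes are ignored to keep the sketch short.
#include <cctype>
#include <cstddef>
#include <string_view>

enum class token { begin_object, end_object, begin_array, end_array,
                   scalar, end_of_input };

class pull_cursor {
public:
    explicit pull_cursor(std::string_view json) : in_(json) {}

    // Advance to the next token; the caller decides what to do with it.
    token next() {
        while (pos_ < in_.size() && is_filler(in_[pos_])) ++pos_;
        if (pos_ >= in_.size()) return token::end_of_input;
        switch (in_[pos_]) {
        case '{': ++pos_; return token::begin_object;
        case '}': ++pos_; return token::end_object;
        case '[': ++pos_; return token::begin_array;
        case ']': ++pos_; return token::end_array;
        case '"':  // string: consume up to the closing quote
            ++pos_;
            while (pos_ < in_.size() && in_[pos_] != '"') ++pos_;
            if (pos_ < in_.size()) ++pos_;
            return token::scalar;
        default:   // number / true / false / null
            while (pos_ < in_.size() && !is_filler(in_[pos_]) &&
                   in_[pos_] != '}' && in_[pos_] != ']')
                ++pos_;
            return token::scalar;
        }
    }

private:
    static bool is_filler(char c) {
        return std::isspace(static_cast<unsigned char>(c)) ||
               c == ',' || c == ':';
    }
    std::string_view in_;
    std::size_t pos_ = 0;
};

// A composable subtree consumer: call it right after begin_object or
// begin_array and it skips to the matching close, leaving the cursor
// positioned for whoever called it. It needs only the cursor, not the
// whole document, which is what lets independent value parsers compose.
inline void skip_value(pull_cursor& cur) {
    for (int depth = 1; depth > 0;) {
        switch (cur.next()) {
        case token::begin_object: case token::begin_array: ++depth; break;
        case token::end_object:   case token::end_array:   --depth; break;
        case token::end_of_input: return;
        default: break;
        }
    }
}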

In the very specific case of parsing JSON, however, I'm not sure the
standard rules of evaluation apply. The author of sajson claims that
most of his speed comes from not being a pull parser. What you do is
zero-copy DMA the incoming socket data into a memory-mapped buffer, then
run sajson's AST parse on that known-sized buffer. The parse encodes the
AST directly into the source by modifying the buffer in place, avoiding
dynamic memory allocation completely, and voilà: there's your JSON,
parsed with a strict minimum of memory copied or cache lines modified.
He claims, and I have no reason to doubt him, that because he can make
these hard-coded assumptions about the input buffer he was able to write
a very fast JSON parser (amongst the fastest non-SIMD parsers). By
inference, a pull parser couldn't be as fast.
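
To be clear about the shape I mean, here is a rough sketch of the
general in-situ technique. It is my own illustration with invented names
(node, arena, parse_in_place), not sajson's actual data layout or API:

// The caller supplies one mutable buffer holding the JSON text plus a
// fixed arena for AST nodes, so the parser never calls malloc and string
// values alias the input bytes directly.
#include <cstddef>
#include <cstdint>
#include <string_view>

enum class node_kind : std::uint8_t { object, array, string, number,
                                      boolean, null };

struct node {
    node_kind        kind;
    std::string_view text;              // points back into the JSON buffer
    std::uint32_t    first_child  = 0;  // index into the same arena, 0 = none
    std::uint32_t    next_sibling = 0;
};

struct arena {
    node*       nodes;     // caller-provided storage, e.g. stack or mmap'd
    std::size_t capacity;
    std::size_t used = 0;

    // Returns nullptr instead of growing when full -- the parser reports
    // an error rather than touching the heap.
    node* push(node_kind k, std::string_view text) {
        if (used == capacity) return nullptr;
        nodes[used] = node{k, text};
        return &nodes[used++];
    }
};

// Caller side: DMA / read the socket payload into `buf`, size the arena
// from the payload length (the value count can never exceed the byte
// count of the document), then parse with zero further allocations.
// `parse_in_place` is a hypothetical parse routine, named for the sketch.
//
//   char  buf[64 * 1024];               // filled with JSON text
//   node  storage[64 * 1024];
//   arena ast{storage, 64 * 1024};
//   parse_in_place(buf, bytes_read, ast);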

I find that explanation by sajson's author compelling. The fact that he
avoids dynamic memory allocation altogether and builds the AST inline in
the original JSON buffer is particularly persuasive.

I haven't looked at Boost.JSON. But it seems to target a more idiomatic
C++ API, to be pluggable for other formats like Boost.Serialization, and
to retain most of the performance of JSON parsers such as sajson or
simdjson. As Boost reviews primarily judge API design, Boost.JSON's
choice of approach fits the process here well: Boost prefers purity over
performance.

Personally speaking, for JSON I care solely and exclusively about
maximum possible parse speed. I have no use for customisation or
extensibility. If Boost.JSON beats sajson in in-place AST building and
it also beats simdjson, I'll use it. If it doesn't, I won't.

I suspect that by far most users of JSON would have the exact same
attitude as I do. For users like us, we really don't care what the
parser does, how it is designed, or whatever crappy API it might have;
all we care about is maximum possible data extraction performance. Never
ever calling malloc is an excellent sign of the right kind of JSON
parser design, at least in my book.

Niall
