From: Dominique Devienne (ddevienne_at_[hidden])
Date: 2019-09-23 18:36:05


On Mon, Sep 23, 2019 at 6:12 PM Glen Fernandes via Boost <
boost_at_[hidden]> wrote:

> Dominique explained some of the pull (stax) / push (sax) terminology to me
> off-list, and I agree. This does appear to be the more appealing
> underlying facility.
>

I didn't realize it was off-list; a plain Reply usually goes to the list.
It doesn't matter, though: Bjorn explained it better than I did anyway.

On Mon, Sep 23, 2019 at 6:11 PM Vinnie Falco via Boost <
boost_at_[hidden]> wrote:

> On Mon, Sep 23, 2019 at 8:58 AM Bjorn Reese via Boost
> <boost_at_[hidden]> wrote:
> > ...online parser...
> > A push parser (SAX)...
> > A tree parser (DOM)
>
> I have no experience with these terms other than occasionally coming
> across them in my Google searching adventures. The parsers that I have
> written take as input one or more buffers of contiguous characters,
> and produce as "output" a series of calls to abstract member functions
> which are implemented in the derived class. These calls represent
> tokens or events, such as "key string", "object begin", "array end".
> So what would we call this in the taxonomy above?
>

That's a PUSH parser, IMHO. The docs for Qt's XML PULL parser
may make the distinction clearer:
https://doc.qt.io/qt-5/qxmlstreamreader.html#details

Many of these terms originated in the XML world, and many (like SAX) from
the Java world too.

To give you a feel for it, here's my PUSH parser API:

class JSONHandler {
public:
...
    virtual bool handle_object_begin();
    virtual bool handle_object_key(const std::string& key);
    virtual bool handle_object_end();

    virtual bool handle_array_begin();
    virtual bool handle_array_end();

    virtual bool handle_number(int);
    virtual bool handle_number(int64_t);
    virtual bool handle_number(uint64_t);
    virtual bool handle_number(double value);
    virtual bool handle_string(const std::string& value);
    virtual bool handle_boolean(bool value);
    virtual bool handle_null();
...
};
bool json_parse(const char* json_utf8_text, size_t len, JSONHandler& handler);

And here's my PULL parser API:

enum JSONParsingEventType {
    //! Special end-of-document token.
    JSON_END = 0,

    // Value tokens.
    JSON_NULL,
    JSON_TRUE,
    JSON_FALSE,
    JSON_STRING,
    JSON_NUMBER,

    JSON_OBJECT_BEGIN,
    JSON_OBJECT_KEY,
    JSON_OBJECT_END,

    JSON_ARRAY_BEGIN,
    JSON_ARRAY_END,
...
};

class JSONReader {
public:
    JSONReader(
        const char* json_utf8_text, size_t len,
        const JSONParserOptions& options = JSONParserOptions()
    );
    ~JSONReader();

    JSONParsingEventType peek() const;
    JSONParsingEventType next();
    JSONParsingEventType current() const;

    size_t skip_next();
    size_t skip_current();

    JSONToken token();
    size_t depth();
    size_t count();

    bool is_integral();

    int get_int();
    int64_t get_int64_t();
    uint64_t get_uint64_t();

    float get_float();
    double get_double();

    std::string get_string();
    std::string get_string_or_null();

    bool get_boolean();

    std::string get_key();
    bool is_key(const char* key);
    bool is_key(const char* key, size_t len);
...
};

where JSONToken is basically a std::string_view-like object into the raw
JSON doc bytes, with low-level info for more control: whether a number has
a sign, fractional point, or exponent, and whether a string has escaped
characters (including Unicode ones), i.e. it can't be used as-is and must
be decoded according to JSON rules to get back UTF-8 text.
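
To make the "can't be used as-is" point concrete, here is a minimal sketch
of decoding such a token. ToyToken, append_utf8 and decode_string are
hypothetical names invented for illustration, not the JSONToken API above,
and the sketch assumes the token covers the raw bytes between the quotes.
Surrogate pairs (\uD800 to \uDFFF) are rejected rather than combined.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a token: a view into the raw document bytes
// plus one low-level flag recorded while scanning.
struct ToyToken {
    const char* data;   // first raw byte after the opening quote
    std::size_t size;   // number of raw bytes before the closing quote
    bool has_escapes;   // true if the raw bytes contain a backslash
};

// Appends a code point below 0x10000 to out as UTF-8.
void append_utf8(std::string& out, unsigned cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decodes the raw bytes of a JSON string into UTF-8 text.
// Returns false on a malformed or unsupported escape sequence.
bool decode_string(const ToyToken& tok, std::string& out) {
    assert(tok.data != nullptr);
    out.clear();
    if (!tok.has_escapes) {  // fast path: the raw bytes are already the text
        out.assign(tok.data, tok.size);
        return true;
    }
    for (std::size_t i = 0; i < tok.size; ++i) {
        const char c = tok.data[i];
        if (c != '\\') { out += c; continue; }
        if (++i >= tok.size) return false;  // dangling backslash
        switch (tok.data[i]) {
        case '"':  out += '"';  break;
        case '\\': out += '\\'; break;
        case '/':  out += '/';  break;
        case 'b':  out += '\b'; break;
        case 'f':  out += '\f'; break;
        case 'n':  out += '\n'; break;
        case 'r':  out += '\r'; break;
        case 't':  out += '\t'; break;
        case 'u': {
            if (i + 4 >= tok.size) return false;  // need four hex digits
            unsigned cp = 0;
            for (int k = 0; k < 4; ++k) {
                const char h = tok.data[++i];
                cp <<= 4;
                if (h >= '0' && h <= '9') cp |= static_cast<unsigned>(h - '0');
                else if (h >= 'a' && h <= 'f') cp |= static_cast<unsigned>(h - 'a' + 10);
                else if (h >= 'A' && h <= 'F') cp |= static_cast<unsigned>(h - 'A' + 10);
                else return false;
            }
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // not handled here
            append_utf8(out, cp);
            break;
        }
        default: return false;
        }
    }
    return true;
}
```

The flag is what lets a reader hand back the raw view untouched when no
decoding is needed, and only pay for a copy when escapes were seen.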

The former parser "pushes" information at you, the client code; the parser
does the looping. In the latter, the client code is in the driver's seat:
it does the looping and controls the parser, extracting information out of
it. There's also no inheritance necessary with a PULL parser, whether
virtual or static (CRTP).
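
The contrast can be sketched on a toy grammar, assuming the input is just a
flat array of non-negative integers such as "[1, 22, 3]". Every name below
(ToyReader, ToyHandler, toy_parse, sum_push, sum_pull) is hypothetical and
far simpler than the two APIs quoted above; there is no error reporting.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class Event { ArrayBegin, Number, ArrayEnd, End };

// PULL style: the client calls next() and owns the loop.
class ToyReader {
public:
    explicit ToyReader(const std::string& text) : text_(text) {}

    Event next() { current_ = scan(); return current_; }
    Event current() const { return current_; }

    int get_int() const {
        assert(current_ == Event::Number);  // only valid right after a Number
        return value_;
    }

private:
    static bool is_digit(char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    }
    static bool is_skippable(char c) {
        return c == ',' || std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    Event scan() {
        while (pos_ < text_.size() && is_skippable(text_[pos_])) ++pos_;
        if (pos_ >= text_.size()) return Event::End;
        const char c = text_[pos_];
        if (c == '[') { ++pos_; return Event::ArrayBegin; }
        if (c == ']') { ++pos_; return Event::ArrayEnd; }
        if (is_digit(c)) {
            value_ = 0;
            while (pos_ < text_.size() && is_digit(text_[pos_]))
                value_ = value_ * 10 + (text_[pos_++] - '0');
            return Event::Number;
        }
        pos_ = text_.size();  // anything else simply ends the stream
        return Event::End;
    }

    std::string text_;
    std::size_t pos_ = 0;
    int value_ = 0;
    Event current_ = Event::End;
};

// PUSH style: the parser owns the loop and calls back into a handler.
struct ToyHandler {
    virtual ~ToyHandler() = default;
    virtual bool handle_array_begin() { return true; }
    virtual bool handle_number(int) { return true; }
    virtual bool handle_array_end() { return true; }
};

// The push parser is written on top of the pull reader.
bool toy_parse(const std::string& text, ToyHandler& handler) {
    ToyReader reader(text);
    for (;;) {
        switch (reader.next()) {
        case Event::ArrayBegin:
            if (!handler.handle_array_begin()) return false;
            break;
        case Event::Number:
            if (!handler.handle_number(reader.get_int())) return false;
            break;
        case Event::ArrayEnd:
            if (!handler.handle_array_end()) return false;
            break;
        case Event::End:
            return true;
        }
    }
}

// Push client: state has to live in a handler object.
struct SumHandler : ToyHandler {
    int sum = 0;
    bool handle_number(int v) override { sum += v; return true; }
};

int sum_push(const std::string& text) {
    SumHandler handler;
    toy_parse(text, handler);
    return handler.sum;
}

// Pull client: state lives in ordinary local variables.
int sum_pull(const std::string& text) {
    ToyReader reader(text);
    int sum = 0;
    for (Event e = reader.next(); e != Event::End; e = reader.next())
        if (e == Event::Number) sum += reader.get_int();
    return sum;
}
```

Note that toy_parse is a thin loop over ToyReader, while going the other
way (a pull interface on top of callbacks) would need buffering or
coroutines.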

As Bjorn wrote, a PULL parser is the lowest-level building block, and the
most convenient one to use.

A PULL parser is typically passed around to code that decodes various data
structures, instantiating them and their "children/descendants" from the
infoset in the JSON doc. To make that safe from misbehaving code, I added
concepts like "scopes" and "savepoints": the function you pass the reader
to cannot step out of the current object, and the calling code can recover
by "rewinding" the doc to before the misbehaving reader, skipping that
object, and trying the next one. Which means I also basically support
incremental parsing too, even though I don't have an API for it, as is
obvious from above.
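
One way such "scopes" and "savepoints" could work is sketched below. This is
a guess at the idea only, under the assumption that the document has already
been reduced to a vector of events; EventReader, Scope, Savepoint,
greedy_decoder and decode_all are all hypothetical names, not the real
implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class Ev { ObjectBegin, ObjectEnd, Value, End };

class EventReader {
public:
    struct Savepoint { std::size_t pos; std::size_t depth; };

    explicit EventReader(std::vector<Ev> events) : events_(std::move(events)) {}

    // Returns the next event, or Ev::End when the input is exhausted or
    // when consuming the event would step out of the active scope.
    Ev next() {
        if (pos_ >= events_.size()) return Ev::End;
        const Ev e = events_[pos_];
        if (e == Ev::ObjectEnd && depth_ <= floor_) return Ev::End;
        ++pos_;
        if (e == Ev::ObjectBegin) ++depth_;
        else if (e == Ev::ObjectEnd) --depth_;
        return e;
    }

    // Skips one complete value: a scalar, or an object and all it contains.
    void skip_next() {
        const std::size_t start = depth_;
        do {
            if (next() == Ev::End) return;
        } while (depth_ > start);
    }

    Savepoint save() const { return Savepoint{pos_, depth_}; }

    void rewind(const Savepoint& sp) {
        assert(sp.pos <= pos_);  // savepoints only move the reader backwards
        pos_ = sp.pos;
        depth_ = sp.depth;
    }

    // RAII guard: while alive, next() refuses to close the current object.
    class Scope {
    public:
        explicit Scope(EventReader& r) : r_(r), saved_(r.floor_) {
            r.floor_ = r.depth_;
        }
        ~Scope() { r_.floor_ = saved_; }
        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;
    private:
        EventReader& r_;
        std::size_t saved_;
    };

private:
    std::vector<Ev> events_;
    std::size_t pos_ = 0;
    std::size_t depth_ = 0;
    std::size_t floor_ = 0;  // depth at which ObjectEnd is off limits
};

// A misbehaving decoder: it reads greedily and never checks where its
// object ends. Returns how many events it managed to consume.
std::size_t greedy_decoder(EventReader& r) {
    std::size_t n = 0;
    while (r.next() != Ev::End) ++n;
    return n;
}

// Hands each top-level object to the decoder inside a scope, then rewinds
// to the savepoint and skips the object so the next one starts cleanly.
std::vector<std::size_t> decode_all(EventReader& r) {
    std::vector<std::size_t> seen;
    for (;;) {
        const EventReader::Savepoint before = r.save();
        if (r.next() != Ev::ObjectBegin) break;
        {
            EventReader::Scope scope(r);
            seen.push_back(greedy_decoder(r));
        }
        r.rewind(before);
        r.skip_next();
    }
    return seen;
}
```

Inside a scope the greedy decoder sees only the contents of its own object;
without one it would run through to the end of the event stream.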

Many parsers also have safeguards and "limits", such as a maximum depth
for the stack or a maximum size allowed for strings; here these are
configured via the JSONParserOptions struct.
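
As a rough illustration of such limits, here is a minimal sketch of an
options struct and a pre-scan that enforces it. ToyLimits, LimitResult and
check_limits are hypothetical names with arbitrary default values; the
actual fields of JSONParserOptions are not shown in this message. The scan
only tracks nesting and string lengths and does not validate JSON.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical limits; the defaults are arbitrary.
struct ToyLimits {
    std::size_t max_depth = 64;            // nested objects/arrays allowed
    std::size_t max_string_length = 1024;  // units allowed per string
};

enum class LimitResult { Ok, TooDeep, StringTooLong, Malformed };

// Scans the text once, tracking nesting depth and string lengths.
LimitResult check_limits(const std::string& json,
                         const ToyLimits& limits = ToyLimits()) {
    std::size_t depth = 0;
    for (std::size_t i = 0; i < json.size(); ++i) {
        const char c = json[i];
        if (c == '{' || c == '[') {
            if (++depth > limits.max_depth) return LimitResult::TooDeep;
        } else if (c == '}' || c == ']') {
            if (depth == 0) return LimitResult::Malformed;
            --depth;
        } else if (c == '"') {
            // Brackets inside strings must not affect the depth count.
            std::size_t len = 0;
            for (++i; i < json.size() && json[i] != '"'; ++i) {
                if (json[i] == '\\') ++i;  // a backslash pair counts once
                if (++len > limits.max_string_length)
                    return LimitResult::StringTooLong;
            }
            if (i >= json.size()) return LimitResult::Malformed;
            assert(json[i] == '"');  // i now sits on the closing quote
        }
    }
    return depth == 0 ? LimitResult::Ok : LimitResult::Malformed;
}
```

A real parser would apply the same checks while tokenizing rather than in a
separate pass, so that hostile input is rejected before any allocation.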

Anyway, I'm just showing this to illustrate the differences between
parsers. There are much better and faster parsers than mine. I learned a
lot building them, though; it was fun. Mine is comparable to nlohmann in
terms of performance, i.e. not that fast :).
--DD

