
From: Ruben Perez (rubenperez038_at_[hidden])
Date: 2022-05-12 19:02:48


> Indeed. But even your imagined batch interface only works well for
> queries/selects,
> while for inserts (or updates), the client does not need to just send
> a small textual
> SQL query, but potentially a bunch of data for the rows too. A true
> pipeline allows
> sending the rows for a second insert while the first insert is being processed.

It would work for inserts too, as values are either part of the query string
or part of the statement execute packet; in both cases, they are part of the request.
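
To illustrate (a sketch of the wire format, not actual Boost.MySQL code):
a COM_QUERY request carries the command byte plus the complete SQL text,
values included, so pipelining inserts just means writing several such
packets back-to-back before reading any response. Table names and values
below are made up, and payloads of 16 MiB or more would need splitting,
which I've ignored:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Build a COM_QUERY request: 3-byte little-endian payload length,
    // 1-byte sequence id (0 for a fresh command), then the payload,
    // which is the command byte 0x03 followed by the SQL text.
    std::vector<std::uint8_t> make_com_query(const std::string& sql)
    {
        std::vector<std::uint8_t> packet;
        auto len = static_cast<std::uint32_t>(sql.size() + 1); // +1: command byte
        packet.push_back(static_cast<std::uint8_t>(len & 0xff));
        packet.push_back(static_cast<std::uint8_t>((len >> 8) & 0xff));
        packet.push_back(static_cast<std::uint8_t>((len >> 16) & 0xff));
        packet.push_back(0);    // sequence id
        packet.push_back(0x03); // COM_QUERY
        packet.insert(packet.end(), sql.begin(), sql.end());
        return packet;
    }

    // A "pipeline" is then just both requests concatenated and written
    // in one go; the two responses are read afterwards, in order.
    std::vector<std::uint8_t> make_insert_pipeline()
    {
        auto buf  = make_com_query("INSERT INTO t1 VALUES (1, 'abc')");
        auto req2 = make_com_query("INSERT INTO t2 VALUES (2, 'def')");
        buf.insert(buf.end(), req2.begin(), req2.end());
        return buf;
    }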

> Such a mode may be useful for schema creation OTOH. We have large schemas,
> with hundreds of tables, indexes, triggers, etc... Done from C++ code
> client-side,
> not via a DBA manually executing on the server using SQL files and the
> native CLI.
> For that use-case, the ability to send many DDLs in a single batch
> would definitely
> save on the round-trips to the server. We try hard to minimize roundtrips!

Thanks for sharing this use case. I definitely wasn't aware of it,
and it seems like a good argument for implementing multi-statement support.
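
For my own notes, this is roughly how the official libmysqlclient C API
exposes that today (a sketch; error handling elided, and the
CLIENT_MULTI_STATEMENTS flag must have been passed to mysql_real_connect()
for the server to accept the batch):

    #include <mysql/mysql.h>

    // Send a batch of DDL statements in one round-trip and drain the
    // per-statement results the server sends back.
    int run_ddl_batch(MYSQL* conn)
    {
        const char* batch =
            "CREATE TABLE t1 (id INT PRIMARY KEY);"
            "CREATE TABLE t2 (id INT PRIMARY KEY);"
            "CREATE INDEX i1 ON t1 (id);";

        if (mysql_query(conn, batch) != 0)
            return -1;

        // One result per statement; DDL usually yields no resultset.
        do {
            MYSQL_RES* res = mysql_store_result(conn);
            if (res)
                mysql_free_result(res);
        } while (mysql_next_result(conn) == 0);

        return 0;
    }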

> You don't need to necessarily use Google's protobufs library.
> There's https://github.com/mapbox/protozero for example, and similarly, a
> from-scratch implementation to just encode-decode a specific protocol
> can also be written.

I wasn't aware of this. Thanks.

> > The server sends several resultsets after that. I haven't focused a lot on
> > this because it sounded risky (in terms of security) for me.
>
> Not sure what's risky here. Maybe I'm missing something.

I was just imagining users concatenating queries, which may be a
misconception; yours is a legitimate use case.

> Note that PostgreSQL's COPY is not file-specific, and a first-class citizen
> at the protocol level, depending on a pure binary format (with text mode too).
> I use COPY with STDIN / STDOUT "files", i.e. I prepare memory buffers and
> send them; and read memory buffers, and decode them. No files involved.

Unfortunately, that does not seem to be the case for MySQL. You issue the
LOAD DATA statement via a regular query packet, and the server replies with
another packet containing the path of the file it wants. The client then
reads that file and sends its contents to the server in a further packet.
I've made it work with CSV files, and AFAIK that's the only format allowed.
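
That said, the libmysqlclient C API does let the client bypass the file
entirely by installing a local-infile handler that supplies the bytes from
memory, which is the closest MySQL analogue to COPY FROM STDIN I know of.
A sketch (callback, struct, and table names are mine; error handling is
minimal):

    #include <mysql/mysql.h>
    #include <algorithm>
    #include <cstdio>
    #include <cstring>
    #include <string>

    struct csv_buffer
    {
        std::string data;       // CSV rows prepared in memory
        std::size_t offset = 0;
    };

    static int infile_init(void** ptr, const char* /*filename*/, void* userdata)
    {
        *ptr = userdata; // hand our buffer to the read callback
        return 0;
    }

    static int infile_read(void* ptr, char* buf, unsigned int buf_len)
    {
        auto* src = static_cast<csv_buffer*>(ptr);
        std::size_t n =
            std::min<std::size_t>(buf_len, src->data.size() - src->offset);
        std::memcpy(buf, src->data.data() + src->offset, n);
        src->offset += n;
        return static_cast<int>(n); // 0 signals end-of-data
    }

    static void infile_end(void* /*ptr*/) {}

    static int infile_error(void* /*ptr*/, char* msg, unsigned int len)
    {
        std::snprintf(msg, len, "in-memory CSV read failed");
        return 1;
    }

    int bulk_load(MYSQL* conn, csv_buffer& rows)
    {
        mysql_set_local_infile_handler(
            conn, infile_init, infile_read, infile_end, infile_error, &rows);
        // The 'filename' below is never opened; the callbacks supply the bytes.
        return mysql_query(conn,
            "LOAD DATA LOCAL INFILE 'dummy' INTO TABLE t "
            "FIELDS TERMINATED BY ','");
    }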

> Maybe our use-case of very large data (with large blobs), *and* very
> numerous smaller data,
> that's often loaded en-masse, by scientific desktop applications, and
> increasingly mid-tier services
> for web-apps, is different from the more common ecommerce web-app
> use-case of many people.

It is good to hear about use cases other than the regular web server
we all have in mind; it will help me during further development.

>
> When I evaluate DB performance, I tend to concentrate on "IO" performance,
> in terms of throughput and latency, independent of the speed of the
> SQL engine itself.
> There's nothing I can do about the latter, while the way one uses the Client API
> (or underlying protocol features) is under my control. So I do mostly
> INSERTs and
> SELECTs, with and without WHERE clauses (no complex JOINs, CTEs, etc...).
>
> Because we are in the scientific space, we care about both many small rows
> (millions, of a few bytes to KBs each at most), and a few (hundreds / thousands)
> much larger rows with blobs (with MBs to GBs sizes). The truly large
> "blobs" (files) are
> kept outside the DB, since mostly read only (in the GBs to TBs sizes each, that
> can accumulate to 2.6 PB for all a client's data I heard just
> yesterday for example).
>
> I'll also compare inserting rows 1-by-1, with and without prepared statements,
> to inserting multi-rows per-statement (10, 100, 1000 at a time), to the "bulk"
> interface (COPY for PostgreSQL, LOCAL for MySQL, Direct-Path load in OCI).
> For example with SQLite:
> https://sqlite.org/forum/forumpost/baf9c444d9a38ca6e59452c1c568044aaad50bbaadfff113492f7199c53ecfed
> (SQLite has no "bulk" interface, doesn't need one, since "embedded"
> thus "zero-latency")
>
> For PostgreSQL, we also compared text vs binary modes (for binds and
> resultsets).
>
> For blobs, throughput of reading and writing large blobs, whole and in-part,
> with different access patterns (like continuous ranges, or scattered inside).
>
> A very important use-case for us, for minimizing round-trips, is how
> to load a subset of rows,
> given a list of primary keys (typically a surrogate key, like an
> integer or a uuid). For that,
> we bind a single array-typed value for the WHERE clause's placeholder
> to the query,
> and read the resultset, selecting 1%, 5%, 10%, etc... of the rows from
> the whole table.
> (selecting each row individually, even with a prepared query, is terribly slow).

Thanks for sharing. It will definitely help me during benchmarking.
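
As a concrete note for those benchmarks, I believe the array-bind pattern
you describe looks roughly like this with libpq and text-format parameters
(a sketch; table and column names are invented):

    #include <libpq-fe.h>
    #include <cstdio>

    // One round-trip fetches every row whose primary key is in the list.
    void select_by_keys(PGconn* conn)
    {
        // Text representation of an int4[] parameter.
        const char* keys = "{1,5,42,1000}";
        const char* params[] = {keys};

        PGresult* res = PQexecParams(
            conn,
            "SELECT id, payload FROM samples WHERE id = ANY($1::int4[])",
            1,                // one parameter
            nullptr,          // server infers the type from the cast
            params,
            nullptr, nullptr, // text format: no lengths or format flags
            0);               // text-format results

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            for (int i = 0; i < PQntuples(res); ++i)
                std::printf("id=%s\n", PQgetvalue(res, i, 0));
        PQclear(res);
    }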

Regards,
Ruben.

