
From: Ruben Perez (rubenperez038_at_[hidden])
Date: 2022-05-12 18:50:22


On Wed, 11 May 2022 at 20:23, Phil Endecott via Boost
<boost_at_[hidden]> wrote:
>
> Here is my review of Ruben Perez's proposed MySql library.

Hi Phil, thank you for taking the time to write a review.

>
> Background
> ----------
>
> I have previously implemented C++ wrappers for PostgreSQL and
> SQLite, so I have some experience of what an SQL API can look
> like. I know little about ASIO.
>
> I have also recently used the AWS SDKs for C++ and Javascript to
> talk to DynamoDB; this has async functionality, which is interesting
> to compare.
>
> I confess some minor disappointment that MySql, rather than
> PostgreSQL or SQLite, is the subject of this first Boost database
> library review, since those others have liberal licences that
> are closer to Boost's own licence than MySql (and MariaDB). But
> I don't think that should be a factor in the review.
>
>
> Trying the library
> ------------------
>
> I have tried using the library with
>
> - g++ 10.2.1, Arm64, Debian Linux
> - ASIO from Boost 1.74 (Debian packages)
> - Amazon Aurora MySql-compatible edition
>
> I've written a handful of simple test programs. Everything works
> as expected. Compilation times are a bit slow but not terrible.
>
>
>
> The remainder of this review approximately follows the structure
> of the library documentation.
>
>
> Introduction
> ------------
>
> I note that "Ease of use" is claimed as the first design goal,
> which is good.

I think I failed to make the scope of the library clear enough in this
respect. The library is meant to be fairly low level and close to the
protocol, not an ORM. I list ease of use here in the sense that

* I have tried to abstract away as many of the oddities of the protocol
as possible (e.g. the text and binary encodings).
* The library takes care of SSL as part of the handshake, rather than
requiring the user to handle it.
* The library provides helper connect() and close() functions
to make things easier.
* The object model is as semantic as I have been able to achieve,
vs. having a connection object and standalone functions.
* The value class offers features such as conversions to make some
use cases simpler.

I guess I listed that point in comparison
to Beast or Asio, which are even lower level. Apologies if it caused
confusion.

>
> I feel that some mention should be made of the existing C / C++
> APIs and their deficiencies. You should also indicate whether or
> not the network protocol you are using to communicate with the
> server is a "public" interface with some sort of stability
> guarantee. (I guess maybe it is, if it is common to MySql and
> MariaDB.)

Updated https://github.com/anarthal/mysql/issues/50
to track a comparison with other APIs.

The network protocol is public and documented
(although the documentation is pretty poor). It's indeed
a pretty old protocol that is not being extended right now,
and it's widely used by a lot of clients today, so there is
very little risk there.

>
>
> Tutorial
> --------
>
> The code fragments should start with the necessary #includes,
> OR you should prominently link to the complete tutorial source
> code at the start.

Raised https://github.com/anarthal/mysql/issues/71
to track it.

>
> You say that "this tutorial assumes you have a basic familiarity
> with Boost.Asio". I think that's unfortunate. It should be
> possible for someone to use much of the library's functionality
> knowing almost nothing about ASIO. Remember your design goal of
> ease-of-use. In fact, it IS possible to follow the tutorial with
> almost no knowledge of ASIO because I have just done so.

To really take advantage of the library (i.e. use
the asynchronous API), you will need some Asio familiarity.
I'd say a very basic understanding is enough (i.e. knowing
what an io_context is). If you think this comment is misleading,
I can remove it. But I don't think this is the right place to
provide a basic Asio tutorial.

>
> You have this boilerplate at the start of the tutorial:
>
> boost::asio::io_context ctx;
> boost::asio::ssl::context ssl_ctx (boost::asio::ssl::context::tls_client);
> boost::mysql::tcp_ssl_connection conn (ctx.get_executor(), ssl_ctx);
> boost::asio::ip::tcp::resolver resolver (ctx.get_executor());
> auto endpoints = resolver.resolve(argv[3], boost::mysql::default_port_string);
> boost::mysql::connection_params params (
> argv[1], // username
> argv[2] // password
> );
> conn.connect(*endpoints.begin(), params);
> // I guess that should really be doing something more
> // intelligent than just trying the first endpoint, right?

The way to go here is to provide an extra overload for
connection::connect. Raised https://github.com/anarthal/mysql/issues/72
to track it.
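
For illustration, the new overload could look roughly like this (just a
sketch; the exact signature in the issue may end up different):

// Hypothetical overload: accept the whole resolver result set and try
// each endpoint in turn, similar to what boost::asio::connect does.
template <class EndpointRange>
void connect(const EndpointRange& endpoints, const connection_params& params);

// Usage would then be:
auto endpoints = resolver.resolve(argv[3], boost::mysql::default_port_string);
conn.connect(endpoints, params); // tries endpoints until one succeeds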

>
> I would like to see a convenience function that hides all of that:
>
> auto conn = boost::mysql::make_connection( ...params... );
>
> I guess this will need to manage a global, private, ctx object
> or something.

If you take a look at any other Asio-based program, the user is
always in charge of creating the io_context, and usually in charge of
creating the SSL context, too. If you take a look at this Boost.Beast
example, you will see similar code:
https://www.boost.org/doc/libs/1_79_0/libs/beast/example/http/client/sync-ssl/http_client_sync_ssl.cpp

I'm not keen on creating a function that both resolves the hostname
and establishes the connection, as I think it encourages doing more
name resolution than really required (you usually have one server
but multiple connections). I may be wrong, though, so I'd like to
know what the rest of the community thinks about this.

> .port = 3306, // why is that a string in yours?

It is not "mine", it's just how Asio works. Please have a look at
https://www.boost.org/doc/libs/1_79_0/doc/html/boost_asio/reference/ip__basic_resolver/resolve.html
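
In other words, that string is Asio's "service" parameter, which can be
a numeric port or a service name. A minimal sketch (host name made up):

// Both forms are valid with Asio's resolver; default_port_string is
// just the library's convenience constant for "3306".
auto eps1 = resolver.resolve("db.example.com", "3306");
auto eps2 = resolver.resolve("db.example.com",
                             boost::mysql::default_port_string);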

> make_connection("mysql://admin:12345_at_hostname:3306/dbname");

I guess you're suggesting that make_connection also perform
the name resolution, the physical connect and the MySQL handshake?

I'm not against this kind of URL-based way of specifying parameters.
I've used it extensively in other languages. May be worth
reconsidering it once Vinnie's Boost.Url gets accepted.

>
> Now... why the heck does your connection_params struct use
> string_views? That ought to be a Regular Type, with Value
> Semantics, using std::strings. Is this the cult of not using
> strings because "avoid copying above all else"?

I may have been a little too enthusiastic about optimization here.

>
> Another point about the connection parameters: you should
> provide a way to supply credentials without embedding them
> in the source code. You should aim to make the secure option
> the default and the simplest to use. I suggest that you
> support the ~/.my.cnf and /etc/my.cnf files and read passwords
> etc. from there, by default. You might also support getting
> credentials from environment variables or by parsing the
> command line. You could even have it prompt for a password.

I don't know of any database access library that does this.
The official Python connector takes the password as a string.
I think this is mixing concerns: having the password passed as
a string has nothing to do with having it embedded in the source code.
Just use std::getenv, std::cin or whatever mechanism your application
needs to get a string, then pass it to the library.
All the examples read the password from argv.
Additionally, keeping passwords in plain text files like
~/.my.cnf and /etc/my.cnf is considered bad security practice,
so I wouldn't encourage it.
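
A minimal sketch of what I mean, using environment variables (the
variable names are made up for the example):

#include <cstdlib>

// Read credentials at runtime instead of hard-coding them.
const char* user = std::getenv("MYSQL_USER");
const char* password = std::getenv("MYSQL_PASSWORD");
if (!user || !password) {
    // handle missing credentials (abort, prompt, etc.)
}
boost::mysql::connection_params params (user, password);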

>
> Does MySQL support authentication using SSL client certs?
> I try to use this for PostgreSQL when I can. If it does, you
> should try to support that too.

AFAIK you can make the server validate the client's certificate
(that doesn't require extra library support), but you still have to
pass a password.

>
> About two thirds of the way through the tutorial, it goes from
> "Hello World" to retrieving "employees". Please finish the hello
> world example with code that gets the "Hello World" string
> from the results and prints it.

My bad, that's a naming mistake - the variable should be named
hello_resultset instead. The code does the right thing with the
wrong variable name.
Updated https://github.com/anarthal/mysql/issues/71

>
>
> Queries
> -------
>
> I encourage you to present prepared queries first in the
> documentation and to use them almost exclusively in the tutorial
> and examples.

It can definitely make sense.

>
> You say that "client side query composition is not available".
> What do you mean by "query composition"? I think you mean
> concatenating strings together'); drop table users; -- to
> form queries, right? Is that standard MySql terminology? I
> suggest that you replace the term with something like
> "dangerous string concatenation".

Yes, I mean that.

>
> In any case, that functionality *is* available, isn't it!
> It's trivial to concatenate strings and pass them to your
> text query functions. You're not doing anything to block that.
> So what you're really saying is that you have not provided any
> features to help users do this *safely*. I think that's a serious
> omission. It would not be difficult for you to provide an
> escape_for_mysql_quoted_string() function, rather than having
> every user roll their own slightly broken version.

Definitely not trivial (please have a look at the MySQL source code),
but surely beneficial - see below.
Tracked by https://github.com/anarthal/mysql/issues/69.

>
> IIRC, in PostgreSQL you can only use prepared statements
> for SELECT, UPDATE, INSERT and DELETE statements; if you
> want to do something like
>
> ALTER TABLE a ALTER COLUMN c SET DEFAULT = ?
> or CREATE VIEW v as SELECT * FROM t WHERE c = ?

You are right, these cases aren't covered by prepared statements.
https://github.com/anarthal/mysql/issues/69 tracks it.
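
For those statements the fallback is a text query, which is exactly
where an escaping helper would pay off (sketch, assuming the text
query interface used elsewhere in the docs):

// Executed as a plain text query; any user-supplied fragment would
// need manual escaping today, hence issue 69.
conn.query("ALTER TABLE a ALTER COLUMN c SET DEFAULT 42");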

> tcp_ssl_prepared_statement is verbose. Why does the prepared
> statement type depend on the underlying connection type?

Because it implements I/O operations (execute() and close()),
which means that it needs access to the connection object,
thus becoming a proxy object.

> I have to change it if I change the connection type?! If that's
> unavoidable, I suggest putting a type alias in the connection
> type:
>
> connection_t conn = ....;
> connection_t::prepared_statement stmt(.....);

Raised https://github.com/anarthal/mysql/issues/73

>
> Does MySql allow numbered or named parameters? SQLite supports
> ?1 and :name; I think PostgreSQL uses $n. Queries with lots of
> parameters are error-prone if you just have ?. If MySql does
> support this, it would be good to see it used in some of the
> examples.

Not AFAIK, just regular positional placeholders.

>
> Invoking the prepared statement seems unnecessarily verbose.
> Why can't I just write
>
> auto result = my_query("hello", "world", 42);

Because this invokes a network operation. By Asio convention,
you need a pair of sync functions (error codes and exceptions)
and at least one async function, named the same as the sync
functions but with the "async_" prefix.

I'm not against this kind of signature, building on top of what
is already there:

statement.execute("hello", "world", 42);
statement.async_execute("hello", "world", 42, use_future);

which saves you a function call. Raised
https://github.com/anarthal/mysql/issues/74

>
> I also added Query variants where the result is expected to be
>
> - A single value, e.g. a SELECT COUNT(*) statement.
> - Empty, e.g. INSERT or DELETE.
> - A single row.
> - Zero or one rows .
> - A single column.

I think this can be useful. I've updated
https://github.com/anarthal/mysql/issues/22
to track this.
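
To make the idea concrete, such helpers might look something like this
(names are invented for illustration; nothing like this exists in the
library today):

std::int64_t id = 42;
// One helper per expected result shape:
std::uint64_t count = query_single_value<std::uint64_t>(
    conn, "SELECT COUNT(*) FROM employee");
std::optional<boost::mysql::row> r = query_zero_or_one_row(
    conn, "SELECT * FROM employee WHERE id = ?", id);
query_no_result(conn, "DELETE FROM employee WHERE id = ?", id);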

> I don't see anything about the result of an INSERT or UPDATE.
> PostgreSQL tells me the number of rows affected, which I have
> found useful to return to the application.

Please have a look at
https://anarthal.github.io/mysql/mysql/resultsets.html#mysql.resultsets.complete

>
>
> resultset, row and value
> ------------------------
>
> I'm not enthusiastic about the behaviour nor the names of these
> types:

Resultset is what MySQL calls it; it's not my choice.

>
> - resultset is not a set of results. It's more like a sequence of
> rows. But more importantly, it's lazy; it's something like an
> input stream, or an input range. So why not actually make it
> an input range, i.e. make it "a model of the input_range concept".
> Then we could write things like:
>
> auto result = ...execute query...
> for (auto&& row: result) {
> ...
> }

How does this translate to the async world?
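
To illustrate the difficulty (spellings here are made up, just to show
the question):

// Synchronous input range: every increment hides a blocking read.
for (const auto& r : result)
    handle(r);

// An asynchronous counterpart has no obvious spelling outside of
// coroutines; perhaps something like (async_read_one is a made-up name):
//
//   while (auto r = co_await result.async_read_one(boost::asio::use_awaitable))
//       handle(*r);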

>
> - row: not bad, it does actually represent a row; it's a shame
> it's not a regular type though.
>
> - value: it's not a value! It doesn't have value semantics!

If the library gets rejected I will likely make values
owning (regular).

> I'm also uncertain that a variant for the individual values
> is the right solution here. All the values in a column should
> have the same type, right? (Though some can be null.) So I
> would make row a tuple. Rather than querying individual values
> for their type, have users query the column.

Are you talking about something like this?
https://github.com/anarthal/mysql/issues/60

>
> It seems odd that MySQL small integers all map to C++ 64-bit
> types.

It is done like this to prevent the variant from having
too many alternatives - I don't think the extra alternatives would
add much value for the user. If I implement something like
https://github.com/anarthal/mysql/issues/60,
each integer type will be mapped to its exact C++ type.

>
> I use NUMERIC quite a lot in PostgreSQL; I don't know if the
> MySql type is similar. I would find it inconvenient that it
> is treated as a string. Can it not be converted to a numeric
> type, if that is what the user wants?

MySQL treats NUMERIC and DECIMAL the same, as exact
numeric types. What C++ type would you put this into?
float and double are not exact, so they're not a good fit.

> I seem to get an assertion if I fail to read the resultset
> (resultset.hpp:70). Could this throw instead?

That assertion is unrelated to unread results. It just checks that
the resultset has a valid connection behind it and is not
a default-constructed (invalid) resultset.

>
> Or, should the library read and discard the unread results
> in this case?

Looking at the Python implementation, it provides
an option to do that. I think we can do a better job here.
Tracked by https://github.com/anarthal/mysql/issues/14

>
> But the lack of protocol support for multiple in-flight queries
> immediately becomes apparent. It almost makes me question
> the value of the library - what's the point of the async
> support, if we then have to serialise the queries?

As I pointed out in another email, it's a combination of
lack of protocol support and lack of library support.
Apologies if the documentation is not clear on this point.
I think there is value in it, though: you don't need to create
5000 threads to manage 5000 connections. The fact that the
official MySQL client has added a "nonblocking" mode
seems like a good argument.

>
> Should the library provide this serialisation for us? I.e.
> if I async_execute a query while another is in progress, the
> library could wait for the first to complete before starting
> the second.

I would go for providing the bulk interface I talk about in other emails.

>
> Or, should the library provide a connection pool? (Does some
> other part of ASIO provide connection pool functionality that
> can be used here?)

Asio doesn't provide that AFAIK. It is definitely useful
functionality, tracked by https://github.com/anarthal/mysql/issues/19

>
>
> Transactions
> ------------
>
> I have found it useful to have a Transaction class:
>
> {
> Transaction t(conn); // Issues "BEGIN"
> .... run queries ....
> t.commit(); // Issues "COMMIT"
> } // t's dtor issues "ROLLBACK" if we have not committed.
>

Again, how would this work in the async world?
How does the destructor handle communication failures
when issuing the ROLLBACK?
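
To spell it out with the snippet above (names are hypothetical, and I'm
assuming the text query interface for BEGIN/COMMIT/ROLLBACK):

class transaction_guard
{
    boost::mysql::tcp_ssl_connection& conn_;
    bool committed_ = false;
public:
    explicit transaction_guard(boost::mysql::tcp_ssl_connection& conn)
        : conn_(conn) { conn_.query("START TRANSACTION"); }
    void commit() { conn_.query("COMMIT"); committed_ = true; }
    ~transaction_guard()
    {
        if (!committed_)
        {
            // Destructors should not throw, so a failed ROLLBACK can
            // only be swallowed here; and there is no way to await an
            // async operation in a destructor, so this has no direct
            // equivalent with the asynchronous API.
            try { conn_.query("ROLLBACK"); } catch (...) {}
        }
    }
};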

> Klemens Morgenstern makes the point that MySql is a trademark of
> Oracle. Calling this "Boost.MySql" doesn't look great to me.
> How can you write "The Boost MySql-compatible Database Library"
> more concisely?

I'm not very original at naming, as you may have already
noticed. Using Boost.Delfin was proposed at some point,
but Boost.Mysql definitely expresses its purpose better.

>
> Overall, I think this proposal needs a fair amount of API
> re-design and additional features to be accepted, and should
> be rejected at this time. It does seem to be a good start
> though!
>
>
> Thanks to Ruben for the submission.

Thank you for sharing your thoughts; I think there
is a lot of useful information here.

