Boost logo

Boost :

From: Ruben Perez (rubenperez038_at_[hidden])
Date: 2022-05-15 13:49:21


 .

On Fri, 13 May 2022 at 04:25, Alan de Freitas via Boost
<boost_at_[hidden]> wrote:
>
> Hi! Here's my review.

Hi Alan, thank you very much for taking your time to write your review.

>
> Boost.MySql is an impressive library. I would like to thank Rubén for that.
> It implements the complete protocol from the ground up with async
> functionalities and certainly involved a lot of work. The documentation is
> extensive, goal-oriented, and helpful. The async operations support
> cancellation and the examples with coroutines are beautiful. During this
> review, I was impressed at how easy it works and I've noticed a few points
> in the API that have also been brought up in previous reviews and could be
> improved. Some other issues could also be better highlighted in the
> documentation, which would avoid many problems. The overall recommendation
> in this review is acceptance conditional on some fixes.

Thanks a lot.

>
> Others have questioned the benefit of the library when compared to sqlpp11
> or any wrapper around the C API. The main difference is other libraries are
> high-level but this is a discussion still worth having from the point of
> view of users. I was thinking about the transition cost from Boost.MySQL to
> any other SQL database, since many applications have the
> requirement/necessity of allowing different SQL databases. In sqlpp11, the
> user can just change the backend. The user could use Boost.MySQL as an
> sqlpp11 backend and that would have the same effect. However, I think Rubén
> mentioned this is not possible at some point. I'm not sure whether this is
> just for the async functionality. In the same line, I wonder if a library
> for Postgres or Sqlite would be possible with a similar API, which could
> also solve the problem, although I'm not sure anyone would be willing to
> implement that. If we did, we could have the convenience of sqlpp11 and the
> ASIO async functionalities of Boost.Mysql for other DBs.

It is indeed possible. My point was that sqlpp11 does not provide an async
interface. You can surely write a sqlpp11 backend for MySQL using
the proposed Boost.MySQL library. I haven't done it, but having a quick
look at the interface, it should just work. However, you won't get the most
of Boost.MySQL, because you will be using just sync code.

It is possible to write a Postgres connector with a similar approach.
I indeed wrote a POC available here:
https://github.com/anarthal/postgres-asio
The protocols are different, though, so if you follow my approach
of providing primitives easy to use, yet close to the protocol,
the APIs will be a little different. The Postgres protocol allows
finer grain control over statement execution, AFAIR.

I don't know much about SQLite - but being just a library
and not a protocol, the connector would have to wrap libsqlite.
I can't tell you whether exposing an Asio interface for it
is possible or not.

>
> The library really provides ease-of-use, when we consider what it provides
> and how low-level it is. However, unlike in other libraries like
> Boost.Beast, Boost.MySql users might not be sold into the Asio way of doing
> things. Applications that require access to databases might be making
> sparse database requests where the Asio asynchronous model is not as
> useful. Highlighting these differences in the docs is important. Asio takes
> some time to learn, and I guess for a user not used to Asio, already
> understanding Asio does not sound like the ease of use. The docs could
> focus on the protocol before moving on to the asynchronous functionality.

What actions do you think we can take here? Adding a section in the intro
(https://anarthal.github.io/mysql/mysql/intro.html) about "knowing Asio
is a requirement to use this library"? Do you think an overview on the protocol
(handshake, query, metadata and row packets) would really be useful
for the user at that point?

>
> I'm also a little worried about the maintainability of the library and
> protocol changes and how this could impact boost as a whole. Should we
> really announce it as compatible with MariaDB? What direction would the
> library take if they diverge? How often does the protocol change or is
> extended? Is Ruben going to stick around if the protocol changes? How hard
> would it be for someone to understand the code and implement extensions?
> Can a user be sure it's always going to provide the same features and be as
> reliable as the C API? I don't have the answer to these questions, but it's
> something that got me wondering. I guess this kind of question is going to
> come up for any library that is related to a protocol.

The protocol it implements is the "client/server" protocol, which was in
place before the MariaDB fork. From what I know, this is a pretty old
protocol and remains pretty stable. MySQL is separately developing a
new X-protocol`(which this library does *NOT* implement). Docs here:
https://dev.mysql.com/doc/dev/mysql-server/latest/page_mysqlx_protocol.html
MariaDB does not implement it and does not plan to do it. I don't
really think MySQL is going to develop the client/server protocol a lot.
I wouldn't say they will deprecate it in the next years, either, as
there are millions of clients relying on the client/server protocol,
and there aren't many connectors for the X protocol yet (no C
client AFAIK). Does that answer part of your questions?

>
> I don't know if the name "MySql" can be used for the library, as it belongs
> to Oracle. I'm not saying it can't. I'm really saying I don't know. I'm not
> a lawyer and I don't understand the implications here. But this should be
> considered, investigated, and evidence should be provided. The library is
> also compatible with MariaDB and the name "MySql" might not reflect that.
> Maybe there's a small probability it might be compatible with some other
> similar DB protocol derived from MySql in the future?

I'm as stuck as you here. What would be the best path here?
For now, I've tried to contact a former user in the Boost ML who works
at Oracle, to see whether he can provide me some guidance here.
Any ideas are welcome.

> - No two asynchronous MySql query reads can happen concurrently. While this
> still has value among other Asio operations, like a server that needs the
> DB eventually, the user needs to be careful about that. Maybe it would be
> safer if all MySql operations were on some special kind of strand. Or maybe
> the library could provide some "mysql::transaction_strand" functionality to
> help ensure this invariant for individual queries in the future.
> - A second implication is that some applications might find the
> asynchronous functionalities in Boost.Mysql not as useful as asynchronous
> functionalities in other protocols, like the ones in Boost.Beast. This
> depends on how their applications are structured. Since this is the main
> advantage over the C API, these users may question the value of the library
> and the documentation should discuss this more explicitly.
> - These implications could become irrelevant if the library provides some
> kind of functionality to enable a non-blocking mode. I have no idea how the
> MySql client achieves that.

At the end of the day, I'd say this is similar to HTTP (half-duplex).
The client sends a request and gets a response. I think the docs here
are not doing a great job, and that the interface may be a little limiting
(not allowing the possibility of pipelining). This is something brought up
in other reviews, too, and tracked by
https://github.com/anarthal/mysql/issues/76
and https://github.com/anarthal/mysql/issues/75.
Do you think such pipelining API would mitigate your concerns?

> I'm worried about the lack of possibility of reusing memory for the
> results, as the interface depends on vector. This is not the usual Asio
> pattern. These vectors look even weirder in the asynchronous callbacks:

This is also an issue for resultsets, and tracked by
https://github.com/anarthal/mysql/issues/59.

>
> - People have their containers/buffers and I would assume reading into some
> kind of existing row buffer would be the default interface, as is the case
> with other Asio read functions. In other words, read_many and read_all
> should work more like read_one.
> - Not returning these vectors is the common pattern in Asio: the initiating
> function receives a buffer for storage and the callback returns how many
> elements were read. Note that the buffer size already delimits how many
> elements we should read.
> - If we return vectors, async operations would need to instantiate the
> vector with the custom allocator for the operation. The callback wouldn't
> use std::vector<T> then. I would be std::vector<T,
> very_long_allocator_name> for every query callback, which is inconvenient.
> Using only std::vector<T> in the callback is not only inconvenient but
> dangerous because it would be the wrong allocator and everything would need
> to be copied.

Brought up by another user during review, tracked by
https://github.com/anarthal/mysql/issues/58. I think the best thing
here is just passing the vectors by lvalue reference.

> - One problem I see without the vectors is that each row would still need
> to allocate memory internally because of their dynamic size. I don't know
> the best solution for that second problem unless we have some sort of
> static size row/tuple, which does make sense since the user should know the
> number of columns at compile time.

Brought up by everyone, so I will give this high priority. Tracked by
https://github.com/anarthal/mysql/issues/60.

>
> The library provides read_one, read_all, and read_many, but no "read_some".
> Besides the implication of a buffer to the initiating function, I would
> imagine reading as much as possible and parsing that to be the usual best
> strategy in this kind of an operation since we have a stream and we don't
> know how much data is available to read yet. Unfortunately, read_one,
> read_all, and read_many don't allow that. Unless read_many can already read
> less than the max number of rows in intermediary calls, which I don't think
> is the intention. It might also be useful to replicate the naming
> conventions Asio uses to read some or read all of the data.

Not sure if I follow you. What's the interface you propose here?
The connection object already holds an internal buffer for requests.
The library currently performs boost::asio::read call per packet,
but I'm working in optimizing that to use read_some instead.
With that approach, read_one may get the row from the buffer
without performing any read at all. Is that what you mean?

>
> If "value" is intended to encapsulate the variant type,
> "value::to_variant" seems dangerous. The variant supports lots of types,
> which indicates the probability the internal design might need to change to
> fit new user demands is high. This could also need to change according to
> protocol extensions, and changing any of these types would break the API.
> If "value" already represents what it needs to represent,
> "value::to_variant" may not need to be exposed unless there's a lot of user
> demand for that.

I exposed it to avoid having to implement a visit function. Having
such a visit function would have the same problems. I think
visiting a variant-like is one of the most common things. It make
sense not to expose the function if we end up with an alternate
tuple-like interface, though.

>
> There is also an existential value/reference problem with "value":
>
> - The internal variant type includes a string_view whose string is owned by
> the row. This seems kind of dangerous or at least misleading. We have a
> reference type that's called "value", while this is more like a const
> "field view" or something.
> - At the same time, it's not a view either because I assume this is
> mutable, and updating numbers would not update the row.
> - The other values in the variant are value types, which makes "value" both
> a value and a reference type depending on the column type.
> - A reference type seems appropriate here because we don't want to copy all
> the data when iterating the results, but we need some other pattern for
> "value". A const view is especially useful because we might be able to keep
> value types for numbers, etc... Encapsulating the variant is also helpful
> here.
> - Calling a type "value" is a bad indication anyway. Any value is a "value".

If we go down the compile-time parsing strategy, together with the buffering
I described above, making it owning may be the best option. Any naming
ideas? field_value? field? sql_value?

>
> As others have mentioned, "value::get_std_optional" is unusual. It may not
> look problematic at first but it has a number of problems:
>

First of all, I'd like to know your thoughts about the conversion
functionality in value
(https://anarthal.github.io/mysql/mysql/values.html#mysql.values.conversions).
Do you think it's helpful for the user? If it wasn't because of conversions,
we could replace value::get_optional and value::get_std_optional for
something similar to variant2::get_if, returning a pointer to the
underlying value,
avoiding the problem altogether. This could make even more sense if we make
the value owning, as value::get_optional<std::string> would end up copying
the underlying string, which doesn't make a lot of sense.

> If I understood the documentation correctly, it's also problematic to use
> an empty optional for both a null value and an invalid result. They are
> simply not the same thing. Libraries always differentiate an invalid state
> from null in a field. This is especially relevant because the documentation
> warns us against using the "is_convertible_to" pattern over "get_optional"
> because "is_convertible_to" is unnecessarily slow. By the way, it would be
> nice to explain why "// WARNING!! Inefficient, do NOT do" is true. I
> wouldn't expect that to cost more than creating the optional value.

The distinction can be made with something like:

value v = /* get the value */
if (v.is_null()) {
    // handle the NULL case
} else {
    double d = v.get<double>(); // will throw on type mismatch
}

So maybe what we're lacking is something in the documentation
to address how to handle NULL values effectively. I've raised
https://github.com/anarthal/mysql/issues/77
to address this issue.

Calling value::is_convertible_to, then value::get will perform the conversion
twice, while value::get_optional will do it just once. It isn't
horribly inefficient,
though - maybe this section is causing more confusion than aid.

Considering all this, I'd make the following to the value class:
1) Make it owning.
2) Remove conversions.
3) Remove get_optional and get_std_optional, and provide a get_if
function returning a pointer to the internal value, as variant2::get_if does.
4) Make value::get return a reference instead of a value.
5) Remove value::to_variant

I've summarized these points in https://github.com/anarthal/mysql/issues/85.

>
> The documentation states that the behavior of relational operators may
> change in the future. What do other libraries do about these comparisons?
> We probably don't want something like PHP, where "20" == 20 so that we need
> some kind of "20" === 20 in the future, but the documentation implies we
> also don't want "value(std::int64_t(200)) > value(std::uint64_t(200))" to
> fail. The library makes the whole thing particularly tricky with number
> types because not checking the proper number type in the variant would
> return invalid/null even when the number could be stored in a variable of
> another number type perfectly fine, which is even worse when both invalid
> and null are presented by the same empty optional.

Thanks to conversions, value(int64_t(200)).get_optional<uint64_t>() succeeds
and returns a non-empty optional. The main use I see for inequality operators
is being able to store values in containers such as std::set, so I don't think
we should change the behavior in the future. Raised
https://github.com/anarthal/mysql/issues/80.

>
> About the "resultset":
>
> - If "resultset" is a table, the "set" part of the name might be a little
> misleading in C++ as the results are in a deterministic sequence, even
> though this is what MySQL may call a similar concept. However, I imagine
> there's no equivalent concept in MySql for this specific
> Asio-async-lazy-fetching functionality itself.
> - This class is more like a "table", "frame", "vector", "list", or just
> "result". However, after all the steps of "connect", "prepare", and
> "execute", the name "result" gives the impression that the results are
> finally there, but this is just another stream object like a "result
> fetcher" or something.

I'm thinking on calling it query_operation or query_result.

> - Since the protocol is strictly sequential, it might be useful to not even
> have a resultset class at all. The connection could send the query and have
> some read function, like the usual read/write functions in Asio can operate
> on the same stream.

Resultsets also hold whether they're complete or not, and some extra
info like affected_rows and warning_count, which become available once
the resultset has been completely read (or from the beginning, if the
query was an INSERT, for example). Docs:
https://anarthal.github.io/mysql/mysql/resultsets.html#mysql.resultsets.complete
If we implement pipeline mode, it will have to hold some state specific
to the operation (i.e. a sequence number that the protocol dictates - although
that's internal). It is also possible to make resultset (or whatever name we
come up with) not an I/O object, and move the read functions to the connection.

>
> Minor issues:
>
> - Maybe it's not clear from the examples, but why would the user need
> connection::handshake when connection::connection already does it?

If your Stream object does not satisfy the SocketStream requirements
(https://anarthal.github.io/mysql/mysql/ref/boost__mysql__SocketStream.html),
you won't be able to use connection::connect or connection::close.
This would be the case for Windows named pipes.
Docs here:
https://anarthal.github.io/mysql/mysql/other_streams.html#mysql.other_streams.connection

> - Shouldn't `tcp_ssl_connection::connect` accept an EndpointSequence to be
> more like other Asio functions.

I think it should accept both. Tracked by
https://github.com/anarthal/mysql/issues/72

> - The names "tcp_ssl_prepared_statement", "tcp_ssl_<component>" are quite
> repetitive and verbose in the documentation. Maybe "tcp_ssl" could become a
> namespace? Isn't that what Asio does for the ssl related objects? Some
> names might be reduced if there's no alternative. For instance,
> "prepared_statement" might become "statement" since there's no
> "unprepared_statement".

Would having type aliases under connection (i.e. connection::resultset_type)
alleviate your concerns?

> - Wouldn't some form of `statement::execute` with variadic args (different
> name perhaps) be a simpler alternative to
> `statement::execute(boost::mysql::make\_values(...))`. Or maybe
> statement::operator(), as Phil suggested. Or some
> `statement::execute_with(...)`. In the variadic args, the optional err/info
> can go into some specified positions, since they're not ambiguous with the
> field variant types. This is a little tricky but definitely useful,
> especially if suggestions of customization points are implemented in the
> future.

Definitely. https://github.com/anarthal/mysql/issues/74 tracks it.
How would variadic operator() work with async functions? The convention
is having the _async suffix, but that's not possible with an operator overload.

>
> ## Implementation
>
> > What is your evaluation of the implementation?
>
> I haven't analyzed the implementation very profoundly. I skimmed through
> the source code and couldn't find anything problematic. It would be useful
> if the experts could inspect the Asio composed ops more deeply.
>
> CMakeLists.txt:
>
> - I believe the CMakeLists.txt script is not in the format of other boost
> libraries in boost/libs so it won't work with the super-project as it is.

Not aware there was a standard format. Could you please point
me out to the relevant documentation?

> - The example/CMakeLists.txt script refers to
> BOOST_MYSQL_INTEGRATION_TESTS. I don't think examples can be considered
> integration tests.

They are from the perspective that they require a real server to run.

>
> Examples:
>
> - The examples are very nice. Especially the one with coroutines. They are
> also very limited. They are all about the same text queries, which
> shouldn't even be used in favor of prepared statements.

Tracked by https://github.com/anarthal/mysql/issues/81. May be worth
reconsidering when we tackle https://github.com/anarthal/mysql/issues/69

> - Many examples about continuation styles are not very useful because this
> is more of an Asio feature than a library feature. The library feature, so
> to speak, is supporting Asio tokens properly. The continuation styles could
> be exemplified in the exposition with some small snippets for users not
> used to Asio without the documentation losing any value.

I may move them into a separate section, instead of having all of them
in a single section.

> - The page about the docker container should specify that the username and
> password are "root" and "".

Raised https://github.com/anarthal/mysql/issues/82

>
> Tests:
>
> Some unit tests take a ***very*** long time. Enough to make coffee and a
> sandwich. And they seem to not be adding a lot of value in terms of
> coverage. For instance, "mysql/test/unit/detail/protocol/date.cpp(72):
> info: check '1974- 1-30' has passed" going through all possible dates
> multiple times took a long time.

True. They may have been useful during development but they don't
give us a lot as regression tests. Raised
https://github.com/anarthal/mysql/issues/83

> The documentation is complete. The main points that differentiate the
> library are
>
> - it's a complete rewrite of the protocol,
> - it's low-level and
> - it's based on Boost.Asio
>
> The documentation should emphasize these points as much as possible,
> especially the first one. This should be in the introduction, the
> motivation, slogans, logos, and wherever people can see it easily.

Has been noted in several other reviews.
https://github.com/anarthal/mysql/issues/70 tracks it.

>
> The documentation should also provide arguments and evidence that
> these design goals are a good idea, as often discussed when the topic is
> the value of this library. Why is it worth rewriting the protocol? To what
> use cases are such a low-level library useful? Why should a person who
> already uses other libraries or the C API care about Asio now? Something
> that should also be highlighted is the difference between the library and
> other higher-level libraries, in particular, naming names.

https://github.com/anarthal/mysql/issues/50 tracks it.

>
> Minor issues:
>
> - There's no link in the documentation to the protocol specification. It
> would be interesting to know what the reference specification is. Or
> whether the protocol was inferred somehow. Is there any chance this
> protocol might change? What about divergences between MySql and MariaDB?
> How stable is the protocol? For what range of versions does it work? What's
> the policy when it changes?

Raised https://github.com/anarthal/mysql/issues/84.
FYI, this is the specification:
https://dev.mysql.com/doc/dev/mysql-server/latest/PAGE_PROTOCOL.html.
Some mechanics are better explained here:
https://dev.mysql.com/doc/internals/en/client-server-protocol.html
(although it's an old page).

> - Some links are broken (for instance, linking to
> https://anarthal.github.io/boost-mysql/index.html).

Just fixed it in develop.

> - "All async operations in this library support per-operation
> cancellation". It's important to highlight this is per operation in the
> Asio sense of an operation but not in the MySql sense of an operation
> because the MySql connection is invalid after that.

Note added in develop.

> - "Boost.MySql has been tested with the following versions of MySQL".
> MariaDB is not a version of MySql.

Fixed in develop.

> - Prepared statements should come first in the examples, to highlight them
> as the default pattern.

Will be tackled as part of https://github.com/anarthal/mysql/issues/81

> - The documentation refers to topics that haven't been explained yet. Maybe
> "value" could be explained after "row", and "row" could be explained after
> "resultset" and "resultset" after "queries".

The basic concepts have been covered in the tutorial, which goes first.
So at this point the user should know the *basic* concept of what a value,
a row and a resultset is. The rest of the sections go deeper in these concepts.
Did you find section order counter-intuitive, even reading the tutorial first?

> - The section "Text Queries" is quite small in comparison to other
> sections. It could include some examples and snippets like other sections
> do.

I'll probably expand it as part of https://github.com/anarthal/mysql/issues/69,
because there will be more stuff to cover.

> - "The following completion tokens can be used in any asyncrhonous
> operation within Boost.Mysql" -> "Any completion token..."

Fixed in develop

> - "When they fail, they throw a boost::system::system_error exception".
> Don't these functions just set the proper error_code, as usual with Asio
> and Beast?

No. In Asio, you always get two overloads for sync operations. Let's take
basic_stream_socket::read_some, for instance:
https://www.boost.org/doc/libs/1_79_0/doc/html/boost_asio/reference/basic_stream_socket/read_some/overload1.html
https://www.boost.org/doc/libs/1_79_0/doc/html/boost_asio/reference/basic_stream_socket/read_some/overload2.html
Overload 1 throws boost::system::system_error, while overload 2 returns the
error by setting the appropriate error_code. This library is
consistent with that.

> - The "MySQL to C++ mapping reference" section should be using a table.

I'd say this is personal preference. I will explore the tabular format
when I have the chance.

> - A small subsection on transactions would be helpful even if there's no
> library functionality to help with that.

https://github.com/anarthal/mysql/issues/65 tracks it.

> - The documentation should include some comparisons that are not obvious to
> potential users. C/C++ APIs. The advantages of the Asio async model.
> Benchmarks if possible.

https://github.com/anarthal/mysql/issues/50

>
> > Are there any immediate improvements that could be made after acceptance,
> if acceptance should happen?
>
> While it's important to have a general variant type for row values, a
> simpler interface for tuples of custom types would be very welcome and
> would simplify things by a lot, while also avoiding allocations, since
> columns always have the same types. This feature is too obvious since users
> almost always know their column types at compile time and this demand is
> too recurrent in applications to ignore.
>
> > Do you think the library should be accepted as a Boost library? Be sure
> to say this explicitly so that your other comments don't obscure your
> overall opinion.
>
> I believe it should be conditionally accepted with the same conditions
> stated in other reviews: allowing for memory reuse in the read_* functions
> and fixing the "value" type.

Could you please be a little more explicit on your second point?
Would adding a struct/tuple interface
(https://github.com/anarthal/mysql/issues/60)
and making it owning (https://github.com/anarthal/mysql/issues/85), which would
in turn tackle all the optional stuff, be enough?

Kind regards,
Ruben.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk