From: Ruben Perez (rubenperez038_at_[hidden])
Date: 2022-04-01 22:01:35


On Fri, 1 Apr 2022 at 23:35, Marcelo Zimbres Silva <mzimbres_at_[hidden]>
wrote:

> Hi,
>
> On Fri, 1 Apr 2022 at 16:24, Ruben Perez <rubenperez038_at_[hidden]> wrote:
> >> On Fri, 1 Apr 2022 at 12:32, Marcelo Zimbres Silva <mzimbres_at_[hidden]> wrote:
> >>>
> >>> Hi Ruben,
> >>>
> >>> >>
> >>> >>
> >>> >> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__connection/async_query/overload1.html
> >>> >> The handler signature for this operation is
> >>> >> void(boost::mysql::error_code, boost::mysql::resultset<Stream>)
> >>> >>
> >>> >> Here the completion handler is returning an I/O object, i.e.
> >>> >> boost::mysql::resultset, which is also unusual as far as I
> >>> >> can see. Does that play nice with executors? I would
> >>> >> prefer having the completion handler return config
> >>> >> parameters with which I can instantiate a
> >>> >> boost::mysql::resultset myself in my desired executor.
> >>> >>
> >>> >
> >>> > The three I/O objects this library provides (connection,
> >>> > resultset and prepared_statement) are just proxies for the
> >>> > underlying Stream I/O object in terms of the executor (i.e.
> >>> > they just return the underlying Stream::get_executor()
> >>> > value).
> >>>
> >>> My expectation is that the communication with the mysql server occurs
> >>> only through the connection class. It feels awkward to me that I need
> >>> these proxy objects when there is a connection object around; the
> >>> lifetimes of these proxies are bound to it, for example.
> >>
> >>
> >> In addition to a pointer to the connection object, resultsets also hold
> >> several pieces of metadata about rows. Because of how the MySQL
> >> protocol works, these pieces of metadata are mandatory to parse rows.
> >> The MySQL server sends these as several packets following
> >> connection::query or prepared_statement::execute, and the library
> >> stores these in the resultset object. In my opinion, you need an object
> >> representing this information about how to interpret a resultset.
> >> Something similar happens with prepared_statement.
> >>
> >> Something we could do is remove the pointer to the connection
> >> from the resultset and prepared_statement objects, making these
> >> just plain data objects. You would have the following function
> >> signatures in connection (I'll just list the sync-with-exception ones
> >> for brevity):
> >>
> >> void connection::query(string_view query, resultset& output);
> >> void connection::prepare_statement(string_view statement, prepared_statement& output);
> >> void connection::execute_statement(const prepared_statement& stmt, resultset& output);
> >> void connection::read_row(resultset& resultset, row& output);
>
> This looks better to me. If the metadata is simple and small enough,
> you can return it in the completion handler as you are doing now. That
> would be OK with me, as it is not an I/O object anymore.
>

It is a pretty complex structure, but it mostly lives on the heap and
is move-only. I can consider returning this as an rvalue reference, though.
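
For illustration, a minimal sketch of how the alternative signatures above
would be used (these are the hypothetical signatures from the quoted
message, not the current API; conn and the table names are placeholders):

boost::mysql::resultset result;        // plain data object: metadata only, no I/O
boost::mysql::prepared_statement stmt; // also a plain data object in this scheme
boost::mysql::row r;

conn.query("SELECT id, name FROM employee", result); // fills result with metadata
conn.read_row(result, r);                            // rows always go through the connection

conn.prepare_statement("SELECT name FROM employee", stmt);
conn.execute_statement(stmt, result);
conn.read_row(result, r);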

>
> >> The drawback of this approach is that you should always
> >> call execute_statement and read_row on the connection object
> >> that created the statement or resultset object, and there is
> >> nothing enforcing this. I picked the other approach because
> >> I feel it is more semantic. Would these signatures make more sense
> >> for you?
>
> You can still provide the call to execute_statement and read_row as a
> composed operation so that users have to call only one function.
>
> >>> Some further points:
> >>>
> >>> - A row in your library is basically std::vector<value>, which means
> >>> std::vector<row> is pessimized storage for rows with the same length.
> >>> Aren't rows always equal in length for a table?
> >>
> >>
> >> The std::vector<row> overloads are not optimized for performance,
> >> that is true. If you're striving for better performance, I would go
> >> for not having a vector at all, and reading one row at a time with
> >> resultset::read_one, which allows you to re-use the same row over
> >> and over, avoiding allocations almost entirely.
>
> That is what I would do.
>
> >> What would be your suggestion for the vector<row> approach? Having
> >> an ad-hoc, matrix-like container for values?
>
> I wouldn't offer this function at all. Let users decide how they want
> to read the rows. With coroutines this becomes trivial.
>
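
For reference, the row-at-a-time pattern discussed above looks roughly like
this in the sync case (the read_one overload shown, which fills a
caller-provided row and signals the end of the resultset through its return
value, is an assumption on my part; conn and process_row are placeholders):

auto result = conn.query("SELECT id, name FROM employee");
boost::mysql::row r;
while (result.read_one(r)) // assumed: fills r, returns false at the end of the resultset
{
    // r is re-used on every iteration, so there is no per-row allocation;
    // process the row before the next read_one overwrites it
    process_row(r);
}
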
> >>> - Your value class is a variant behind the scenes i.e.
> >>>
> >>> boost::variant2::variant<null_t, std::int64_t, std::uint64_t,
> >>> boost::string_view, float, double, date, datetime, time>;
> >>>
> >>> I would like to understand what is the lifetime of the string_view.
> >>> Is it pointing to some kind of internal buffer? I expect all data to
> >>> be owned by the client class after the completion of an async
> >>> operation. Having pointers pointing to the connection's internal state
> >>> feels bad.
> >>
> >>
> >> The string_views point to memory owned by rows, and never to the
> >> connection internal buffers. As long as you keep your row object
> >> alive, its values will be valid. This is also the reason why row objects
> >> are movable but not copyable. Currently, row packets are read into
> >> the row buffers directly and parsed in-place, to avoid copying.
> >> This is stated in the row object docs, but it may be worth noting it
> >> in the value object docs, too.
> >>
> >> Having all the variant alternatives as cheap-to-copy objects
> >> has the advantage of being able to implement functions with
> >> conversions, like value::get or value::get_optional, without
> >> worrying about expensive copies.
> >>
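
As a small illustration of the conversion functions mentioned above (get and
get_optional are the names from the text above; the exact signatures, and the
values() accessor on the row, may differ slightly, and use_name is a
placeholder):

const boost::mysql::value& v = r.values()[0];     // r is a row kept alive by the caller
std::int64_t id = v.get<std::int64_t>();          // throws on type mismatch
auto name = v.get_optional<boost::string_view>(); // empty optional on mismatch
if (name)
{
    // the string_view points into r's storage, so r must outlive this use
    use_name(*name);
}
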
> >>>
> >>>
> >>> - There is one further point that doesn't seem to play well with this
> >>> variant design. Some people may store some kind of serialized data in
> >>> the database, e.g. JSON strings. It would be nice if it were possible
> >>> to read those values avoiding temporaries; that is important for
> >>> reducing latency and memory usage on large payloads. It would look
> >>> something like
> >>>
> >>> struct mydata1; struct mydata2;
> >>> using myrow = std::tuple<std::string, ..., mydata2, ..., mydata1, ...>;
> >>>
> >>> myrow row;
> >>> result.read_row(row);
> >>>
> >>> In this case, the parser would have to call user code every time it
> >>> reads user data in the socket buffer, instead of copying it to
> >>> additional buffers (in order to hand it to the user later, when
> >>> parsing is done).
> >>
> >>
> >> I have in mind implementing this in future versions. My idea is to
> >> create a concept similar to Beast's http::basic_parser. That row_parser
> >> object should be a type with member functions like:
> >>
> >> row_parser::on_value(std::uint8_t)
> >> row_parser::on_value(std::uint16_t)
> >> // And so on, for all the possible types MySQL allows
> >> // Strings could have special handling, implementing incremental parsing for them
> >>
> >> That would be called like:
> >>
> >> struct mydata1; struct mydata2;
> >> using myrow = std::tuple<std::string, ..., mydata2, ..., mydata1, ...>;
> >> struct my_parser {
> >>     myrow r;
> >>     error_code on_value(std::uint8_t) { /* place the value in your row object */ }
> >> };
> >>
> >> my_parser parser;
> >> resultset.read_row(parser);
>
> I think it gets even simpler: once you allow users to pass their own
> types as rows, all they have to provide is a function to deserialize
> them from the read buffer directly into the object, e.g.
>
> void from_string(T1& obj, char const* data, std::size_t size, boost::system::error_code& ec);
> void from_string(T2& obj, char const* data, std::size_t size, boost::system::error_code& ec);
>

You're assuming MySQL sends everything as strings that can be trivially
parsed, but that's not the case. There are two different encodings,
depending on whether a row came from connection::query or
prepared_statement::execute (text queries employ a plain-text format,
while prepared statements use binary encoding). In particular, the types
DATE, DATETIME, TIME and BIT employ a nasty struct-like format in the
binary protocol. BIT employs a binary-inside-text representation in the
text protocol. DATE, TIME and DATETIME have the concept of "zero dates",
which are invalid values MySQL allows but that don't represent any actual
value. NULL values are represented as a bitmap in the binary protocol,
and as special values in the text protocol.

I would expect a MySQL library to handle all these details (I would say
this is one of the main benefits this library grants the user). As a user,
I wouldn't want to know all this stuff.
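
To make that concrete, the goal is that application code looks the same
regardless of which encoding was used on the wire. Roughly (illustrative
only: is_null, get_optional and values() stand for the kind of accessors
discussed earlier in this thread, r is a row kept alive by the caller, and
the column index is made up):

// Same application code whether the row came from a text query or a
// prepared statement; the protocol differences stay inside the library.
const boost::mysql::value& birthday = r.values()[2];
if (birthday.is_null())
{
    // NULL bitmap (binary protocol) or NULL marker (text protocol), already decoded
}
else if (auto d = birthday.get_optional<boost::mysql::date>())
{
    // *d is a parsed date; the struct-like binary format never reaches user code
}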

> ...
>
> std::tuple<T1, T2, ...> row;
> connection.read_row(row);
>
> When your parser hits user data, it calls from_string with the correct
> type, which it knows at that point because you pass it as an argument.
>
> Marcelo
>
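
For completeness, the customization point suggested above would presumably
look something like this for a JSON column (my_json_payload, the tuple
layout and the parsing logic are purely illustrative):

struct my_json_payload { /* application-defined fields */ };

// Deserialize straight from the library's read buffer into the user's type,
// avoiding an intermediate std::string.
void from_string(my_json_payload& obj, char const* data, std::size_t size,
                 boost::system::error_code& ec)
{
    // parse the JSON in [data, data + size) into obj; set ec on failure
}

std::tuple<std::int64_t, my_json_payload> row;
connection.read_row(row); // would call from_string when it hits the JSON column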

