|
Boost : |
From: Corwin Joy (cjoy_at_[hidden])
Date: 2005-08-22 20:54:14
Having designed the database template library (a C++ STL type interface for
ODBC at http://dtemplatelib.sourceforge.net/) I thought I'd throw in a few
cents about some design you'll want to think through *very* carefully.
Here are some key questions to ask:
Iterator design:
ODBC supports multiple cursors types which I think are pretty typical of
various kinds of databases.
You want to think carefully about how you map these to STL iterator types.
Lets say you do a simple mapping of
forward cursor -> input iterator and
scrollable cursor -> random access iterator.
Immediately there are "gothcas"
input iterator:
- What happens when you copy your iterator? Should it point to the same
cursor? If you do then making a copy of the iterator and incrementing will
have a "side effect" on the original iterator. If you do not then users may
be surprised when they compare two copies of the same iterator and don't get
the same thing. How can two cursors be equal to begin() of a container but
not have the same value?
random access iterator:
- I assume your iterators will all point to the same recordset resource -
but if so, how do you avoid "thrashing". If a and b are two iterators that
point to the same scrollable recordset at different then you don't really
want to be excuting multiple database positioning operations to go back and
forth between the values. This implies some kind of intelligent caching but
you want to be carefull or you'll end up with more than you can handle in
memory since it is not uncommon for tables to have a million rows or more
(O.K., maybe the Human Genome center is a bit larger than most in this
regard :-) ).
- You'll probably also want to implemnent size() operations
intelligently, paging through every record and counting is not so good...
- Be aware of the performance cost of scrollable recordsets (see below
for more).
We also had a input + output iterator that we did to allow users to loop
through a recordset and modify values.
Binding design:
- In DTL we wanted to go away from the MFC model of forcing you to inject
our code into your classes. Instead we defined operators that specify the
mapping relationship between columns in your query and members in your
class. I think this is a good model since it keeps the database mappings
from getting tangled up with the class logic. This kind of design is also
used very successfully in the Java world by popular projects like Hibernate
which let you define your class as you need and simply ask you to provide
XML type mappings. I'm also not a big fan of frameworks that force you into
a particular class design.
Here is a link from DTL showing this:
http://dtemplatelib.sourceforge.net/dtl_introduction.htm
Here is an example of a hibernate mapping:
http://www.hibernate.org/hib_docs/v3/reference/en/html/quickstart.html#quickstart-mapping
Transparent SQL:
- One of the design goals in DTL was to try not to be "smart" about SQL.
The fact is that folks do all kinds of strange things in SQL like different
outer join syntax, hierarchical queries, analytic functions over join
partitions, stored proceedures etc. I think there is a real advantage to
having an iterator that can "get out of the way" and NOT try to parse what
the user is doing so they can execute these kinds of complex queries. Also,
SQL is well understood by most developers so letting them use it
transparently rather than creating your own join language will significantly
reduce the learning curve.
- Having said this, being able to construct select, insert, update
statements for the user has advantages - it lets you create read/write
containers from a set of bindings transparently and can let you work around
limitations of the underlying drivers.
Binding strings:
- In ODBC, at least, this is rather ugly. You can choose to either bind a
fixed size buffer (implying a maximum string size) or you can use something
called GetData and PutData to transfer an arbitrary length string in blocks.
If you choose the latter (which is what we did so folks could bind
std::string and not have to recompile every time their database field size
changed) ODBC gets annoying in that many drivers then require that such
string columns have to come last.
Think carefully about how you want to support stored procedures:
1. A stored procedure can have input parameters, output parameters and can
return multiple sets of results (and will in SQL server).
How to support user defined types?
1. Users will want to define custom types that can be mapped transparently
to sets of database fields. Examples:
- A complex data type that maps to a pair of fields.
- A "date" type that is stored as an integer (similar to boost::time)
but might be written to multiple database fields.
- Mapping of more complicated types, graphs, sounds etc. to blobs.
Transactions:
In ODBC, at least, there is not a database independent notion of a
transaction. Instead, when you commit, you commit everything open against
that connection. So if you want multiple transactions you need multiple
connections. We went with C++ try...catch logic to handle transactions
since this closely mirrors some of the ACID concepts you need for
transactions in a database. So we would have
try {
// a set of operations here
commit;
}
catch(...) {
rollback;
}
More on our design for this is here:
http://dtemplatelib.sourceforge.net/ExceptionHandling.htm
More later... anyway I would be interested in helping on this project but
don't know how much time I will have to commit to it. I definitely like the
idea of defining the base layer in terms of simple operations so that it can
support both ODBC and native drivers since I think this would be a big
advantage for situations where and ODBC driver simply is not available
...
Here is some background on how the ODBC cursor types compare to the C++
iterator types and a little bit on how the mapping was done in DTL
ODBC cursors.
The process of retrieving rows from a query result set and returning them to
the application is called fetching. An application fetches data with a
cursor. An ODBC cursor on a result set indicates the current position in the
result set and what row of data will be returned next. In ODBC there are
two main cursor types.
Forward Only Cursors:
- Can only move forward through the set of records.
- Are not guaranteed to move through the records in any particular
order.
- Two different cursors against the same table using the same query
may not return the same rows or same number of rows due to inserts, updates
and deletes.
Scrollable Cursors:
- Can move forward and backward through the result set.
- Two different cursors against the same table using the same query
may not return the same rows or same number of rows due to inserts, updates
and deletes.
Scrollable cursors raise two additional questions not seen in forward-only
cursors. First, how does the cursor establish some kind of order for the set
of records returned? Without some kind of order for the records moving
"forward" or "backward" through the result set does not have any meaning.
Second, should the cursor detect changes to rows that have been previously
fetched? In other words, when we re-position the cursor should it detect
changes due to inserted, updated or deleted rows? Depending on how these
questions are answered one obtains different scrollable cursor types.
Scrollable Cursors:
- Static cursor type.
o Once the cursor has been created, the set of records is fixed
regardless of any subsequent changes made to the database. Any inserts,
updates or deletes by other applications to the set of records are not
detected by this cursor. Usually this is implemented either by locking the
underlying table so that no changes can be made by other applications while
the cursor is in effect, or the set of records is viewed as of a particular
version.
- Dynamic cursor type.
o Detects any changes made to the set of records due to insert, update
or delete by other applications. Usually this kind of cursor is implemented
by simply ordering the records with a unique index. This cursor then
re-executes a fetch every time it scrolls to a new set of rows. Suppose the
original set of records is defined by "SELECT * FROM EMPLOYEES ORDER BY
EMPLOYEE_ID". Then, when the user requests that this cursor scroll to the
next set of rows the cursor would execute the query "SELECT * FROM EMPLOYEES
WHERE EMPLOYEE_ID > (?) ORDER BY EMPLOYEE_ID" where it would pass as a
parameter the EMPLOYEE_ID from the last row in the current buffer. This
kind of cursor actually creates multiple result sets due to the re-execution
of the query every time a scroll operation is performed. As a result, the
cursor does not establish a consistent order for the records in the sense
that the "second" record in the result set might change its value and
position if we scroll forward and then attempt to scroll back to this
"second" record.
- Keyset driven cursor type.
o When the cursor is opened, it immediately stores a set of unique
keys for all rows in the result set. Later, when the user attempts a scroll
operation, the cursor simply retrieves rows corresponding to the previously
stored keys. This means that updates by other applications are detected
with a new value for the row (unless the update changes the key - if another
application changes the key this is equivalent to a delete and then an
insert as far as a keyset cursor is concerned). Deleted records show up as
"holes" with a special status flag. As for inserts made by other
applications, these are not detected since these inserted rows did not
appear in the original set of keys that was fetched when the cursor was
first opened. This cursor does establish a consistent order for the records
in the sense that if we scroll forwards and backwards the same records will
appear in the same position, although they may be marked as having been
deleted and may contain updated values.
- Mixed cursor.
o This is a combination of a keyset driven cursor and a dynamic
cursor. This kind of cursor is typically used when the table is too large
to make it practical to save keys for all the rows in the result set. In
this case, the cursor fetches keys for some subset of the rows. Inside this
subset, the cursor behaves like a keyset driven cursor. Outside this
subset, the cursor behaves like a dynamic cursor. A mixed cursor is
equivalent to a keyset-driven cursor when the size of the key subset is
equal to the result set size. A mixed cursor is equivalent to a dynamic
cursor when the size of the key subset is equal to 1.
STL Iterators.
Here we summarize some key properties of certain C++ standard iterators.
For this discussion we are really interested in only three types of
iterators: input iterators, output iterators and random access iterators.
Input and output iterators are considered primarily because these have the
smallest set of requirements of all the iterator types and so can be mapped
most easily to an ODBC cursor. Random access iterators are interesting
since these have the largest set of requirements and correspond most closely
to the type of usage that is made of scrollable ODBC cursors. Here we
summarize only the iterator properties that are important to our discussion
(for full details on the iterator requirements see section 24.1 of the C++
standard).
Noteworthy Input Iterator Requirements
1.. Input iterators are assignable and copy constructable.
2.. Let a, b be input iterators. If a==b and (a,b) is in the domain of ==
then *a is equivalent to *b.
3.. Let r be an input iterator. If we increment r via ++r, then after the
increment any copies of the previous value of r are no longer required to be
dereferencable or in the domain of ==.
4.. For input iterators a==b does not imply ++a==++b. Algorithms on input
iterators should never attempt to pass through the same iterator twice.
They should be single pass algorithms.
5.. Let r be an input iterator. The result of *r must be convertible to
the value type of the iterator.
Some comments.
- Requirement two that the dereferenced value of a and b be
'equivalent' is vague. Standard is not clear on what 'equivalent' means,
but it does not mean these two must have the same value.
- Requirement three means that conceptually when we implement an
input iterator we are allowed to code it such that only one copy of the
iterator can be "active". The reason for this is that once one copy of the
iterator has been incremented other copies become invalid.
Noteworthy Output Iterator Requirements
1.. Output iterators are assignable and copy constructable.
2.. Let r be an output iterator. The only valid use of operator * is on
the left hand side of an assignment. After an assignment, the iterator, r,
must be incremented before it can be assigned to again. Algorithms on output
iterators should be single pass algorithms.
Some comments.
- Typically operator * on an output iterator is defined to return a
proxy class that performs a write of the data upon assignment. This proxy
might be the iterator itself or might be some other class.
How DTL Abstracts ODBC cursors as iterators.
SELECT statement, forward only cursor: Input Iterator
- In order to allow multiple copies of the iterator to be active at
the same time, we chose a design wherein each iterator holds its own ODBC
cursor. This allows us to have logic like a=b; ++a; ++b; and have both
iterators a and b be able to return a value. In fact, we did not have to
do this since the standard allows only one copy of the iterator to be
active. So, under the standard it would be allowable for b to become invalid
once a had been incremented.
- Note that we are taking full advantage of the standard in
performing this mapping. Records are not guaranteed to come back in any
particular order, nor are two copies of an iterator guaranteed to retrieve
the same set of records due the fact that inserts, updates and deletes may
be performed on the set of records the select statement is operating
against.
INSERT, UPDATE, DELETE statement, forward only cursor: Output Iterator
- This is fairly straightforward. The only trick here is that the
execute of the statement must happen on assignment. Otherwise, loops that
use post-increment logic to write to the iterator will not behave correctly.
STORED PROCEDURE, forward only cursor: Input + Output Iterator
- Mapping stored procedures to C++ iterators was less
straightforward since stored procedures perform both input and output.
- Stored procedures accept arguments as input, and in this way act
as an output iterator.
- Stored procedures can return results from a query and can return
results via its arguments so in this way it acts as an input iterator. What
is more, stored procedures can return more than one result set which means
the resulting iterator may need to act over more than one range. Therefore,
we added a MoreResults() method to the sql_iterator object so that the
iterator can pass through multiple result sets that might be returned by
the query.
- To split the input and output for stored procedures we have
sql_iterators hold two buffers. One buffer we called the data object
(DataObj) which stores the result sets that come back from the query. The
second buffer we call the parameter object (ParamObj) which is designed to
hold parameters that are passed to the query.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk