Boost logo

Boost :

Subject: Re: [boost] [stldb] design details (was Boost library submission (poll for interest))
From: Bob Walters (bob.s.walters_at_[hidden])
Date: 2010-01-07 12:36:11


On Tue, 5 Jan 2010 16:54:52 +0100 Stefan Strasser <strasser_at_[hidden]> wrote:

> I thought the reason that requires you to load the entire dataset on startup
> and save it entirely at checkpoints and recoveries is that the user needs to
> access the mapped region directly through pointers, since the user types
> exist in an unserialized form there. is that right?

IIUC by load you mean reconstruct the memory mapped file from the checkpoints + logs?
The only time that I have to reconstruct the memory mapped file or
shared memory region using the checkpoints and logs is when recovery is
done, NOT during routine/controlled startup and shutdown. See the architecture page:
http://stldb.sourceforge.net/stldb/arch.html. I specifically wrote this page
after our prior discussions. If processes disconnect normally, and recovery is not needed
then I reuse the region. When recovery is required, I do have the problem of having to
reconstruct the memory mapped file using the whole of the last checkpoint. Thus
I agree your shadow paging of the container in the memory map would reduce
recovery time.

Note that I am changing my checkpoint process so that each subsequent checkpoint will only write
out changes since the last checkpoint, and do so by reusing free portions of the checkpoint file.
Consequently the I/O directed to both the log and checkpoints is in proportion to the
rate of change, and not to the size of the database.

STLdb was originally designed as a memory resident database. It should do that well since
it provides direct access to containers in a shared region. My thinking was that if the user
elects to use a memory mapped file for the region, they
then have a paged database. The I/O directed at checkpoints and logs is based entirely on
rate of change. However, that paged database then introduces redundant disk I/O
since it will page out per LRU while I also occasionally write changed entries out to the checkpoints
as checkpoints are done. IIUC you have a similar problem but it involves redundancy
of memory. If you work with a large intrusive container, then you have the serialized storage memory mapped
in, but must also cache the deserialized nodes of the container in memory to ensure fast locator
traversal, and allow it to work like an index. What I am currently very intriguied by is your idea of
using STLdb as indexes for a large database. i.e. The containers are map<key,locator>, and refer
to a separate Boost.Persistence stored objects. Ideally, the containers will be small and tend to be
held in memory (due to frequent use) so the double-write problem goes away since the OS won't
be paging the memory map. A user could even mandate that the indexes stay memory resident by
using a shared memory region type, rather than a memory map.

There's another benefit to this - it would allow heterogenous containers, to the theme of map<key,base*>
because of the locator's use of boost serialization. And also because the real value-type is not
in shared memory, they can once again have virtual methods, etc.

> what confuses me is the "Updating Entries" section and the fact that you do
> track changes made by the user, e.g. via trans_map::update.
> so if the library is aware of every change to the mapped region, what
> stops you from employing a shadow paging technique and only writing to the
> mapped region on transaction commit, when the modified pages are logged?

> that would make your library MUCH more widely usable.

Right now, the interaction between the containers and the rest of the transactional
infrastructure is based off an approach inspired by the stasis library
(http://tardis.cs.berkeley.edu/stasis/developers/html/main.html). There is a
concept of a transactionally aware object. As modifiers on
transactionally aware objects are called, those objects post 'TransactionalOperations' onto
the transaction passed to them. These ops represent the means by which the transaction can ultimately
commit or rollback that operation when the time comes. Containers like trans_map are transactionally aware
and define their own atomic operations (i.e. in the case of trans_map,
insert(), lock(), update(), erase(), clear() and swap().) When these operations are logged, it is the
logical operation (insert<5,"foobar">) which is logged, not the changes done to memory,
so that during recovery it can be replayed. I am not logging memory changes.

A big contrast between this STLdb library and Boost.Persistence is that in your case, IIUC
you are tracking the use and modification of objects but the objects themselves are unaware.
One downside of this is that the objects can't collaborate to support piecemeal updates of small
portion of the objects - it's always the full object which is re-serialized. Because I was starting
with the notion of persisting containers, I sought to devise a model for a transaction-aware object
which supported updates against portion of the object. Thus I came up with the idea of
the transactional objects supporting one or more transactional operations. The tracking of changes
is based on the use of these transactional operations.

The only reason I went with methods like lock() and update() is that while I could have
done this implicitly via non-const access to iterators (for update anyway), I thought that implicit
approach would be an easy source of bugs. i.e. My 1M finds() become 1M find() and updates()
because I forgot to make the getName() method on my object const. Are you requiring const
correctness when accessing locators? I really like your interface in Boost.Persistent, but are
you currently assuming an update when a locator is accessed non-const, or is it more clever than
that, i.e. making a transaction-local copy during non-const access, but later checking for differences
between the transaction-local copy and original before reserializing and storing the object?

> because on the other hand, dereferencing trans_map::iterator returns a
> non-const reference to value_type, indicating the opposite.

Bug. I originally had the iterator always returning a const
reference, to be consistent with the demand that users use update(). However, had some
kind of difficulty, and never circled back to it (and can't recall why currently).
Will be addressed in the final submission.

Also, keep in mind that the code I have currently has not been submitted to
boost. Its ugly. The interfaces need work, but what currently exists does run. I am not
proud of some of what's in there currently.

> what kind of tree does trans_map use internally?

Red/Black per boost::interprocess::map, which is used (intact) internally for the
low-level map implementation. I'm wrapping that with code to make it
transactionally aware. I didn't want yet another implementation of map.

> I can't wrap my head around the locking or MVCC used for isolation.
> could you explain this a little more in detail?

> does the library automatically
> record/lock accesses? is the user supposed to lock manually on every access?

The isolation behavior is based on RDBMS conventions for row-level locking. Reads do not lock
entries, writes do lock entries. These entry-level locks last until commit/rollback. A row
can be locked directly (for pessimistic locking convention) by using the lock() method
on the entry. This was needed because I do have blocking capabilities on
modifier methods, and thus needed a way to keep one thread's blocking update from
overwriting another. See the section in the "quick start" which shows lock() being used.
It corresponds to the RDBMS concept of "select for update" and is used for the same reasons.

I have been thinking about adding optimistic locking support based on specialization
of the map for a value type implementing some optimistically lockable interface.
Doing so would then allow you to avoid any call to lock(), but present the chance of failure
during update(). This is all per RDBMS row-update conventions, and ultimately, I planned
to support both just as many RDBMS support both, giving the user the choice.

Incidentally, one problem I have with Boost.Persistence is that you always handle
contention by throwing. IIUC contention can occur when two threads insert records with
different keys into a container simply because they would both add their node under the
same parent, and despite their nodes having different keys. So in other words, there's
nothing apps can do to prevent the possibility of contention errors.

This can backfire badly if the probability of contention is high because of app design.
For example, someone writes a transaction which does several things, one of which is
to push_back something to a queue. The transactions which did the push_back would
need to be completely synchronized, IIUC. Support for blocking was one of the RDBSM
worlds answer to this kind of problem, and the reason I included support for it in the trans_map container design.
Thus I'm drawing the conclusion that implementing containers base on the locator concept's
handling of concurrency might not always be the right idea.

btw... I'm still not settled on the interface for lock() / update() and have wondered if I should
somehow put that onto the iterator. E.g. lock() would be a method of the iterator, and
update() could be a method on the iterator which returns a non-const L-value reference
that can then be updated directly. Suggestions welcome...

> how are possible read-accesses over an entire range stored, e.g. by
> using map::equal_range()?

If you hold a shared lock on the map while accessing it via a generic algorithm
you are guaranteed to see the data reflecting a consistent set of transactions.
In between locks, the result of new commits can appear, leading to the various
phantom scenarios defined in the ANSI isolation levels.
Holding a shared lock on the map prevents other transactions from committing
changes to it while the lock is in effect. You aren't meant to hold locks
on the containers for a very long time if you want good concurrency, but it can
be done.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk