Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Lee Clagett (forum_at_[hidden])
Date: 2017-03-25 20:01:21


On Tue, 21 Mar 2017 23:08:30 -0400
Vinnie Falco via Boost <boost_at_[hidden]> wrote:
> On Tue, Mar 21, 2017 at 10:50 PM, Lee Clagett via Boost
> <boost_at_[hidden]> wrote:
>
[...snip...]
> > The pathological case is power loss, which cannot be unit tested
> > easily (and probably isn't possible).
>
> I think this can be unit tested, and I believe that NuDB's unit test
> covers the case of power loss. I think we can agree that power loss on
> a read is uninteresting (since it can't corrupt data). The unit test
> models a power loss as a fatal error during a write. The test
> exercises all possible fatal errors using an incremental approach (I
> alluded to this in my previous message).
>
> The database template requires a File type to instantiate:
>
> template<class Hasher, class File>
> class basic_store;
>
> NuDB comes with posix_file and win32_file. The unit test uses its own
> fail_file which wraps the native file and generates a simulated error
> on the Nth write:
> https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.hpp#L143
>
> The test implementation tries to insert a bunch of records into a new
> database using a fail_file, incrementing the counter as it goes. On a
> failure it tries to recover (which can also result in a failed write).
> After the recover fails, it verifies that the database is still
> consistent. You can see the test implementation:
> https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
>
> I *think* this test proves that the implementation maintains
> consistency on any power loss. I would very much like anyone who is
> interested to tell me whether this approach works.
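
For reference, my mental model of that incremental approach is roughly
the loop below (a standalone sketch, not NuDB's actual test code; the
`write_failure` type and the callbacks are placeholders for whatever
fail_file and the test harness really provide):

#include <cstddef>
#include <functional>
#include <stdexcept>

// Thrown by the failing file wrapper when the simulated write limit is hit.
struct write_failure : std::runtime_error
{
    write_failure() : std::runtime_error("simulated write failure") {}
};

// `insert` runs the workload against a database whose file wrapper fails
// on the n-th write, `recover` runs recovery under the same failure
// counter, and `consistent` performs a full verification pass.
bool exhaustive_power_loss_test(
    std::size_t max_writes,
    std::function<void(std::size_t)> const& insert,
    std::function<void(std::size_t)> const& recover,
    std::function<bool()> const& consistent)
{
    for(std::size_t n = 1; n <= max_writes; ++n)
    {
        try
        {
            insert(n);            // simulated power loss on the n-th write
        }
        catch(write_failure const&)
        {
            try
            {
                recover(n);       // recovery itself may also lose power
            }
            catch(write_failure const&)
            {
            }
        }
        if(!consistent())
            return false;         // a crash at write n corrupted the database
    }
    return true;
}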

The other responses to this thread reiterated what I thought could
occur - there is a corruption "race" between a write call and the
completion of the file sync.

Writing the blocks to the log file is superfluous, because those writes
span multiple sectors and there is no mechanism to detect a partial
write after a power failure, so that information cannot be read back
reliably. If these writes were omitted, the presence of the log file
header alone would indicate that a commit failed to complete, and this
header should fit in a single sector. The key and data file sizes in
that header indicate what portion at the end of the key and data files
should be ignored.
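
To make that concrete, here is a minimal sketch of the recovery path I
have in mind (the header layout, the `recover` signature, and the POSIX
calls are my assumptions, not NuDB's actual on-disk format or code):

#include <cstdint>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Commit header, assumed to fit comfortably inside one 512-byte sector.
struct commit_header
{
    std::uint64_t key_file_size;   // key file length before the commit
    std::uint64_t data_file_size;  // data file length before the commit
};

// Roll back a possibly-incomplete commit. Returns false on an I/O error.
bool recover(char const* log_path, char const* key_path, char const* data_path)
{
    int lf = ::open(log_path, O_RDONLY);
    if(lf < 0)
        return true;               // no log file: the last commit completed
    commit_header h;
    ssize_t n = ::read(lf, &h, sizeof(h));
    ::close(lf);
    if(n != static_cast<ssize_t>(sizeof(h)))
    {
        // Torn header: the log was never fully sync'ed, so the commit had
        // not yet started appending to the key and data files.
        return ::unlink(log_path) == 0;
    }
    // Discard whatever was appended after the header was made durable.
    if(::truncate(key_path, static_cast<off_t>(h.key_file_size)) != 0)
        return false;
    if(::truncate(data_path, static_cast<off_t>(h.data_file_size)) != 0)
        return false;
    return ::unlink(log_path) == 0;
}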

If the log file header were moved to the beginning of the key file,
then after the records are appended and sync'ed, the header could be
overwritten with the new file lengths. The key file would not need to
be opened with the `_POSIX_SYNCHRONIZED_IO` flag, and an fsync after
writing the header could be omitted (which means a commit also gets
"relaxed" disk flushing). This should improve commit performance.
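
If I have the ordering right, the commit path would then look roughly
like this (again a sketch under the assumptions above; the header
layout and the `commit` signature are mine, not NuDB's):

#include <cstddef>
#include <cstdint>
#include <sys/types.h>
#include <unistd.h>

// First sector of the key file: the committed lengths of both files.
// Anything past those lengths is garbage until this header is rewritten.
struct key_file_header
{
    std::uint64_t key_file_size;
    std::uint64_t data_file_size;
};

// kf/df are descriptors for the key and data files; key_end/dat_end are
// the currently committed lengths recorded in the header.
bool commit(int kf, int df,
            void const* key_recs, std::size_t key_len, std::uint64_t key_end,
            void const* dat_recs, std::size_t dat_len, std::uint64_t dat_end)
{
    // 1. Append the new records past the committed lengths.
    if(::pwrite(kf, key_recs, key_len, static_cast<off_t>(key_end)) !=
           static_cast<ssize_t>(key_len) ||
       ::pwrite(df, dat_recs, dat_len, static_cast<off_t>(dat_end)) !=
           static_cast<ssize_t>(dat_len))
        return false;
    // 2. Make the appended records durable before the header points at them.
    if(::fsync(kf) != 0 || ::fsync(df) != 0)
        return false;
    // 3. Overwrite the header (a single sub-sector write) with new lengths.
    key_file_header h;
    h.key_file_size  = key_end + key_len;
    h.data_file_size = dat_end + dat_len;
    if(::pwrite(kf, &h, sizeof(h), 0) != static_cast<ssize_t>(sizeof(h)))
        return false;
    // No fsync here: if power is lost before the header reaches disk, the
    // old header still describes a consistent pre-commit state.
    return true;
}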

At this point I'm sure Niall will tell me that writing <512 bytes to
the beginning of the file is also not an atomic operation, which is
exactly what the SQLite documentation assumes. The power-loss recovery
algorithm gets tougher with this assumption.
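
One common way to cope with torn header writes (just a sketch of the
general technique, not something NuDB or SQLite does in this form) is
to keep two checksummed header slots and alternate between them, so a
torn write can only ever destroy the slot currently being written,
never the last good one:

#include <cstdint>
#include <cstring>

struct header_slot
{
    std::uint64_t sequence;        // incremented on every commit
    std::uint64_t key_file_size;   // committed key file length
    std::uint64_t data_file_size;  // committed data file length
    std::uint64_t checksum;        // over the three fields above
};

// FNV-1a over the payload; any checksum that detects a torn write will do.
std::uint64_t slot_checksum(header_slot const& s)
{
    unsigned char buf[24];
    std::memcpy(buf,      &s.sequence,       8);
    std::memcpy(buf + 8,  &s.key_file_size,  8);
    std::memcpy(buf + 16, &s.data_file_size, 8);
    std::uint64_t h = 14695981039346656037ULL;
    for(unsigned char c : buf)
        h = (h ^ c) * 1099511628211ULL;
    return h;
}

// After a crash, trust the valid slot with the highest sequence number.
// Returns nullptr if neither slot is intact (e.g. a brand-new file).
header_slot const* pick_slot(header_slot const& a, header_slot const& b)
{
    bool const va = slot_checksum(a) == a.checksum;
    bool const vb = slot_checksum(b) == b.checksum;
    if(va && vb) return a.sequence > b.sequence ? &a : &b;
    if(va) return &a;
    if(vb) return &b;
    return nullptr;
}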

> > - The key-file is append only so NuDB has multiple blocks of
> > sorted records. So hopefully the insertions do not cause lots of
> > blocks or the algorithm becomes a linear search.
>
> There's no sorting going on here, the key file is an on-disk hash
> table where each bucket holds a good-sized number of keys (so that a
> single read from the key file does the most work).

I jumped straight into the internal fetch function, which searches
sorted entries within a single bucket and follows a linked list of
spills. Reading the README first would've made it clear that there was
more to the implementation.

So the worst-case performance is a linked-list traversal on a hash
collision. Was performance the primary reason for choosing the default
hash implementation?
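
Just to check my understanding, the lookup shape I'm picturing is
something like the toy model below (an in-memory sketch, nothing like
NuDB's real on-disk structures):

#include <algorithm>
#include <cstdint>
#include <vector>

// Each bucket holds a bounded, sorted set of key hashes plus a link to
// an optional spill bucket for overflow.
struct bucket
{
    std::vector<std::uint64_t> hashes;  // sorted hashes in this bucket
    bucket const* spill = nullptr;      // next bucket in the spill chain
};

// One bucket read, a binary search, then a walk down the spill chain;
// the worst case degenerates to the linked-list traversal noted above.
bool fetch(bucket const& b, std::uint64_t h)
{
    for(bucket const* p = &b; p != nullptr; p = p->spill)
        if(std::binary_search(p->hashes.begin(), p->hashes.end(), h))
            return true;
    return false;
}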

Lee

