Boost logo

Boost :

Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Lee Clagett (forum_at_[hidden])
Date: 2017-03-26 04:08:09


On Sat, 25 Mar 2017 16:22:50 -0400
Vinnie Falco via Boost <boost_at_[hidden]> wrote:

> On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost
> <boost_at_[hidden]> wrote:
> > The other responses to this thread reiterated what I thought could
> > occur - there should be corruption "races" from a write call to file
> > sync completion.
>
> NuDB makes the same assumptions regarding the underlying file system
> capabilities as SQLite. In particular, if there are two calls to fsync
> in a row, it assumes that the first fsync will complete before the
> second one starts. And that upon return from a successful call to
> fsync, the data has been written.

I think SQLite makes more stringent assumptions - that between the
write and the sync the file metadata and sectors can be written in any
order. And one further - that a single sector could be partially
written but only sequentially forwards or backwards. This last
assumption sounds like a legacy assumption from spinning disks.

I have been (poorly) suggesting that the order of the new buckets in the
log and log file header be swapped. Writing the header which fits in a
single sector is _more_ likely to be all-or-nothing than the bucket
records. At least that's my understanding of this insane situation.

The recover method in NuDB appears to be assuming that if it can read
the entire bucket in the log, that its valid. Using the header to
"post" completed changes seems like it has a smaller window for
failure. If the header was not written correctly, truncate.

>
> When there is a power loss or device failure, it is possible that
> recent insertions are lost. The library only guarantees that there
> will be no corruption. Specifically, any insertions which happen after
> a commit, might be rolled back if the recover process is invoked.
> Since the commit process runs every second, not much will be lost.
>
> > Writing the blocks to the log file are superfluous because it is
> > writing to multiple sectors and there is no mechanism to detect a
> > partial write after power failure.
>
> Hmm, I don't think there's anything superfluous in this library. The
> log file is a "rollback file." It contains blocks from the key file in
> the state they were in before being modified. During the commit phase,
> nothing in the key file is modified until all of the blocks intended
> to be modified are first backed up to the log file. If the power goes
> out while these blocks are written to the log file, there is no loss.

I've proven twice in this thread I cannot read code quickly. Crap. I
thought the key file was append-only like the data file, which is not
possible. No idea why I thought this, other than it would make the
implementation easier. I think the only way to get rid of the log file
in this design is via a COW filesystem (but I have no experience with
those).

> > I jumped into the internal fetch function which was sorted within a
> > single bucket and had a linked list of spills. Reading the README
> > first would've made it clear that there was more to the
> > implementation.
>
> The documentation for NuDB needs work! I can only vouch for the
> maturity of the source code, not the documentation :)
>
> > So the worst case performance is a link-list if a hash collision.
>
> Right, although the creation parameters are tuned such that less than
> 1% of buckets have 1 spill record, and 0% of buckets have 2 or more
> spill records.
>
> > Was the primary decision for the default hash implementation
> > performance?
>
> If you're talking about xxhasher, it was chosen for being the best
> balance of performance, good distribution properties, and decent
> security. NuDB was designed to handle adversarial inputs since most
> envisioned use-cases insert data from untrusted sources / the network.
>

Yes, I asked because I was curious about the security analysis. The only
security analysis I could find quickly was the SMHasher test, and it
got the same score as `murmurhash 3a`. Murmurhash 3 was shown to
vulnerable to collisions, even with a random seed. I could not find the
significance of the 'a' (i.e. `3a`) listed in the table provided by
xxHash.

Lee


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk