Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Lee Clagett (forum_at_[hidden])
Date: 2017-03-26 13:57:39
On Sun, 26 Mar 2017 12:11:14 +0100
Niall Douglas via Boost <boost_at_[hidden]> wrote:
> On 26/03/2017 05:08, Lee Clagett via Boost wrote:
> > On Sat, 25 Mar 2017 16:22:50 -0400
> > Vinnie Falco via Boost <boost_at_[hidden]> wrote:
> >> On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost
> >> <boost_at_[hidden]> wrote:
> >>> The other responses to this thread reiterated what I thought could
> >>> occur - there should be corruption "races" from a write call to
> >>> file sync completion.
> >> NuDB makes the same assumptions regarding the underlying file
> >> system capabilities as SQLite. In particular, if there are two
> >> calls to fsync in a row, it assumes that the first fsync will
> >> complete before the second one starts. And that upon return from a
> >> successful call to fsync, the data has been written.
> > I think SQLite makes more stringent assumptions - that between the
> > write and the sync the file metadata and sectors can be written in
> > any order. And one further - that a single sector could be partially
> > written but only sequentially forwards or backwards. This last
> > assumption sounds like a legacy assumption from spinning disks.
> If you want durability across multiple OSs and filing systems, you
> need to assume that fsync's are reordered with respect to one
> another. All major databases assume this, and so should NuDB, or else
> NuDB needs to remove all claims regarding durability of any kind. In
> fact, you might as well remove the fsync code path entirely: from
> everything I've seen to date, its presence provides a false sense of
> assurance, which is much worse than providing no guarantees at all.
You snipped my suggestion after this for swapping the order of the
header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` due to
the reasons you established previously. The only caveat you gave
earlier was that Linux did not implement this properly for the metadata
flushing on some of its file systems. Which means the log file could be
"empty" after a restart in the current implementation, even if both
fsync checkpoints completed.
At that point the hope on my part was that the sector metadata for the
log file header remained unchanged on the overwrite. And also hope
that the log file open on initialization with a zeroed header would
provide ample time for a metadata flush. This was a crappy race
condition (a longer window, but still a race), AND of course there are
the COW filesystems, which have to change the sector for the feature!
And since `_POSIX_SYNCHRONIZED_IO` is (apparently) partially
meaningless on Linux, BTRFS is madness.
So the write order swap suggestion _possibly_ improves durability on
some subset of filesystems. But most users are going to be on ext4 on
Linux anyway, which sounds like one of the problematic cases.
> Additionally, strongly consider a O_SYNC based design instead of an
> fsync based design. fsync() performs pathologically badly on
> copy-on-write filing systems; it unavoidably forces a full RCU cycle
> of multiple blocks. Opening the fd with O_SYNC causes COW filing
> systems to use an alternate caching algorithm, one without
> pathological performance.
> Note that O_SYNC on some filing systems still has the metadata
> reordering problem. You should always assume that fsync/O_SYNC writes
> are reordered with respect to one another across inodes. They are only
> sequentially consistent within the same inode when performed on the
> same fd.
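
To make the contrast concrete, here is a minimal POSIX sketch of the two
styles discussed above: buffered writes followed by an explicit fsync()
checkpoint, versus opening the fd with O_SYNC so that each write() returns
only after the OS reports the data as written. The function names and the
0644 mode are made up for illustration and are not taken from NuDB.
Neither version does anything about reordering across inodes.

```cpp
#include <fcntl.h>     // open, O_WRONLY, O_CREAT, O_TRUNC, O_SYNC
#include <unistd.h>    // write, fsync, close
#include <cerrno>
#include <cstddef>

// Write the whole buffer, retrying on short writes and EINTR.
static bool write_all(int fd, const char* data, std::size_t size) {
    while (size > 0) {
        ssize_t n = ::write(fd, data, size);
        if (n < 0 && errno == EINTR) continue;  // interrupted; retry
        if (n <= 0) return false;               // real error or no progress
        data += n;
        size -= static_cast<std::size_t>(n);
    }
    return true;
}

// Style 1: ordinary buffered writes, then one explicit fsync() checkpoint.
bool write_then_fsync(const char* path, const char* data, std::size_t size) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    bool ok = write_all(fd, data, size) && ::fsync(fd) == 0;
    return (::close(fd) == 0) && ok;
}

// Style 2: O_SYNC on the fd, so every write() is itself synchronous
// and no separate fsync() call is issued.
bool write_with_o_sync(const char* path, const char* data, std::size_t size) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0) return false;
    bool ok = write_all(fd, data, size);
    return (::close(fd) == 0) && ok;
}
```

Both functions return true only if every step, including close(), reported
success. Whether "success" means the bytes are actually on stable storage is
the filesystem-dependent question this thread is about.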
> Again, I'd personally recommend you just remove all durability claims
> entirely, and remove the code claiming to implement it as an
> unnecessary overhead. You need to start with a design that assumes
> the filing system reorders everything all the time, it can't be
> >> When there is a power loss or device failure, it is possible that
> >> recent insertions are lost. The library only guarantees that there
> >> will be no corruption. Specifically, any insertions which happen
> >> after a commit, might be rolled back if the recover process is
> >> invoked. Since the commit process runs every second, not much will
> >> be lost.
> >>> Writing the blocks to the log file is superfluous because it is
> >>> writing to multiple sectors and there is no mechanism to detect a
> >>> partial write after power failure.
> >> Hmm, I don't think there's anything superfluous in this library.
> >> The log file is a "rollback file." It contains blocks from the key
> >> file in the state they were in before being modified. During the
> >> commit phase, nothing in the key file is modified until all of the
> >> blocks intended to be modified are first backed up to the log
> >> file. If the power goes out while these blocks are written to the
> >> log file, there is no loss.
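
For readers following along, the ordering described above (back up the
original blocks to the log, then modify the key file, then discard the log)
can be sketched roughly as below. This is NOT NuDB's actual code or file
format: the block size, the record layout (block index followed by the
original bytes), and all function names are hypothetical. The fsync
placement follows the two-checkpoint assumption quoted earlier in the
thread, which is exactly the assumption under dispute. Note that recover()
has no checksum, so it cannot tell a torn log record from a good one, which
is the objection raised above.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 16;  // hypothetical; tiny for illustration
constexpr off_t kIndexSize = sizeof(std::uint64_t);
constexpr off_t kRecordSize = kIndexSize + static_cast<off_t>(kBlockSize);

struct BlockUpdate {
    std::uint64_t index;     // which key-file block to overwrite
    char data[kBlockSize];   // new contents for that block
};

// Read exactly n bytes at offset off; false on error or short read.
static bool pread_all(int fd, void* buf, std::size_t n, off_t off) {
    char* p = static_cast<char*>(buf);
    while (n > 0) {
        ssize_t r = ::pread(fd, p, n, off);
        if (r <= 0) return false;
        p += r; off += r; n -= static_cast<std::size_t>(r);
    }
    return true;
}

// Write exactly n bytes at offset off; false on error.
static bool pwrite_all(int fd, const void* buf, std::size_t n, off_t off) {
    const char* p = static_cast<const char*>(buf);
    while (n > 0) {
        ssize_t r = ::pwrite(fd, p, n, off);
        if (r <= 0) return false;
        p += r; off += r; n -= static_cast<std::size_t>(r);
    }
    return true;
}

static off_t block_offset(std::uint64_t index) {
    return static_cast<off_t>(index * kBlockSize);
}

// Step 1: save the current contents of every block about to change,
// then fsync the log (first checkpoint). Assumes the log starts empty.
bool backup_to_log(int key_fd, int log_fd, const std::vector<BlockUpdate>& updates) {
    off_t log_off = 0;
    for (const BlockUpdate& u : updates) {
        char old[kBlockSize];
        if (!pread_all(key_fd, old, kBlockSize, block_offset(u.index))) return false;
        if (!pwrite_all(log_fd, &u.index, sizeof u.index, log_off)) return false;
        if (!pwrite_all(log_fd, old, kBlockSize, log_off + kIndexSize)) return false;
        log_off += kRecordSize;
    }
    return ::fsync(log_fd) == 0;
}

// Step 2: overwrite the key-file blocks, then fsync (second checkpoint).
bool apply_updates(int key_fd, const std::vector<BlockUpdate>& updates) {
    for (const BlockUpdate& u : updates)
        if (!pwrite_all(key_fd, u.data, kBlockSize, block_offset(u.index))) return false;
    return ::fsync(key_fd) == 0;
}

// Step 3: an empty log means "nothing to roll back".
bool discard_log(int log_fd) {
    return ::ftruncate(log_fd, 0) == 0 && ::fsync(log_fd) == 0;
}

bool commit(int key_fd, int log_fd, const std::vector<BlockUpdate>& updates) {
    return backup_to_log(key_fd, log_fd, updates)
        && apply_updates(key_fd, updates)
        && discard_log(log_fd);
}

// Recovery: copy every complete saved record back into the key file.
// A trailing partial record is ignored; a torn record is NOT detected.
bool recover(int key_fd, int log_fd) {
    off_t end = ::lseek(log_fd, 0, SEEK_END);
    if (end < 0) return false;
    for (off_t off = 0; off + kRecordSize <= end; off += kRecordSize) {
        std::uint64_t index;
        char old[kBlockSize];
        if (!pread_all(log_fd, &index, sizeof index, off)) return false;
        if (!pread_all(log_fd, old, kBlockSize, off + kIndexSize)) return false;
        if (!pwrite_all(key_fd, old, kBlockSize, block_offset(index))) return false;
    }
    return ::fsync(key_fd) == 0 && discard_log(log_fd);
}
```

A crash between apply_updates() and discard_log() leaves a non-empty log, so
the next recover() restores the old blocks. That only holds if the log really
reached storage before the key file was touched, which is the cross-inode
ordering that cannot be assumed in portable code.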
> You can never assume writes to one inode will reach storage before
> another in portable code. You can only assume in portable code that
> writes to the same inode via the same fd will reach storage in the
> order issued.
You chopped my response here too, and I think this was in response to
the COW + inode suggestion. If the design knew COW was available for
the filesystem in use, couldn't it also know whether data + metadata
is synchronized as expected? The suggestion clearly was not portable
> >>> Was the primary decision for the default hash implementation
> >>> performance?
> >> If you're talking about xxhasher, it was chosen for being the best
> >> balance of performance, good distribution properties, and decent
> >> security. NuDB was designed to handle adversarial inputs since most
> >> envisioned use-cases insert data from untrusted sources / the
> >> network.
> In which case you did not make a great choice.
> Much, much better would be Blake2b: 2 cycles/byte, cryptographically
> secure, and a collision is not expected within the life of the universe.