Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-26 11:11:14
On 26/03/2017 05:08, Lee Clagett via Boost wrote:
> On Sat, 25 Mar 2017 16:22:50 -0400
> Vinnie Falco via Boost <boost_at_[hidden]> wrote:
>> On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost
>> <boost_at_[hidden]> wrote:
>>> The other responses to this thread reiterated what I thought could
>>> occur - there should be corruption "races" from a write call to file
>>> sync completion.
>> NuDB makes the same assumptions regarding the underlying file system
>> capabilities as SQLite. In particular, if there are two calls to fsync
>> in a row, it assumes that the first fsync will complete before the
>> second one starts. And that upon return from a successful call to
>> fsync, the data has been written.
> I think SQLite makes more stringent assumptions - that between the
> write and the sync the file metadata and sectors can be written in any
> order. And one further - that a single sector could be partially
> written but only sequentially forwards or backwards. This last
> assumption sounds like a legacy assumption from spinning disks.
If you want durability across multiple OSs and filing systems, you need
to assume that fsync's are reordered with respect to one another. All
major databases assume this, and so should NuDB, or else NuDB needs to
remove all claims regarding durability of any kind. In fact, you might
as well remove the fsync code path entirely, from everything I've seen
to date its presence provides a false sense of assurance which is much
worse than providing no guarantees at all.
Additionally, strongly consider a O_SYNC based design instead of an
fsync based design. fsync() performs pathologically awful on
copy-on-write filing systems, it unavoidably forces a full RCU cycle of
multiple blocks. Opening the fd with O_SYNC causes COW filing systems to
use an alternate caching algorithm, one without pathological performance.
Note that O_SYNC on some filing systems still has the metadata
reordering problem. You should always assume that fsync/O_SYNC writes
are reordered with respect to one another across inodes. They are only
sequentially consistent within the same inode when performed on the same fd.
Again, I'd personally recommend you just remove all durability claims
entirely, and remove the code claiming to implement it as an unnecessary
overhead. You need to start with a design that assumes the filing system
reorders everything all the time, it can't be retrofitted.
>> When there is a power loss or device failure, it is possible that
>> recent insertions are lost. The library only guarantees that there
>> will be no corruption. Specifically, any insertions which happen after
>> a commit, might be rolled back if the recover process is invoked.
>> Since the commit process runs every second, not much will be lost.
>>> Writing the blocks to the log file are superfluous because it is
>>> writing to multiple sectors and there is no mechanism to detect a
>>> partial write after power failure.
>> Hmm, I don't think there's anything superfluous in this library. The
>> log file is a "rollback file." It contains blocks from the key file in
>> the state they were in before being modified. During the commit phase,
>> nothing in the key file is modified until all of the blocks intended
>> to be modified are first backed up to the log file. If the power goes
>> out while these blocks are written to the log file, there is no loss.
You can never assume writes to one inode will reach storage before
another in portable code. You can only assume in portable code that
writes to the same inode via the same fd will reach storage in the order
>>> Was the primary decision for the default hash implementation
>> If you're talking about xxhasher, it was chosen for being the best
>> balance of performance, good distribution properties, and decent
>> security. NuDB was designed to handle adversarial inputs since most
>> envisioned use-cases insert data from untrusted sources / the network.
In which case you did not make a great choice.
Much, much better would be Blake2b. 2 cycles/byte, cryptographically
secure, collision probability exceeds life of the universe.
-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk