Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Lee Clagett (forum_at_[hidden])
Date: 2017-03-30 12:51:15


On Wed, 29 Mar 2017 18:06:02 +0100
Niall Douglas via Boost <boost_at_[hidden]> wrote:
> On 29/03/2017 17:32, Lee Clagett via Boost wrote:
> > Read this [paper on crash-consistent applications][0]. Table 1 on
> > page 5
>
> I particularly like the sentence:
>
> "However, not issuing such an fsync() is perhaps more safe in modern
> file systems than out-of-order persistence of directory
> operations. We believe the developers’ interest in fixing
> this problem arises from the Linux documentation explicitly
> recommending an fsync() after creating a file."

I think in this instance the authors were referring to the
recommendation to fsync a file after creation. The paper is primarily
about the properties of the filesystem, not lies by the hardware.

Later on they comment on how developers generally disregard fsync as
being unreliable, but that it's possible the root cause of their
problems is an incorrect assumption about filesystem properties/behavior
(based on the number of problems they found in common software).

> I agree with them. fsync() gives false assurance. Better to not use
> it, and certainly never rely on it.
>
> > should be of particular interest. I _think_ the bucket portion of
> > NuDB's log has no size constraint, so its algorithm is either going
> > to be "single sector append", "single block append", or "multi-block
> > append/writes" depending on the total size of the buckets. The
> > algorithm is always problematic when metadata journaling is
> > disabled. Your assumptions of fsync have not been violated to
> > achieve those inconsistencies.
>
> One of my biggest issues with NuDB is the log file. Specifically, it's
> worse than useless, it actively interferes with database integrity.
>
> If you implemented NuDB as a simple data file and a memory mapped key
> file and always atomic appended transactions to the data file when
> inserting items, then after power loss you could check if the key file
> mentions extents not possible given the size of the data file. You
> then can rebuild the key file simply by replaying through the data
> file, being careful to ignore any truncated final append.
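A sketch of the replay step described above, in C++. The record framing
(a length prefix before each payload) is my assumption for illustration,
not NuDB's actual on-disk format:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed framing for the append-only data file: each record is
// [u32 payload size][payload bytes]. Walk the file in order and stop
// at the first record whose length field or payload would run past
// the end of the file -- i.e. a truncated final append after power
// loss. Returns the offsets of all complete records, from which the
// key file could be rebuilt.
std::vector<std::size_t> replay_offsets(const std::vector<std::uint8_t>& file)
{
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while (pos + sizeof(std::uint32_t) <= file.size())
    {
        std::uint32_t size;
        std::memcpy(&size, file.data() + pos, sizeof(size));
        if (pos + sizeof(size) + size > file.size())
            break; // truncated final append: ignore it
        offsets.push_back(pos);
        pos += sizeof(size) + size;
    }
    return offsets;
}
```

Rebuilding the key file would then just mean re-inserting the (key,
offset) pair found at each returned offset.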

I think this trades an atomic `truncate(0)` assumption for an atomic
multi-block overwrite assumption. So this seems more likely to produce
a torn write that is hard to notice.

> That would be a reasonable power loss recovery algorithm. A little
> slow to do recovery for large databases, but safe, reliable,
> predictable and it would only run on a badly closed database. You can
> also turn off fsync entirely, and let the atomic appends land on
> storage in an order probably close to the append order. Ought to be
> quicker than NuDB by a fair bit, much fewer i/o ops, simpler design.

How would it notice that a bucket was partially overwritten though?
Wouldn't it have to _always_ inspect the entire key file?
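To illustrate the point: even if each bucket carried its own checksum
(the fixed bucket size and trailing-checksum layout here are my
assumptions, not NuDB's actual format), detecting a torn multi-block
overwrite still means reading and verifying every bucket, because the
torn write can land anywhere in the key file:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t bucket_size = 4096; // assumed bucket size

// Simple additive checksum over a bucket's payload (a placeholder for
// a real CRC); the last 8 bytes of each bucket hold the stored value.
std::uint64_t payload_sum(const std::uint8_t* b)
{
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < bucket_size - 8; ++i)
        sum += b[i];
    return sum;
}

// Returns the indices of buckets whose stored checksum does not match
// the payload. Note this is O(key file size): a full scan on every
// recovery, which is the cost being questioned above.
std::vector<std::size_t> torn_buckets(const std::vector<std::uint8_t>& key_file)
{
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i + bucket_size <= key_file.size(); i += bucket_size)
    {
        const std::uint8_t* b = key_file.data() + i;
        std::uint64_t stored;
        std::memcpy(&stored, b + bucket_size - 8, 8);
        if (stored != payload_sum(b))
            bad.push_back(i / bucket_size);
    }
    return bad;
}
```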

Lee


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk