Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Lee Clagett (forum_at_[hidden])
Date: 2017-03-30 12:34:07
On Wed, 29 Mar 2017 17:18:23 -0400
Vinnie Falco via Boost <boost_at_[hidden]> wrote:
> On Wed, Mar 29, 2017 at 5:04 PM, Niall Douglas via Boost
> <boost_at_[hidden]> wrote:
> > as that paper Lee linked to points out, everybody writing storage
> > algorithms - even the professionals - consistently gets sudden power
> > loss wrong without fail.
> > That paper found power loss bugs in the OS, filing systems, all the
> > major databases and source control implementations and so on. This
> > is despite all of those being written and tested very carefully for
> > power loss correctness. They ALL made mistakes.
> I have to agree. For now I am withdrawing NuDB from consideration -
> the paper that Lee linked is very informative.
> However just to make sure I have the scenario that Lee pointed out in
> my head, here's the sequence of events:
> 1. NuDB makes a system call to append to the file
> 2. The file system increases the metadata indicating the new file size
> 3. The file system writes the new data into the correct location
> Lee, and I think Niall (hard to tell through the noise), are saying
> that if a crash occurs after 2 but before or during 3, the contents of
> the new portion of the file may be undefined.
> Is this correct?
The best explanation of the problem I have seen is [described in a
paper discussing the implementation details of the ext filesystem on
linux]. The paper also discusses the various issues that the
filesystem designers have had to face, which has been helpful to me.
The thing to remember is that the filesystem metadata is not just the
filesize; the filesystem also has to record which portions of the disk
are in use for the file. This is why, after a crash during an append,
the file could contain old log file contents once the system restarts -
the filesystem added pointers to new sectors but not the data at those
sectors.
> If so, then I do need to go back and make improvements to prevent
> this. While it's true that I have not seen a corrupted database despite
> numerous production deployments and over 2TB data file, it would seem
> this case is sufficiently rare (and data-center hardware sufficiently
> reliable) that it is unlikely to have come up.
Yes it is likely pretty rare, especially on a journaled filesystem.
The system has to halt at a very specific point in time. This is why I
suggested swapping the order of the writes to:
write_buckets -> fsync -> write_header -> fsync
write_header(zeroes) -> fsync -> truncate(header_size - not `0`)
This still has implementation/system defined behavior, but overwriting
a single sector is more likely to be "atomic" from the perspective of
the filesystem (but not necessarily the hard-drive). And it didn't
require massive structural changes. Writing out a cryptographic hash of
the header would leave a single assumption - fsync is a proper write
barrier in the OS/filesystem and in the hard-drive. Niall has been
particularly harsh on fsync, but I do not think it's all bad. With the
exception of OSX, it seems that many filesystems implement it properly
(might regret saying this), and a user can purchase an "enterprise"
hard-drive that is not trying to artificially boost benchmark stats. At
the very least the number of assumptions has been decreased.
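
To make the ordering concrete, here is a rough sketch of the two
sequences using plain POSIX calls. It is not NuDB's code: the layout (a
fixed 512-byte header at offset 0 with the bucket records after it) and
the names kHeaderSize, write_all, sync_or_throw, commit_log and
clear_log are all made up for illustration. It also assumes the log
file already starts with a zeroed (invalid) header, or is empty, and
that fsync really is a write barrier.

```cpp
// Sketch only: layout, names and sizes are assumptions for illustration,
// not NuDB's actual on-disk format or API.
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

#include <sys/types.h>
#include <unistd.h>

// Hypothetical: the header occupies a single 512-byte sector at offset 0.
constexpr std::size_t kHeaderSize = 512;

// Write the whole buffer at the given offset, retrying on short writes
// and on EINTR.
inline void write_all(int fd, const std::uint8_t* data, std::size_t len,
                      off_t offset) {
    while (len > 0) {
        const ssize_t n = ::pwrite(fd, data, len, offset);
        if (n < 0) {
            if (errno == EINTR) continue;
            throw std::runtime_error("pwrite failed");
        }
        data += n;
        len -= static_cast<std::size_t>(n);
        offset += n;
    }
}

// Treat a failed fsync as fatal; the ordering below depends on it.
inline void sync_or_throw(int fd) {
    if (::fsync(fd) != 0) throw std::runtime_error("fsync failed");
}

// write_buckets -> fsync -> write_header -> fsync
// The header only becomes valid after the bucket data has been flushed.
inline void commit_log(int fd, const std::vector<std::uint8_t>& header,
                       const std::vector<std::uint8_t>& buckets) {
    assert(header.size() == kHeaderSize);
    write_all(fd, buckets.data(), buckets.size(),
              static_cast<off_t>(kHeaderSize));
    sync_or_throw(fd);
    write_all(fd, header.data(), header.size(), 0);
    sync_or_throw(fd);
}

// write_header(zeroes) -> fsync -> truncate(header_size - not 0)
// The header is invalidated in place, then the bucket data is dropped
// while the header sector itself stays allocated.
inline void clear_log(int fd) {
    const std::vector<std::uint8_t> zeroes(kHeaderSize, 0);
    write_all(fd, zeroes.data(), zeroes.size(), 0);
    sync_or_throw(fd);
    if (::ftruncate(fd, static_cast<off_t>(kHeaderSize)) != 0)
        throw std::runtime_error("ftruncate failed");
}
```

A caller would open the log file read/write, call commit_log() before
touching the key file, and call clear_log() once the key file changes
have themselves been fsync'd. Truncating to kHeaderSize rather than 0
means the next header write overwrites an already-allocated sector
instead of appending.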
FWIW, I _think_ Niall's suggestion to remove the log file might also be
interesting to investigate.