Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-29 14:31:14
On 29/03/2017 14:36, Peter Dimov via Boost wrote:
> Niall Douglas wrote:
>> Because a proper implementation of durability should be able to use no
>> fsync and no O_SYNC at all. In that case, you get "late durability"
>> where minutes of recent writes get lost after power loss. For users
>> where that is unacceptable, O_SYNC should be turned on and you now
>> have "early durability" where only seconds may be lost.
> No such thing.
It sure is. For example,
Also, Linux, FreeBSD and even Windows let you tune when dirty data
is required to be sent to storage. That's late vs early durability too.
In all storage algorithmic code, there is always a tension between late
and early durability. There is no such thing as perfect durability, or
rather, it is impossible to define what perfect durability would be
given all the moving parts. So you have "as soon as possible" durability
which usually comes with mediocre performance, or "somewhat later"
durability with improving performance, with a gradual tradeoff continuum
in between.
When designing durable storage algorithms, you choose where on that
continuum you want to be, and you usually add a few knobs the user can
adjust.
> "The durability property ensures that once a transaction has been
> committed, it will remain so, even in the event of power loss, crashes,
> or errors. In a relational database, for instance, once a group of SQL
> statements execute, the results need to be stored permanently (even if
> the database crashes immediately thereafter)."
> If you lose writes, it's not Durable, just Consistent.
Consistency refers to the referential integrity of the database. So,
after power loss, if a database is Consistent then all its references
are correct. Think of it like a directory containing references to
inodes which don't exist. If that is never the case after power loss,
your filing system implements Consistency.
Durability refers to the writes themselves: does each transaction group
appear in whole after power loss, or not?
So if transaction A updates references which are used by transaction B,
and if after power loss transaction A was damaged and transaction B was
not, if you are Consistent then you also need to throw away transaction
B during recovery. But if transaction B did not use any inputs from
modifications by transaction A, then if you are Durable you MUST recover
transaction B even though it occurred after the damaged transaction A
which is thrown away.
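To make the A/B example concrete, here is a minimal sketch of that recovery
rule. The names (Txn, reads, recoverable) are hypothetical and come from no
real library, and it assumes the log is scanned in commit order so that a
transaction only ever depends on earlier ones.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one transaction found during recovery.
struct Txn {
    bool damaged;                    // true if its writes did not survive intact
    std::vector<std::size_t> reads;  // indices of earlier transactions whose output it used
};

// Returns keep[i] == true if transaction i should be recovered.
// Consistency: anything that used a discarded transaction is discarded too.
// Durability: anything intact and independent of the damage is kept,
// even if it committed after a damaged transaction.
std::vector<bool> recoverable(const std::vector<Txn>& txns) {
    std::vector<bool> keep(txns.size(), false);
    for (std::size_t i = 0; i < txns.size(); ++i) {
        bool ok = !txns[i].damaged;
        for (std::size_t dep : txns[i].reads) {
            assert(dep < i);       // assumption: dependencies point backwards only
            ok = ok && keep[dep];  // a discarded input poisons this transaction
        }
        keep[i] = ok;
    }
    return keep;
}
```

With a damaged transaction at index 0, a transaction at index 1 that read from
it, and an independent transaction at index 2, only index 2 is kept.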
Does this make sense now?
> The page you linked, https://www.sqlite.org/howtocorrupt.html, says
> "Actually, if one is only concerned with atomic and consistent writes
> and is willing to forego durable writes, the sync operation does not
> need to wait until the content is completely stored on persistent media.
> Instead, the sync operation can be thought of as an I/O barrier. As long
> as all writes that occur before the sync are completed before any write
> that happens after the sync, no database corruption will occur. If sync
> is operating as an I/O barrier and not as a true sync, then a power
> failure or system crash might cause one or more previously committed
> transactions to roll back (in violation of the "durable" property of
> "ACID") but the database will at least continue to be consistent, and
> that is what most people care about."
fsync may be a reordering barrier per inode on some systems. It is
rarely a reordering barrier across inodes, and as I keep saying but
nobody appears to be listening, in portable code you should assume that
fsync does nothing. POSIX allows fsync to do nothing, and
common-in-the-wild configurations such as lxc containers routinely make
fsync into a noop.
So please stop saying "well only if we do X with fsync then it'll work".
No it won't. Assume fsync = noop. Proceed from there when designing high
quality, Boost-ready code.
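One way to proceed from that assumption is to have recovery validate what it
finds on storage instead of trusting write ordering. The sketch below is an
illustration only, not a design taken from this thread: Record, make_record
and recover are hypothetical names, and FNV-1a stands in for whatever stronger
hash a real store would use.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// 32-bit FNV-1a, used here purely as a placeholder checksum.
std::uint32_t fnv1a(const std::string& data) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}

// Hypothetical on-storage record: the payload plus a checksum over it.
struct Record {
    std::string payload;
    std::uint32_t checksum;
};

Record make_record(const std::string& payload) {
    return Record{payload, fnv1a(payload)};
}

// Recovery never assumes that an earlier record reached storage just
// because a later one did. Each record is accepted or rejected on its
// own checksum.
std::vector<std::string> recover(const std::vector<Record>& on_storage) {
    std::vector<std::string> good;
    for (const Record& r : on_storage) {
        if (fnv1a(r.payload) == r.checksum) {
            good.push_back(r.payload);
        }
    }
    return good;
}
```

A record whose payload was torn by power loss fails its checksum and is
dropped, while intact records on either side of it are still recovered.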
-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/