Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-29 11:07:01
On 29/03/2017 10:13, Asbjørn via Boost wrote:
> On 29.03.2017 08:18, Niall Douglas via Boost wrote:
>> Whatever is lost is lost, the *key* feature is that
>> damaged data doesn't cause further data loss.
> I'm struggling to see how you can guarantee that without _any_
> guarantees from the OS or hardware.
The lack of guarantees only refers to post-power-loss data integrity.
And, as you've mentioned, it's only a portability concern. Specific
combinations of OS kernel version, SCSI controller, SSD etc have
excellent guarantees. The trouble is you can't know whether your
particular combination works reliably or not, or whether it is still
working reliably or not.
For the implementation of Durability, one can assume that everything
works perfectly in between power loss events. That in itself is a bit
risky due to storage bit rot, cosmic ray bitflips and so on, but that's
a separate matter to Durability.
(incidentally, AFIO v2 provides a fast templated SECDED class letting
you repair bitflips from parity info, handy for mature cold storage)
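To illustrate what SECDED (single error correct, double error detect) means, here is a toy extended Hamming(8,4) code over a single nibble. It is only a sketch of the principle: it is not AFIO's class or API, and the names `secded_encode`, `secded_decode`, `Status` and `Decoded` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; this is NOT the AFIO v2 interface.
enum class Status { Clean, Corrected, Uncorrectable };
struct Decoded { Status status; std::uint8_t nibble; };

// Returns 1 if an odd number of bits are set in v, else 0.
inline unsigned parity8(std::uint8_t v) {
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

// Codeword layout (bit index == Hamming position):
//   bit 0: overall parity, bits 1,2,4: Hamming parity, bits 3,5,6,7: data.
inline std::uint8_t secded_encode(std::uint8_t nibble) {
    assert(nibble < 16);
    const unsigned d1 = nibble & 1u, d2 = (nibble >> 1) & 1u;
    const unsigned d3 = (nibble >> 2) & 1u, d4 = (nibble >> 3) & 1u;
    const unsigned p1 = d1 ^ d2 ^ d4;  // covers positions 3, 5, 7
    const unsigned p2 = d1 ^ d3 ^ d4;  // covers positions 3, 6, 7
    const unsigned p4 = d2 ^ d3 ^ d4;  // covers positions 5, 6, 7
    std::uint8_t cw = static_cast<std::uint8_t>(
        (p1 << 1) | (p2 << 2) | (d1 << 3) | (p4 << 4) |
        (d2 << 5) | (d3 << 6) | (d4 << 7));
    cw |= static_cast<std::uint8_t>(parity8(cw));  // make total parity even
    return cw;
}

inline Decoded secded_decode(std::uint8_t cw) {
    // The syndrome is the XOR of the positions of all set bits in 1..7.
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos < 8; ++pos)
        if ((cw >> pos) & 1u) syndrome ^= pos;

    Status st = Status::Clean;
    if (parity8(cw) != 0) {
        // Odd overall parity: exactly one bit flipped, at index `syndrome`
        // (syndrome 0 means the overall parity bit itself).
        cw ^= static_cast<std::uint8_t>(1u << syndrome);
        st = Status::Corrected;
    } else if (syndrome != 0) {
        // Even overall parity but non-zero syndrome: two bits flipped.
        st = Status::Uncorrectable;
    }
    const std::uint8_t nibble = static_cast<std::uint8_t>(
        ((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
        (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
    return {st, nibble};
}
```

Any single flipped bit is repaired, any two flipped bits are reported as uncorrectable, and three or more flips can be silently miscorrected, which is the usual SECDED limit.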
>>> If so, why throw it all away? Maybe the user has an OS, a filesystem and
>>> some hardware which can guarantee this?
>> Because a proper implementation of durability should be able to use no
>> fsync and no O_SYNC at all. In that case, you get "late durability"
>> where minutes of recent writes get lost after power loss. For users
>> where that is unacceptable, O_SYNC should be turned on and you now have
>> "early durability" where only seconds may be lost. You pay for that
>> early durability with much reduced performance.
> Without O_SYNC and fsync, replace "minutes" with "hours" or "days". This
> may be entirely unacceptable. With O_SYNC you get horrible performance
> as you note, which may be entirely unacceptable.
A filing system which takes hours to send dirty blocks to storage is
buggy or misconfigured. Most will write out dirty blocks within 30
seconds of modification whatever the situation.
You are probably referring to "live blocks". In the past, especially
on Linux, if you kept modifying the same block, each write would reset
its age counter and so the block might never be written out.
I don't believe any recent Linux kernel has that problem any more.
"First dirtied" timestamps are kept separate to "last dirtied" nowadays.
If a dirtied block is too old, it'll get flushed.
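The difference between the two ageing schemes can be shown with a toy model. This is a sketch only, not kernel code: `DirtyBlock`, `AgePolicy`, `should_flush` and `first_flush_time` are names invented here, time is in whole seconds, and the 30 second expiry is taken from the figure above.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of one dirty block in a write-back cache (hypothetical names).
struct DirtyBlock {
    std::int64_t first_dirtied;  // first write since the block was last clean
    std::int64_t last_dirtied;   // most recent write
};

enum class AgePolicy {
    SinceLastWrite,   // old behaviour: every write resets the age
    SinceFirstWrite   // newer behaviour: age counts from the first write
};

// True if the block is old enough to be written out at time `now`.
inline bool should_flush(const DirtyBlock& b, std::int64_t now,
                         std::int64_t expire, AgePolicy policy) {
    assert(b.first_dirtied <= b.last_dirtied && b.last_dirtied <= now);
    const std::int64_t since = (policy == AgePolicy::SinceLastWrite)
                                   ? b.last_dirtied
                                   : b.first_dirtied;
    return now - since >= expire;
}

// Simulates a block dirtied at t=0 and rewritten every `interval` seconds.
// Returns the time of its first flush, or -1 if it is never flushed
// within `duration` seconds.
inline std::int64_t first_flush_time(std::int64_t interval,
                                     std::int64_t duration,
                                     std::int64_t expire, AgePolicy policy) {
    DirtyBlock b{0, 0};
    for (std::int64_t now = 1; now <= duration; ++now) {
        if (should_flush(b, now, expire, policy)) return now;
        if (now % interval == 0) b.last_dirtied = now;  // another write lands
    }
    return -1;
}
```

With a block rewritten every 5 seconds and a 30 second expiry, `first_flush_time(5, 600, 30, AgePolicy::SinceLastWrite)` never reaches the expiry, while the same call with `AgePolicy::SinceFirstWrite` flushes at 30 seconds.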
I believe there is still a problem on Windows with FAT, where live blocks
may take far too long to hit storage. But most USB disks are mounted
with write-through semantics, so you shouldn't see that problem on most
modern systems, which don't tend to have FAT drives with writeback caching.
> Also, I'm assuming the hardware may ignore the O_SYNC as much as it can
> ignore the fsync, in which case you're SOL anyway.
Oh yes. You should also assume O_SYNC does nothing. On some systems, or
some configurations of systems (e.g. inside lxc containers) it really
does do nothing. Thankfully, most lxc containers I've seen in the wild
only disable fsync, not O_SYNC.
Which is another very good reason to not use fsync - people running your
code inside a lxc container get a false sense of security.
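For concreteness, on POSIX the choice between the "late" and "early" durability discussed earlier in the thread comes down to one open flag. A minimal sketch, assuming an append-only log file; `Durability`, `open_flags` and `open_log` are hypothetical names and this is not how NuDB or AFIO open their files:

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical knob: Late = let the OS write back when it likes,
// Early = ask for each write to reach storage before write() returns.
enum class Durability { Late, Early };

// Builds the open(2) flags for an append-only log file.
inline int open_flags(Durability d) {
    int flags = O_WRONLY | O_CREAT | O_APPEND;
    if (d == Durability::Early)
        flags |= O_SYNC;  // may be silently ignored by the OS or hardware
    return flags;
}

// Opens the log; returns a file descriptor, or -1 with errno set.
// Note there is deliberately no fsync() anywhere in this sketch.
inline int open_log(const char* path, Durability d) {
    assert(path != nullptr);
    return ::open(path, open_flags(d), 0644);
}
```

Nothing here can confirm that O_SYNC is honoured, so the on-disk format still has to tolerate losing recent writes in either mode.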
-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/