Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-29 06:18:29


On 28/03/2017 17:58, Asbjørn via Boost wrote:
> On 28.03.2017 15:27, Niall Douglas via Boost wrote:
>>> I think there is a point where handling difficult filesystems and
>>> hardware is out of scope for this library.
>>
>> I agree. Just don't claim any guarantees about reliability or data
>> safety and I'm totally happy. I would then advise that if you're not
>> implementing durability, you might as well remove the inefficiency of
>> keeping write journals etc. and use a design which goes even faster.
>
> Isn't the point here that NuDB's guarantees depend on the OS, filesystem
> and hardware?

The point I am trying to make is that NuDB's guarantees need to NOT
depend on the OS, filesystem and hardware. Otherwise they are not
valuable guarantees.

You can and should make your durability implementation totally
independent of the system you are running on. It should behave as
correctly as possible in response to whatever corrupted state turns up
after power loss. Whatever is lost is lost; the *key* feature is that
damaged data doesn't cause further data loss.
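To make "damaged data doesn't cause further data loss" concrete, here is a
minimal sketch of one common way to get there, assuming a hypothetical
append-only log of framed records (this is NOT NuDB's on-disk format). Each
record carries its length and a checksum; on recovery the scan keeps the
longest intact prefix and stops at the first torn or corrupted record, so
garbage is never interpreted as valid data. FNV-1a is used only to keep the
example short; a real implementation would want CRC32C or something stronger.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record layout: [len:u32 LE][fnv1a(payload):u32 LE][payload bytes]
static std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

static void put_u32(std::string& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
}

static std::uint32_t get_u32(const std::string& in, std::size_t pos) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
    return v;
}

// Append one framed record to an in-memory image of the log.
void append_record(std::string& log, const std::string& payload) {
    put_u32(log, static_cast<std::uint32_t>(payload.size()));
    put_u32(log, fnv1a(payload));
    log += payload;
}

// Return the longest prefix of intact records. A truncated header, a
// truncated payload or a checksum mismatch ends the scan: everything after
// that point is treated as lost rather than trusted.
std::vector<std::string> recover(const std::string& log) {
    std::vector<std::string> records;
    std::size_t pos = 0;
    while (log.size() - pos >= 8) {
        std::uint32_t len = get_u32(log, pos);
        std::uint32_t sum = get_u32(log, pos + 4);
        if (log.size() - pos - 8 < len) break;   // torn write: payload incomplete
        std::string payload = log.substr(pos + 8, len);
        if (fnv1a(payload) != sum) break;         // corrupted record
        records.push_back(payload);
        pos += 8 + len;
    }
    return records;
}
```

Note that recovery needs no help from fsync or O_SYNC to stay correct; those
only change how much of the tail survives.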

> If the OS, filesystem and hardware guarantee that fsyncs
> won't be reordered, then NuDB can guarantee that it is durable (modulo
> any misunderstanding on my part).

If those guarantees held on all systems with all hardware, then yes: with
careful writing you can achieve durability using write ordering. BSD's UFS
is the classic example of such an implementation.
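For reference, the fsync-ordered approach under discussion looks roughly like
the sketch below, assuming POSIX and two hypothetical file descriptors (a
data file and a journal); the function names are illustrative, not NuDB's
API. The commit marker is only meaningful if the first fsync really acted as
a barrier, which is exactly the assumption being questioned here.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>

// Write all bytes, retrying on short writes and EINTR.
static bool write_all(int fd, const char* p, std::size_t n) {
    while (n > 0) {
        ssize_t w = ::write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        p += w;
        n -= static_cast<std::size_t>(w);
    }
    return true;
}

// Hypothetical ordered commit:
//   1. write the payload, 2. fsync the data file,
//   3. write a commit marker to the journal, 4. fsync the journal.
// Recovery would trust the payload only if the marker is present, which is
// safe only if step 2 truly reached storage before step 3 did.
bool ordered_commit(int data_fd, int journal_fd,
                    const std::string& payload, const std::string& commit_marker) {
    if (!write_all(data_fd, payload.data(), payload.size())) return false;
    if (::fsync(data_fd) != 0) return false;                 // barrier #1
    if (!write_all(journal_fd, commit_marker.data(), commit_marker.size())) return false;
    return ::fsync(journal_fd) == 0;                         // barrier #2
}
```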

But it is far better not to rely on fsync working at all. Remember, fsync
is permitted to do nothing, and it does return before the data reaches
storage on at least one major OS (OS X).

> If so, why throw it all away? Maybe the user has an OS, a filesystem and
> some hardware which can guarantee this?

Because a proper implementation of durability should work correctly with
neither fsync nor O_SYNC. In that case you get "late durability", where
minutes of recent writes get lost after power loss. For users for whom
that is unacceptable, O_SYNC should be turned on and you now have "early
durability", where only seconds may be lost. You pay for that early
durability with much reduced performance.
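The late/early switch can be as small as one open flag. A minimal sketch,
assuming POSIX and a hypothetical append-only log file; `open_log` and the
`durability` enum are illustrative names, not part of NuDB. The on-disk
format and recovery logic stay the same in both modes; only how much of the
tail survives power loss changes.

```cpp
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical durability modes:
//   late:  plain buffered writes; the OS flushes dirty pages when it chooses.
//   early: O_SYNC; each write() returns only after synchronized I/O file
//          integrity completion, as far as the OS can ensure it (a drive's
//          volatile write cache may still get in the way).
enum class durability { late, early };

// Open (creating if needed) an append-only log in the requested mode.
// Returns the file descriptor, or -1 on error with errno set.
int open_log(const std::string& path, durability mode) {
    int flags = O_WRONLY | O_CREAT | O_APPEND;
    if (mode == durability::early) flags |= O_SYNC;
    return ::open(path.c_str(), flags, 0644);
}
```

No fsync call appears anywhere: in early mode every write pays the cost up
front, which is where the reduced performance comes from.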

fsync is the worst method you can use. It has the worst semantics, the
worst performance and is unreliable. Everybody should be using O_SYNC
where possible instead, and your code should still work perfectly
with neither O_SYNC nor fsync working. Then your implementation can
correctly claim to implement durability.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/