Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Lee Clagett (forum_at_[hidden])
Date: 2017-03-28 12:45:04
On Sun, 26 Mar 2017 23:07:14 +0100
Niall Douglas via Boost <boost_at_[hidden]> wrote:
> > You snipped my suggestion after this for swapping the order of the
> > header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` due to
> > the reasons you established previously. The only caveat you gave
> > earlier was that Linux did not implement this properly for the
> > metadata flushing on some of its file systems. Which means the log
> > file could be "empty" after a restart in the current
> > implementation, even if both fsync checkpoints completed.
> The reason I snipped it was because the original algorithm is broken,
> and so is yours. You are not conceptualising the problem correctly:
> consider storage after sudden power loss to be exactly the same as a
> malicious attacker on the internet capable of manipulating bits on
> storage to any value in order to cause misoperation. That includes
> making use of collisions in weak hashes to direct your power loss
> recovery implementation to sabotage and destroy additional data after
> the power loss.
I do understand the problem - I stated 3 times in this thread that even
writing the small header to a single sector could still be problematic.
My suggestions were primarily trying to tweak the existing design to
improve durability with minimal impact. If I convinced Vinnie (and
perhaps even myself) that writing the log header after its contents
could reduce the probability of an undetected incomplete write, the next
obvious suggestion was to append a cryptographic hash of the header(s).
The buckets in the log would then be valid, assuming fsync blocks until
metadata + data completion. Even if the hardware lies about immediately
writing to the physical medium, this should still reduce the time window
in which data loss can occur. Hashing over the
entire log file would be a more portable/optimal solution, but adds
_more_ CPU time and would deviate from the current implementation a bit.
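To make the ordering concrete, here is a minimal sketch of writing the log contents first and the header last, with a digest over both. Everything in it is an assumption for illustration: the `log_header` layout, the `SKETCHLG` magic and the function names are hypothetical and are not NuDB's format, and FNV-1a stands in for the cryptographic hash only to keep the example self-contained (a real version would swap in something like SHA-256). It also assumes POSIX `pwrite`, `pread` and `fsync`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical log header for illustration; not NuDB's on-disk format.
struct log_header {
    char          magic[8];       // marks a header that was actually written
    std::uint64_t body_size;      // bytes of bucket data after the header
    std::uint64_t body_digest;    // digest of the bucket data
    std::uint64_t header_digest;  // digest of the three fields above
};

// FNV-1a, used as a placeholder only. It is NOT a cryptographic hash.
inline std::uint64_t digest(const void* p, std::size_t n) {
    const unsigned char* b = static_cast<const unsigned char*>(p);
    std::uint64_t h = 14695981039346656037ull;
    for (std::size_t i = 0; i < n; ++i) { h ^= b[i]; h *= 1099511628211ull; }
    return h;
}

// pwrite() may write fewer bytes than requested, so loop until done.
inline bool write_all(int fd, const void* p, std::size_t n, off_t off) {
    const char* b = static_cast<const char*>(p);
    while (n > 0) {
        ssize_t w = ::pwrite(fd, b, n, off);
        if (w <= 0) return false;
        b += w; n -= static_cast<std::size_t>(w); off += w;
    }
    return true;
}

// Write the body first, sync, then write the header, sync again.
// Until the second step lands, the header region reads back as zeros.
inline bool write_log(const std::string& path, const std::vector<char>& body) {
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    log_header h{};
    std::memcpy(h.magic, "SKETCHLG", 8);
    h.body_size = body.size();
    h.body_digest = digest(body.data(), body.size());
    h.header_digest = digest(&h, offsetof(log_header, header_digest));
    bool ok = write_all(fd, body.data(), body.size(), sizeof h)
           && ::fsync(fd) == 0                  // checkpoint 1: bucket data
           && write_all(fd, &h, sizeof h, 0)
           && ::fsync(fd) == 0;                 // checkpoint 2: header
    return (::close(fd) == 0) && ok;
}

// Recovery-side check: accept the log only if the header and body both verify.
inline bool log_is_complete(const std::string& path) {
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    log_header h{};
    bool ok = ::pread(fd, &h, sizeof h, 0) == static_cast<ssize_t>(sizeof h)
           && std::memcmp(h.magic, "SKETCHLG", 8) == 0
           && h.header_digest == digest(&h, offsetof(log_header, header_digest));
    if (ok) {
        // Compare against the file length before trusting body_size.
        off_t end = ::lseek(fd, 0, SEEK_END);
        ok = end >= 0 && static_cast<std::uint64_t>(end) == sizeof h + h.body_size;
    }
    if (ok) {
        std::vector<char> body(h.body_size);
        ok = ::pread(fd, body.data(), body.size(), sizeof h)
                 == static_cast<ssize_t>(body.size())
          && digest(body.data(), body.size()) == h.body_digest;
    }
    ::close(fd);
    return ok;
}
```

Recovery code could then treat any log whose header is missing, zeroed, or fails either digest as incomplete and discard it. This only helps to the extent that `fsync` really blocks until data and metadata are stored, which is the assumption discussed above.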
I think there is a point where handling difficult filesystems and
hardware is out of scope for this library. If the library cannot assume
that a returned fsync call means the hardware "stored" the data
+ metadata, it could make the implementation more complex/costly.
Checking for `_POSIX_SYNCHRONIZED_IO` and calling an OSX `fcntl`
instead of `fsync` is probably the limit of actions a library like NuDB
should take.
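A minimal sketch of what that limit might look like follows. The helper names `durable_sync` and `has_synchronized_io` are hypothetical, and the fallback policy (retry with `fsync` when `F_FULLFSYNC` is rejected) is an assumption rather than anything NuDB does today. `F_FULLFSYNC` is the macOS `fcntl` command that asks the drive to flush its cache, which plain `fsync` on that platform does not do.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: the strongest sync this sketch knows how to ask for.
inline bool durable_sync(int fd) {
#if defined(__APPLE__) && defined(F_FULLFSYNC)
    // Ask the device to flush its write cache as well.
    if (::fcntl(fd, F_FULLFSYNC) != -1) return true;
    // Not every filesystem supports F_FULLFSYNC; fall back to fsync.
    return ::fsync(fd) == 0;
#else
    return ::fsync(fd) == 0;
#endif
}

// Hypothetical helper: does the platform advertise POSIX synchronized I/O?
// A positive macro value means "always supported"; otherwise ask at runtime.
inline bool has_synchronized_io() {
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    return true;
#elif defined(_SC_SYNCHRONIZED_IO)
    return ::sysconf(_SC_SYNCHRONIZED_IO) > 0;
#else
    return false;
#endif
}
```

A caller could log a warning or refuse to open a database when `has_synchronized_io()` is false, and otherwise call `durable_sync(fd)` wherever it calls `fsync(fd)` now. Neither helper can detect hardware or filesystems that report completion early.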
NuDB already has a file concept that needs documenting and formalizing
before any potential Boost review. Handling for these harder edge cases
could be provided by an implementation of this concept instead of by
NuDB directly.
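One way to picture this is a generic commit routine that works with any type meeting a small set of file requirements. The sketch below is purely illustrative: the two-function interface (`write`, `sync`), the types `plain_file` and `checked_file`, and the `commit` template are hypothetical, and NuDB's real File concept has a different and larger interface. The files are in-memory stand-ins so the example is self-contained, and FNV-1a is again only a placeholder for a stronger hash.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder digest (FNV-1a); not cryptographic.
inline std::uint64_t fnv1a(const void* p, std::size_t n) {
    const unsigned char* b = static_cast<const unsigned char*>(p);
    std::uint64_t h = 14695981039346656037ull;
    for (std::size_t i = 0; i < n; ++i) { h ^= b[i]; h *= 1099511628211ull; }
    return h;
}

// Direct model: stores the bytes as given and does no extra work.
struct plain_file {
    std::vector<unsigned char> bytes;
    std::size_t syncs = 0;
    void write(const void* p, std::size_t n) {
        const unsigned char* b = static_cast<const unsigned char*>(p);
        bytes.insert(bytes.end(), b, b + n);
    }
    void sync() { ++syncs; }  // stands in for fsync
};

// More defensive model: spends CPU on a digest trailer after every write.
struct checked_file {
    plain_file inner;
    void write(const void* p, std::size_t n) {
        inner.write(p, n);
        std::uint64_t h = fnv1a(p, n);
        inner.write(&h, sizeof h);
    }
    void sync() { inner.sync(); }
};

// Library-side code is written once against the requirements,
// and the user picks the file type.
template<class File>
void commit(File& f, const std::vector<unsigned char>& record) {
    f.write(record.data(), record.size());
    f.sync();
}
```

With this shape, `commit(fast, record)` and `commit(careful, record)` share the same library code, and only the chosen file type decides how much extra work is done per write.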
If the highly durable implementation required a noticeable amount of
CPU cycles, existing and new users of the library could remain on the
potentially less durable but faster direct platform versions that
"steal" fewer CPU cycles from their system.
> >> You can never assume writes to one inode will reach storage before
> >> another in portable code. You can only assume in portable code that
> >> writes to the same inode via the same fd will reach storage in the
> >> order issued.
> > You chopped my response here too, and I think this was in response
> > to the COW + inode suggestion. If the design knew COW was available
> > for the filesystem in use, couldn't it also know whether data +
> > metadata is synchronized as expected? The suggestion clearly was
> > not portable anyway.
> COW filing systems generally offer much stronger guarantees than
> non-COW filing systems. You are correct that if you are operating on
> one of those, you can skip a ton of work to implement durability.
> This is why AFIO v2 has "storage profiles" where ZFS, ReFS and BtrFS
> are all top of tree in terms of disabling work done by AFIO and its
> clients. FAT32, meanwhile, sits at the very bottom.
Getting the NuDB file concept to work with AFIOv2 seems like it could
be very useful then. Does v2 have a method for specifying dependency
order on writes (I couldn't find any)? I thought v1 had this feature -
does v2 drop it?