Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-26 22:07:14
> You snipped my suggestion after this for swapping the order of the
> header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` due to
> the reasons you established previously. The only caveat you gave
> earlier was that Linux did not implement this properly for the metadata
> flushing on some of its file systems. Which means the log file could be
> "empty" after a restart in the current implementation, even if both
> fsync checkpoints completed.
The reason I snipped it was because the original algorithm is broken,
and so is yours. You are not conceptualising the problem correctly:
consider storage after sudden power loss to be exactly the same as a
malicious attacker on the internet capable of manipulating bits on
storage to any value in order to cause misoperation. That includes
making use of collisions in weak hashes to direct your power loss
recovery implementation to sabotage and destroy additional data after
the power loss.
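
To make the collision point concrete, here is a minimal sketch using a
deliberately weak toy checksum. The checksum and the names
`weak_checksum` and `recovery_accepts` are made up for illustration;
this is not the hash NuDB uses, nor its recovery code.

```cpp
#include <cstdint>
#include <string>

// Toy 16-bit additive checksum. Hypothetical, chosen only because its
// collisions are easy to see: it ignores the order of the bytes.
std::uint16_t weak_checksum(const std::string& bytes) {
    std::uint16_t sum = 0;
    for (unsigned char c : bytes) {
        sum = static_cast<std::uint16_t>(sum + c);
    }
    return sum;
}

// Hypothetical recovery check: a record found on storage is trusted
// if its checksum matches the checksum stored alongside it.
bool recovery_accepts(const std::string& record_on_disk,
                      std::uint16_t stored_checksum) {
    return weak_checksum(record_on_disk) == stored_checksum;
}
```

If the record written was "key=42;len=0100" and storage comes back with
"key=24;len=0100", `recovery_accepts` still returns true, so recovery
would act on a record that was never written.
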
The claim of durability in ACID is a very, very strong claim. You are
explicitly guaranteeing when you claim durability that after power loss
your database will always *perfectly* match a correct state from some
time before power loss. That specifically means that all transactions up
to that point are whole and complete, and all data is exactly correct,
and no partial anything or corrupted anything is there.
NuDB is not using cryptographically strong hashing, and is therefore
subject to collision-induced data loss after power loss, except on filing
systems which provide strong guarantees that corrupted data will never
appear. ZFS and ReFS are two of those, as is ext4 mounted with
"data=journal".
If NuDB clearly said in its docs "no durability guarantees except on
this list of filing systems and mount options: ..." I'd be happy. But it
does not: it makes claims which are obviously wrong. And that's fine as
some library somewhere on github, but if it wants to enter Boost, it
needs to not mislead people or make claims which are patently untrue.
>> You can never assume writes to one inode will reach storage before
>> another in portable code. You can only assume in portable code that
>> writes to the same inode via the same fd will reach storage in the
>> order issued.
> You chopped my response here too, and I think this was in response to
> the COW + inode suggestion. If the design knew COW was available for
> the filesystem in use, couldn't it also know whether data + metadata
> is synchronized as expected? The suggestion clearly was not portable
COW filing systems generally offer much stronger guarantees than non-COW
filing systems. You are correct that if you are operating on one of
those, you can skip a ton of work to implement durability. This is why
AFIO v2 has "storage profiles" where ZFS, ReFS and BtrFS are all top of
tree in terms of disabling work done by AFIO and its clients. FAT32,
meanwhile, sits at the very bottom.
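
As a rough sketch of the idea only (this is not AFIO's API; `fs_kind`
and `needs_extra_durability_work` are hypothetical names, and only the
filing systems named above are listed):

```cpp
// Hypothetical tags for the filing systems mentioned in this thread.
enum class fs_kind { zfs, refs, btrfs, fat32 };

// Returns true when the caller should do its own extra work to get
// durability, false when the filing system is assumed to provide
// strong enough guarantees that the work can be skipped.
bool needs_extra_durability_work(fs_kind fs) {
    switch (fs) {
        case fs_kind::zfs:
        case fs_kind::refs:
        case fs_kind::btrfs:
            return false;  // top of tree: extra work can be disabled
        case fs_kind::fat32:
            return true;   // bottom: assume nothing
    }
    return true;  // anything unrecognised: be conservative
}
```

The only design point illustrated is that the default must be the
pessimistic one: anything not positively known to be safe does the
full work.
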
-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/