Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-22 11:23:54


On 22/03/2017 04:13, Gavin Lambert via Boost wrote:
> On 22/03/2017 16:08, Vinnie Falco via Boost wrote:
>> I think this can be unit tested, and I believe that NuDB's unit test
>> covers the case of power loss. I think we can agree that power loss on
>> a read is uninteresting (since it can't corrupt data). The unit test
>> models a power loss as a fatal error during a write. The test
>> exercises all possible fatal errors using an incremental approach (I
>> alluded to this in my previous message).
>
> A power loss is more like a fatal error that fails to execute any
> subsequent clean-up code, so it might not be quite the same.
>
> There are also more pathological cases such as where a write has been
> partially successful and done some subset of increasing the file size,
> zeroing the extra file space, and writing some subset of the intended
> data. So it's not necessarily that data is missing; there might be
> invalid data in its place.

There are a few rings to testing data loss safety:

Ring 1: Does my application code correctly handle all possible errors in
all possible contexts?

This can be tested using Monte Carlo methods, fuzzing, parameter
permutations, unit and functional testing.
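As a rough illustration of the incremental approach mentioned in the quoted text, here is a minimal sketch in Python. It is not NuDB's actual test code: `FailingFile`, `store_records`, `recover` and `run_incremental` are hypothetical names made up for this example, and the "database" is a toy length-prefixed log. The idea is to fail the 1st write, then the 2nd, and so on, checking after every injected failure that recovery still yields a consistent prefix of the data.

```python
import io


class FailingFile:
    """Hypothetical in-memory file that raises OSError on the Nth write."""

    def __init__(self, fail_at):
        self.buf = io.BytesIO()
        self.fail_at = fail_at
        self.writes = 0

    def write(self, data):
        self.writes += 1
        if self.writes == self.fail_at:
            raise OSError("injected write failure")
        return self.buf.write(data)


def store_records(f, records):
    """Toy code under test: a 4-byte length prefix, then the payload."""
    for r in records:
        f.write(len(r).to_bytes(4, "little"))
        f.write(r)


def recover(raw):
    """Toy recovery: return every complete record, drop a torn tail."""
    out, pos = [], 0
    while pos + 4 <= len(raw):
        n = int.from_bytes(raw[pos:pos + 4], "little")
        if pos + 4 + n > len(raw):
            break  # payload was cut short by the injected failure
        out.append(raw[pos + 4:pos + 4 + n])
        pos += 4 + n
    return out


def run_incremental(records):
    """Fail write 1, then write 2, ... until a run completes untouched.

    Returns the number of failure points that were exercised.
    """
    fail_at = 1
    while True:
        f = FailingFile(fail_at)
        try:
            store_records(f, records)
            completed = True
        except OSError:
            completed = False
        got = recover(f.buf.getvalue())
        # Invariant: whatever survives is a prefix of what was written.
        assert got == records[:len(got)]
        if completed:
            assert got == records
            return fail_at - 1
        fail_at += 1
```

With three records there are six writes, so `run_incremental([b"a", b"bc", b"def"])` exercises six failure points before the first clean run.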

Ring 2: Does my code correctly handle sudden stop?

This can be tested using LXC containers where you kill -9 the container
mid-test. Use Monte Carlo runs for verification.
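A much weaker but easy stand-in for the container test is to SIGKILL a single writer process at a random moment and then check what it left behind. The sketch below assumes a POSIX system and uses a toy record format (length, CRC32, payload); `CHILD`, `valid_prefix` and `kill_trial` are hypothetical names for this example. Note that killing one process does not lose anything already handed to the kernel, so this only exercises torn application-level writes, not the later rings.

```python
import os
import random
import struct
import subprocess
import sys
import time
import zlib

# Hypothetical writer: appends records forever, header and payload
# in two separate unbuffered writes so a kill can land between them.
CHILD = r"""
import struct, sys, zlib
with open(sys.argv[1], "ab", buffering=0) as f:
    i = 0
    while True:
        payload = (b"record-%d" % i) * 8
        f.write(struct.pack("<II", len(payload), zlib.crc32(payload)))
        f.write(payload)
        i += 1
"""


def valid_prefix(raw):
    """Return (number of intact records, bytes they occupy)."""
    pos = count = 0
    while pos + 8 <= len(raw):
        n, crc = struct.unpack_from("<II", raw, pos)
        body = raw[pos + 8:pos + 8 + n]
        if len(body) < n or zlib.crc32(body) != crc:
            break  # torn or corrupt record: stop here
        pos += 8 + n
        count += 1
    return count, pos


def kill_trial(path, max_delay=0.2):
    """Run the writer, kill it at a random time, inspect the file.

    Returns (intact record count, size in bytes of any torn tail).
    """
    proc = subprocess.Popen([sys.executable, "-c", CHILD, path])
    try:
        time.sleep(random.uniform(0.05, max_delay))
    finally:
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()
    raw = b""
    if os.path.exists(path):
        with open(path, "rb") as f:
            raw = f.read()
    count, used = valid_prefix(raw)
    return count, len(raw) - used
```

Running `kill_trial` many times against fresh files, and asserting that the torn tail is never longer than one record, gives the Monte Carlo flavour described above.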

Ring 3: Does my code correctly handle sudden kernel stop?

This can be tested using kvm or qemu where you kill -9 the virtualised
OS mid-test.

Ring 4: Does my code correctly handle sudden power loss to the CPU?

This can be tested using a few dozen cheap odroid devices where you
manually trip their watchdog hard reset hardware feature. This solution
has the big advantage of not requiring the SSD used to be sudden power
loss safe :)

Ring 5: Does my code correctly handle sudden power loss to the storage?

It requires more work and you'll find endless bugs in the kernel, filing
system and the storage device, but you can install a hardware switch to
cut power to the storage device mid-test. This is a never-ending "fun"
task, and it's tested far too rarely by the kernel vendors, but it's a
great simulation of how well faulty storage is handled.

Ring 6: Does my code correctly handle sudden power loss to the system?

Unlike Ring 5 this is actually a better tested situation. Sudden power
loss to everything at once is probably less buggy than Ring 5. Still,
you can get data loss at any level from the kernel, to the SATA chip, to
the device itself.

There are also other test rings not related to sudden power loss. For
example, single and paired bit flips are not uncommon in terabytes of
storage, either transient or permanent. These can be simulated using kvm
by manually flipping random bits in the disc image as it runs. You
might become quite appalled at what data gets destroyed by bugs in the
filing system when facing flipped bits.
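To give a feel for the bit flip test without a VM, here is a small sketch that flips random bits in an in-memory "image" and checks which blocks a per-block CRC32 flags as damaged. The block size and the names `checksums`, `flip_bits` and `damaged_blocks` are assumptions made for this example; a real test would flip bits in the disc image file underneath a running guest.

```python
import random
import zlib

BLOCK = 512  # bytes per checksummed block (arbitrary for this sketch)


def checksums(image):
    """CRC32 of every BLOCK-sized slice of the image."""
    return [zlib.crc32(image[i:i + BLOCK])
            for i in range(0, len(image), BLOCK)]


def flip_bits(image, nbits, rng):
    """Flip nbits distinct random bits in place; return their positions."""
    positions = rng.sample(range(len(image) * 8), nbits)
    for p in positions:
        image[p // 8] ^= 1 << (p % 8)
    return positions


def damaged_blocks(image, expected):
    """Indices of blocks whose checksum no longer matches."""
    return [i for i, (now, was) in enumerate(zip(checksums(image), expected))
            if now != was]


if __name__ == "__main__":
    rng = random.Random(1234)
    image = bytearray(rng.getrandbits(8) for _ in range(8 * BLOCK))
    expected = checksums(image)
    flipped = flip_bits(image, 2, rng)  # a paired bit flip
    print("flipped bits:", flipped)
    print("damaged blocks:", damaged_blocks(image, expected))
```

The interesting part in practice is not detecting the flip but seeing what the code above the checksum layer does once it has been told a block is bad.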

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
