
Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Vinnie Falco (vinnie.falco_at_[hidden])
Date: 2017-03-22 03:08:30


On Tue, Mar 21, 2017 at 10:50 PM, Lee Clagett via Boost
<boost_at_[hidden]> wrote:
> Have you compared this to LMDB?

In our application (rippled) LMDB performs quite poorly even compared
to LevelDB, which in turn performed poorly compared to NuDB. It's true
that the rippled workload is quite niche.

> The pathological case is power loss, which cannot be unit tested
> easily (and probably isn't possible).

I think this can be unit tested, and I believe that NuDB's unit test
covers the case of power loss. I think we can agree that power loss on
a read is uninteresting (since it can't corrupt data). The unit test
models a power loss as a fatal error during a write. The test
exercises all possible fatal errors using an incremental approach (I
alluded to this in my previous message).

The database template requires a File type to instantiate:

  template<class Hasher, class File>
  class basic_store;

NuDB comes with posix_file and win32_file. The unit test uses its own
fail_file which wraps the native file and generates a simulated error
on the Nth write:
https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.hpp#L143
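For illustration, the shape of such a fail-injecting wrapper might look
like this (a simplified sketch with hypothetical names, not NuDB's actual
File concept or fail_file interface; see fail_file.hpp for the real one):

```cpp
#include <cstddef>
#include <stdexcept>

// Sketch only: a counter that trips a simulated fatal error on the
// Nth write. The real NuDB fail_counter differs in detail.
struct fail_counter
{
    std::size_t limit;
    std::size_t count;

    explicit fail_counter(std::size_t n) : limit(n), count(0) {}

    // Returns true when the simulated failure should fire.
    bool fail() { return ++count >= limit; }
};

// Wraps a File-like object; forwards writes until the counter trips,
// then throws, modeling power loss in the middle of a write.
template<class File>
class fail_file
{
    File f_;
    fail_counter* c_;

public:
    fail_file(File f, fail_counter& c) : f_(f), c_(&c) {}

    void write(void const* data, std::size_t n)
    {
        if(c_->fail())
            throw std::runtime_error("simulated power loss");
        f_.write(data, n);
    }
};
```

Because the wrapper satisfies the same interface as the native file, the
database code under test cannot tell the difference between a simulated
failure and a real one.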

The test implementation tries to insert a bunch of records into a new
database using a fail_file, incrementing the counter as it goes. On a
failure it tries to recover (and the recovery itself can also hit a
failed write). Even when the recovery fails, the test verifies that the
database is still consistent. You can see the test implementation here:
https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
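As a toy model of that incremental approach (entirely hypothetical code,
not the real test in recover.cpp): each record takes two writes, an
append plus a commit-marker update, a simulated power loss is injected
at every possible write position in turn, and after recovery the store
must verify as consistent:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy model of the incremental fail-injection approach (not NuDB code).
// A record "commit" takes two writes: append the record, then advance a
// commit marker. A simulated power loss may strike on any write.
struct toy_store
{
    std::vector<int> data;
    std::size_t committed = 0; // number of records known good

    void insert(int v, std::size_t& writes, std::size_t fail_on)
    {
        if(++writes == fail_on)
            throw std::runtime_error("simulated power loss");
        data.push_back(v);
        if(++writes == fail_on)
            throw std::runtime_error("simulated power loss");
        committed = data.size(); // second write: commit the record
    }

    // Roll back any record that was appended but never committed.
    void recover() { data.resize(committed); }

    bool consistent() const { return committed == data.size(); }
};

// Run the insert sequence once per possible failure position, checking
// that recovery always restores a consistent state.
bool incremental_failure_test(std::size_t total_writes)
{
    for(std::size_t fail_on = 1; fail_on <= total_writes; ++fail_on)
    {
        toy_store s;
        std::size_t writes = 0;
        try
        {
            for(int v = 0; v < 5; ++v)
                s.insert(v, writes, fail_on);
        }
        catch(std::runtime_error const&)
        {
            s.recover();
        }
        if(!s.consistent())
            return false;
    }
    return true;
}
```

The point of sweeping fail_on across every write position is that no
possible moment of power loss is left untested, which is what lets the
test make a claim about "any" failure rather than a sampled few.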

I *think* this test proves that the implementation maintains
consistency across any power loss. I would very much like anyone who is
interested to tell me whether this approach works.

> - The key-file is append only so NuDB has multiple blocks of sorted
> records. So hopefully the insertions do not cause lots of blocks or
> the algorithm becomes a linear search.

There's no sorting going on here; the key file is an on-disk hash
table where each bucket holds a good-sized number of keys, so that a
single read from the key file does the most work.
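As a sketch of that idea (an in-memory model with invented names, not
NuDB's actual on-disk layout): hash the key to pick a bucket, fetch the
whole bucket in one read, then scan its entries in memory:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative model only: each bucket packs many key entries, so one
// sequential read of a bucket answers a lookup. Real NuDB stores
// fixed-size buckets in a file; here they are modeled in memory.
struct entry
{
    std::uint64_t key;
    std::uint64_t offset; // position of the value in the data file
};

using bucket = std::vector<entry>; // one "disk read" fetches a bucket

bool lookup(std::vector<bucket> const& key_file, std::uint64_t key,
            std::uint64_t& offset)
{
    // The single read: fetch the entire bucket for this key's hash.
    bucket const& b =
        key_file[std::hash<std::uint64_t>()(key) % key_file.size()];
    // Scan the bucket's entries in memory; no further reads needed.
    for(entry const& e : b)
        if(e.key == key)
        {
            offset = e.offset;
            return true;
        }
    return false;
}
```

With buckets sized to hold many entries, the expected cost of a lookup
stays at one key-file read plus one data-file read, independent of how
many records have been inserted.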

> This is surprisingly unlikely to scale to the IOPS Niall mentioned. The
> reader lock requires exclusive cacheline access to synchronize with the
> writer, which forces the core to wait for a response from other
> cores/processors.

That's possible. I haven't benchmarked it for that scenario, and I am
not knowledgeable about optimizing for memory access / cache performance.

Thanks


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk