|
Boost : |
Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Lee Clagett (forum_at_[hidden])
Date: 2017-03-22 02:50:28
On Tue, 21 Mar 2017 20:00:39 -0400
Vinnie Falco via Boost <boost_at_[hidden]> wrote:
> On Tue, Mar 21, 2017 at 7:36 PM, Niall Douglas via Boost
> <boost_at_[hidden]> wrote:
> > On 21/03/2017 20:36, Vinnie Falco via Boost wrote:
> > I am unsure what benefit it
> > confers over the much safer choices of SQLite or UnQLite both of
> > which are battle tested and written by database experts to ensure
> > that other people's data is never, ever lost.
>
> SQLite is painfully slow compared to NuDB. UnQLite supports features
> not present in NuDB such as cursor iteration, moving, and deleting.
> The tradeoff for not supporting these features is that NuDB can make
> better claims about its performance.
>
Have you compared this to LMDB? That should be the best comparison for
performance and features, although I've never seen LMDB compared
directly to UnQLite.
One thing that I noticed is that the documentation of NuDB indirectly
indicates that recent data loss can occur since an insertion is not
immediately synced to disk. This is going to give NuDB a write
advantage when comparing against many databases.
> > I felt when I reviewed NuDB's implementation some time ago that your
> > implementation would be highly susceptible to data loss in a wide
> > set of circumstances.
>
> NuDB makes some assumptions about the underlying file system. In
> particular, that when a call to fsync() returns it is assured the data
> is reliably written. If these invariants are met there is no data
> loss.
>
> In fact, one of NuDB's unit tests exhaustively verifies that the
> database recovery process performs correctly after an I/O failure. The
> unit test works by instantiating a database with a custom File
> template parameter. This custom file implementation simulates an I/O
> failure on the Nth operation. The test runs with N=1 to failure, then
> increments N, and re-runs the test until there are no failures. At
> each iteration, the test verifies that the database recover process
> left the database in a consistent state. You can see this code here:
> https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
>
> I don't doubt that you think this database is vulnerable to data loss
> under some condition. However, I would appreciate specifics so that I
> can verify under what conditions your claim is correct, or if it is
> correct at all.
The pathological case is power loss, which cannot be unit tested
easily (and probably isn't possible). I _think_ you took this into
consideration with the log file approach. But one potential issue is
when power loss occurs before the log file header is completely
synch'ed. The filesystem could sync the FILE metadata first (size, etc),
so the file appears large enough to contain the entire header metadata,
but actually contains nothing or a portion of the data. Niall thoughts ?
Also:
- The key-file is append only so NuDB has multiple blocks of sorted
records. So hopefully the insertions do not cause lots of blocks or
the algorithm becomes a linear search.
- There does not appear to be an advantage to writing the blocks
to the log file first. There is already a time window after an
insertion where data loss can occur, so the data and key file sizes
in the header seem sufficient for rollback.
> > I also felt it wouldn't scale well to the > 1M IOPS
> > storage just around the corner.
>
> While it hasn't yet been tested on these non-existent devices, I am
> optimistic, since NuDB was specifically designed for very intense
> fetch workloads with light concurrent insertions. Fetches can happen
> concurrently, it should be possible to saturate the read throughput of
> a device. See:
> https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_store.ipp#L213
This is surprisingly unlikely to scale to the IOPS Niall mentioned. The
reader lock requires exclusive cacheline access to synchronize with the
writer, which forces the core to wait for a response from other
cores/processors. IIRC the problem is amplified in multi-sockets
systems where the communication must leave the "die". The speculative
locking feature may help here, but Niall tested this a while ago and
showed some surprising performance quirks (although I do not recall
specifics). Niall or Dimov should be better informed on this subject,
and may correct me a bit.
Generally this is why something like LMDB does so well in benchmarks;
the design/implementation was tailored to modern virtual page caching
and cachelines. Which isn't to say its 100% perfect, just that its
likely to perform better than any implementation which does not take
those system designs into consideration.
> > I would suspect it is being
> > used in a highly restricted use case where corner cases don't
> > present.
>
> Maybe. I'll counter by saying, there are only three operations
> possible on an open database: insert, fetch, and close.
>
> > if you are up for getting this to a state ready to submit to Boost,
> > I'd welcome it.
>
> Cool! The docs just need a bit of work, the library code is more than
> ready.
>
> Thanks for taking the time to go through it!
>
Lee
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk