Boost logo

Boost :

Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-22 10:43:46


>> I am unsure what benefit it
>> confers over the much safer choices of SQLite or UnQLite both of which
>> are battle tested and written by database experts to ensure that other
>> people's data is never, ever lost.
>
> SQLite is painfully slow compared to NuDB. UnQLite supports features
> not present in NuDB such as cursor iteration, moving, and deleting.
> The tradeoff for not supporting these features is that NuDB can make
> better claims about its performance.

You are correct that NuDB is a single purpose design, and so will always
be faster.

However SQLite configured into "I don't care about data loss" mode is
pretty swift. "Painfully slow" is not a correct term to use except for
its default data loss caring mode.

>> I felt when I reviewed NuDB's implementation some time ago that your
>> implementation would be highly susceptible to data loss in a wide set of
>> circumstances.
>
> NuDB makes some assumptions about the underlying file system. In
> particular, that when a call to fsync() returns it is assured the data
> is reliably written. If these invariants are met there is no data
> loss.
>
> In fact, one of NuDB's unit tests exhaustively verifies that the
> database recovery process performs correctly after an I/O failure. The
> unit test works by instantiating a database with a custom File
> template parameter. This custom file implementation simulates an I/O
> failure on the Nth operation. The test runs with N=1 to failure, then
> increments N, and re-runs the test until there are no failures. At
> each iteration, the test verifies that the database recover process
> left the database in a consistent state. You can see this code here:
> https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
>
> I don't doubt that you think this database is vulnerable to data loss
> under some condition. However, I would appreciate specifics so that I
> can verify under what conditions your claim is correct, or if it is
> correct at all.

The single biggest thing which struck me when reading your
implementation was that you make very strong assumptions about your
storage, and that generally leads to corner case data loss.

Plucking straight from random as it was too long ago I examined your
source, but a classic mistake is to assume this is sequentially consistent:

int keyfd, storefd;
write(storefd, data)
fsync(storefd)
write(keyfd, key)
fsync(keyfd)

Here the programmer writes the value being stored, persists it, then
writes the key to newly stored value, persists that. Most programmers
unfamiliar with filing systems will assume that the fsync to the storefd
cannot happen after the fsync to the keyfd. They are wrong, that is a
permitted reorder. fsyncs are only guaranteed to be sequentially
consistent *on the same file descriptor* not different file descriptors.
Technically speaking, you could even dupfd() the same fd to the same
inode and fsync is not required to be sequentially consistent between
those two fds, though most major filing systems do implement that
because the locking is kept at inode level.

I can't remember if NuDB makes that sort of mistake, but the single
biggest thing which struck me was your assumption that storage is
reliable and predictable and that the syscalls you wrote will be
executed in the order you wrote them. None of those are true except in a
very limited circumstance i.e. power never goes off, storage never
fails, ECC RAM is present, bitflips never occur, you are on a very
recent Linux kernel, your ext4 filing system hasn't been mounted with
unusual flags etc.

You also rely heavily on the kernel page caching architecture to give
your performance. That's a good move as most people trying to invent
their own caching architecture on top of O_DIRECT do a far worse job.
However to really get IOPS out of storage just around the corner where
hitting 1M 4K IOPS is achievable, you can't avoid O_DIRECT for that.

>> I also felt it wouldn't scale well to the > 1M IOPS
>> storage just around the corner.
>
> While it hasn't yet been tested on these non-existent devices, I am
> optimistic, since NuDB was specifically designed for very intense
> fetch workloads with light concurrent insertions. Fetches can happen
> concurrently, it should be possible to saturate the read throughput of
> a device. See:
> https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_store.ipp#L213

The Macbook Pro I am typing on right now hits 300k 4Kb IOPS just fine.
Some of the systems with big RAM drives I've tested AFIO were busting
past 2M IOPS. These are not non-existent devices, they are just
currently high end, and rapidly heading towards the consumer.

>> I would suspect it is being
>> used in a highly restricted use case where corner cases don't present.
>
> Maybe. I'll counter by saying, there are only three operations
> possible on an open database: insert, fetch, and close.

I was more referring to use environment. For example, atomic appending
to a file is super slow on ZFS. It may be fast on ext4 and there you
should use atomic appends, but on other filing systems you need to set
the extent to maximum and use lazy allocation via an atomically updated
manually kept watermark.

There's lots of this stuff, it's why AFIO v2 has a concept of "storage
profiles" which keep a precanned mix of AFIO filing system algorithms
for some storage combo. That way if you load your key-value store on ZFS
on Linux, the storage profile says to use algorithms A, B and D. If you
load it on NTFS on Windows, use algorithms B, C and F instead. AFIO
provides many algorithms all accessible via an abstracted base class so
things using them don't need to care.

>> if you are up for getting this to a state ready to submit to Boost,
>> I'd welcome it.
>
> Cool! The docs just need a bit of work, the library code is more than ready.
>
> Thanks for taking the time to go through it!

I appreciate I'm full of nothing but criticism, but I wanted to give you
some idea of how a peer review would go.

I do want to say thank you for writing your library, and intending to
bring it to Boost. It may well be that the community will accept NuDB if
it very substantially reduces the guarantees it offers to essentially
none i.e. this is a fast key-value insert only store with no guarantees.

But I will mention that if I don't need to provide any guarantees, I
believe I could roll something with AFIO v2 even in its current very
alpha state that will match or beat NuDB in less than a hundred lines of
code and about one week of work. I even reckon I can achieve key-value
deletion if users are willing to accept a reasonable ceiling on total
item count of maybe a few million items, and use of a recent extents
based filing system.

I don't wish to do your work down, but a key value store with no
guarantees really is that easy to implement with AFIO v2, right now.
It's the guarantees and achieving consistent performance across
environments which is hard.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk