Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-30 14:08:53
>> If you implemented NuDB as a simple data file and a memory mapped key
>> file and always atomic appended transactions to the data file when
>> inserting items, then after power loss you could check if the key file
>> mentions extents not possible given the size of the data file. You
>> then can rebuild the key file simply by replaying through the data
>> file, being careful to ignore any truncated final append.
> I think this is trading an atomic `truncate(0)` assumption with an
> atomic multi-block overwrite assumption. So this seems like something
> that is more likely to have a torn write that is hard to notice.
Implied in my proposal was that every record appended to the data file
would be hashed. You need this as atomic append does not send those
appends to storage in order, so appends could be torn.
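A minimal sketch of such per-record hashing, assuming a hypothetical fixed header framing each append (the header layout and the choice of FNV-1a here are purely illustrative, not NuDB's actual format):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical framing: every record appended to the data file carries
// its payload length and a hash of the payload, so a torn (partially
// persisted) append is detected when the file is replayed.
struct record_header {
    std::uint64_t size;  // payload bytes that follow the header
    std::uint64_t hash;  // FNV-1a of the payload
};

// FNV-1a, chosen only for brevity; any strong hash would do.
inline std::uint64_t fnv1a(const void* data, std::size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint64_t h = 1469598103934665603ULL;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

// During replay a record is intact only if the whole payload made it
// to storage and its hash matches the header.
inline bool record_intact(const record_header& hdr,
                          const void* payload, std::size_t available) {
    return available >= hdr.size && fnv1a(payload, hdr.size) == hdr.hash;
}
```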
>> That would be a reasonable power loss recovery algorithm. A little
>> slow to do recovery for large databases, but safe, reliable,
>> predictable and it would only run on a badly closed database. You can
>> also turn off fsync entirely, and let the atomic appends land on
>> storage in an order probably close to the append order. Ought to be
>> quicker than NuDB by a fair bit, much fewer i/o ops, simpler design.
> How would it notice that a bucket was partially overwritten though?
> Wouldn't it have to _always_ inspect the entire key file?
My memory mapped key file would be something like:
std::atomic<off_t> data_file_extent, data_file_written;
open_hash_index<key_type, std::pair<off_t, off_t>> keys;
// maps each key to its offset in the data file and the size of the value stored
When opening the database for use, you'd first check that
data_file_extent matches the actual size of the data file. You'd then
check that every key refers to an extent before the data_file_extent. If
anything doesn't match up, you need to rebuild.
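That open-time check might look like the following sketch, where `key_file_view` and `needs_rebuild` are hypothetical names and a `std::map` stands in for the open-addressed hash index:

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical in-memory view of the mapped key file, mirroring the
// layout sketched above.
struct key_file_view {
    std::atomic<std::uint64_t> data_file_extent{0};
    std::atomic<std::uint64_t> data_file_written{0};
    // key -> (offset into data file, size of stored value)
    std::map<std::uint64_t, std::pair<std::uint64_t, std::uint64_t>> keys;
};

// Open-time validation: the recorded extent must match the real data
// file size, and every key must refer wholly to space before that
// extent. Any mismatch means the key file must be rebuilt by replaying
// the data file.
bool needs_rebuild(const key_file_view& kf, std::uint64_t actual_size) {
    const std::uint64_t extent = kf.data_file_extent.load();
    if (extent != actual_size)
        return true;
    for (const auto& kv : kf.keys) {
        const std::uint64_t off = kv.second.first;
        const std::uint64_t len = kv.second.second;
        if (off + len > extent)
            return true;
    }
    return false;
}
```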
The data_file_written atomic BTW is atomically updated when an append
completes, whereas data_file_extent is atomically updated before an
append is begun. This should keep a correct window of extents to check
for torn writes if O_SYNC is working.
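The ordering could be sketched like this, with an in-memory buffer standing in for the data file (all names are illustrative, not NuDB's API):

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// data_file_extent is advanced before the append is issued,
// data_file_written only after it completes. After power loss, offsets
// in [written, extent) are the window that may hold torn appends and
// must be verified against the per-record hashes.
struct data_file {
    std::atomic<std::uint64_t> extent{0};   // reserved length
    std::atomic<std::uint64_t> written{0};  // completed length
    std::vector<unsigned char> bytes;       // stand-in for the file

    std::uint64_t append(const void* p, std::uint64_t n) {
        std::uint64_t at = extent.fetch_add(n);       // reserve first
        if (bytes.size() < at + n) bytes.resize(at + n);
        std::memcpy(&bytes[at], p, n);                // the append itself
        written.fetch_add(n);                         // acknowledge last
        return at;
    }

    // The suspect window recovery must verify record by record.
    std::pair<std::uint64_t, std::uint64_t> torn_window() const {
        return {written.load(), extent.load()};
    }
};
```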
But what if O_SYNC isn't working? I'd solve this by writing a timestamp
with each transaction record written to the data file. So you parse
backwards from whatever survived the power loss at the end of the file
by say one minute's worth of writes. Most sane filing systems won't
leave dirty data unwritten for more than a minute.
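A sketch of that backward scan, assuming a hypothetical `txn_record` carrying the per-record timestamp and a parsed tail of the data file (names and the sixty-second window are illustrative):

```cpp
#include <cstdint>
#include <vector>

struct txn_record {
    std::uint64_t timestamp;  // seconds since epoch, written with the record
    // ... payload, hash, etc.
};

// Walk back from the newest surviving record and return the index of
// the first record that must be re-verified: everything newer than
// (newest timestamp - window) is suspect, on the assumption that a
// sane filing system flushes dirty data within the window.
std::size_t verify_from(const std::vector<txn_record>& records,
                        std::uint64_t window_seconds = 60) {
    if (records.empty()) return 0;
    const std::uint64_t newest = records.back().timestamp;
    const std::uint64_t cutoff =
        newest > window_seconds ? newest - window_seconds : 0;
    std::size_t i = records.size();
    while (i > 0 && records[i - 1].timestamp >= cutoff)
        --i;
    return i;  // records[i..] fall inside the suspect window
}
```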
That would be my automatic "load the database" implementation; it's a
reasonable balance of performance with checks. Of course one could and
should also provide a "scan the entire database for consistency and
repair" operation.
-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk