Subject: Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-03-22 11:46:35


On 22/03/2017 10:50, Olaf van der Spek wrote:
> On Wed, Mar 22, 2017 at 11:43 AM, Niall Douglas via Boost
> <boost_at_[hidden]> wrote:
>> Plucking straight from random as it was too long ago I examined your
>> source, but a classic mistake is to assume this is sequentially consistent:
>>
>> int keyfd, storefd;
>> write(storefd, data)
>> fsync(storefd)
>> write(keyfd, key)
>> fsync(keyfd)
>>
>> Here the programmer writes the value being stored, persists it, then
>> writes the key to newly stored value, persists that. Most programmers
>> unfamiliar with filing systems will assume that the fsync to the storefd
>> cannot happen after the fsync to the keyfd. They are wrong, that is a
>> permitted reorder. fsyncs are only guaranteed to be sequentially
>> consistent *on the same file descriptor* not different file descriptors.
>
> Just curious, how is that permitted?
>
> Isn't fsync() supposed to ensure data is on durable storage before it returns?

A common misconception. Here is the POSIX wording:

"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer
is implementation-defined. The fsync() function shall not return until
the system has completed that action or until an error is detected.

[SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the fsync()
function shall force all currently queued I/O operations associated with
the file indicated by file descriptor fildes to the synchronized I/O
completion state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion. [Option End]"

So, without _POSIX_SYNCHRONIZED_IO, all that fsync() guarantees is that
it will not return until the *request* to transfer outstanding data to
storage has completed. In other words, it pokes the OS to start flushing
data now rather than later, and returns without waiting for that data to
actually reach storage. OS X implements this sort of fsync(), for example.
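
To make the difference concrete, here is a minimal sketch in C of asking
for the strongest flush a platform offers. The helper names
flush_to_storage() and claims_synchronized_io() are made up for this
illustration and come from no library. It assumes that where
F_FULLFSYNC is defined (as on OS X), fcntl() with that command is the
documented way to ask the drive to flush its own cache too, and it falls
back to plain fsync() elsewhere.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: request the strongest flush available for fd.
 * Returns 0 on success, -1 on error with errno set. */
static int flush_to_storage(int fd)
{
    assert(fd >= 0);
#if defined(F_FULLFSYNC)
    /* OS X: also asks the drive to flush its write cache. Some filing
     * systems reject this command, so fall through to fsync() then. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
#endif
    return fsync(fd);
}

/* Hypothetical helper: 1 if the system claims the Synchronized I/O
 * option at run time, 0 otherwise. A claim is not proof that the
 * filing system honours it. */
static int claims_synchronized_io(void)
{
    return sysconf(_SC_SYNCHRONIZED_IO) > 0;
}
```

Neither helper changes what the filing system actually does underneath;
they only select the strongest request the platform exposes.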

With _POSIX_SYNCHRONIZED_IO, you get the stronger guarantee that upon
return from the syscall, "synchronized I/O file integrity completion"
has occurred. Linux infamously claims _POSIX_SYNCHRONIZED_IO, yet
ext2/ext3/ext4 don't implement it fully and will happily reorder fsyncs
of the metadata needed to later retrieve a fsynced write of data. So
when fsync returns, the data itself has been written in a sequentially
consistent way, but the metadata needed to retrieve it later has not
necessarily been: that metadata can be reordered with respect to other
fsyncs.
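
As an illustration of the metadata problem, one commonly documented
mitigation on Linux is to fsync the containing directory after fsyncing
a newly created file, since the Linux fsync(2) man page notes that
fsync on a file does not necessarily persist its directory entry. The
sketch below assumes that approach. write_file_durably() is a
hypothetical name, this is not a description of what AFIO does, and it
does not claim to cover every reorder described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create `name` inside directory `dir`, write
 * `len` bytes, then flush the file and the directory that names it.
 * Returns 0 on success, -1 on any failure. */
static int write_file_durably(const char *dir, const char *name,
                              const void *buf, size_t len)
{
    assert(dir && name && buf);
    int dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    int fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        close(dirfd);
        return -1;
    }
    const char *p = buf;
    int rc = 0;
    while (len > 0) {                 /* write() may be partial */
        ssize_t n = write(fd, p, len);
        if (n < 0) { rc = -1; break; }
        p += n;
        len -= (size_t)n;
    }
    if (rc == 0 && fsync(fd) != 0)    /* file data and inode */
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    if (rc == 0 && fsync(dirfd) != 0) /* directory entry for the file */
        rc = -1;
    close(dirfd);
    return rc;
}
```

Each extra fsync costs latency, which is the performance trade-off
mentioned below.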

AFIO v1 and v2 take care of this sort of stuff for you. If you tell AFIO
you want a handle to a file to write reliably, AFIO does what is needed
to make it reliable. Be prepared to give up lots of performance, however.
This is where async file i/o starts to become very useful: you can queue
up lots of writes, and completion handlers will fire when each write
really has reached storage in a way that is always retrievable in the
future (excluding bugs in the kernel, filing system, storage device etc).

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/