Boost logo

Boost :

Subject: [boost] [atomic] Generalized read-modify-write operation
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2015-09-01 05:35:37


Hi,

I was wondering is there would be interest in a generalized
read-modify-write operation in Boost.Atomic. The idea is to have a
atomic<T>::read_modify_write method that would take a function that
would perform the modify on the atomic value. The interface could be
roughly this:

 template< typename T >
 class atomic
 {
 public:
  // Returns the original value
  T read_modify_write(F modify, memory_order order = memory_order_seq_cst);
 };

Usage example:

 // Saturating increment
 atomic< int > a;
 a.read_modify_write([](int n)
  { return n < std::numeric_limits< int >::max() ? n + 1 : n; });

The implementation could synthesize a CAS loop around the modify
function or (ideally) wrap it inside the LL/SC instructions, also in a
loop.

I believe, in order for this to work the modify function should meet
certain requirements:

1. It must not use atomic operations (on this or other variables) or
thread synchronization primitives.
2. It must not call functions that might involve context switching
(e.g. syscalls, i/o).
3. It should be as small as possible and have the least possible
working set (that is the number of accessed variables).
4. It must be prepared to be called multiple times.
5. It must be prepared to receive a value that has never been stored
in the atomic variable.

The first restriction is essential to avoid deadlocking in case of the
emulated (lock-based) atomic<>. Thread synchronization primitives may
possibly be using atomic ops internally, so are prohibited as well.

The 2 and 3, besides just being something rational one would strive
for anyway, are mostly targeted to allow to inject the function call
inside the LL/SC pair. It is known that some architectures are very
sensitive to what is happening between the LL and SC instructions, so
accessing some memory from the modify function or switching the thread
context will result in SC always failing and an infinite loop.

The 5 would allow to make the first load non-atomic, which can be
faster. The CAS or SC instruction will fail if the loaded value was
inconsistent.

I know there's probably no way to ensure that the function meets these
requirements, and it is possible to get an infinite loop in runtime,
so maybe always using a CAS loop would be a safer approach, even on an
architecture with LL/SC instructions. My hope is that compilers will
someday provide intrinsics for this and will be able to generate
optimal code, when possible.

Does this look interesting to anyone? Comments?


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk