Boost logo

Boost :

From: André Almeida (andrealmeid_at_[hidden])
Date: 2021-05-15 17:25:02

Às 15:05 de 14/05/21, Andrey Semashev via Boost escreveu:
> On 5/14/21 7:35 PM, André Almeida wrote:
>> Às 18:07 de 13/05/21, Andrey Semashev via Boost escreveu:
>>> Beware of a long post.
>>> Of the libraries I mentioned, the prime user of futex2 would be
>>> Boost.Atomic. With the current implementation based on existing futex
>>> API, the important missing part is support for futex sizes other than 32
>>> bits. This means that for atomics other than 32-bit Boost.Atomic must
>>> use an internal lock pool to implement waiting and notifying operations,
>>> which increases thread contention. For inter-process atomics, this means
>>> that waiting must be done using a spin loop, which is terribly
>>> inefficient. So, the support for 8, 16 and 64-bit futexes would be very
>>> much needed here.
>> Cool, I'll be adding "better support for userspace atomics" as an use
>> case for variable size futexes.
>> So far, I've been advertising that variable sized futexes would be good
>> for saving some memory. Do you think that using 8bit sized futexes for
>> e.g. Boost's mutexes would save something or would be unlikely noticed?
> Honestly, I don't think small-sized futexes would be noticeable in terms
> of memory consumption in real world applications, aside from extremely
> low end embedded domain (think microcontrollers with kilobytes of
> memory). And in those kind of devices you may not have a Linux kernel in
> the first place. Even mobile phones these days come with several gigs of
> RAM.
> Most of the data is 4 or 8 byte aligned, so making a futex smaller than
> that would likely just contribute to padding. And, to avoid false
> sharing, you generally want to make sure the mutex and the protected
> data is within a cache line (typically, 64 bytes) and no other,
> unrelated data is in that cache line. So unless you're creating
> thousands of futexes and want to pack them tightly (which is detrimental
> to performance), I don't think the small futex size is a meaningful
> advantage.
> Small sized futexes are mostly useful when you already have a small data
> item, and you also want to block on it, which is the use case of atomics.

Noted, thanks.

>>> As for use cases outside Boost, that application that I described in the
>>> LKML post would benefit not only from 64-bit futexes but also from the
>>> ability to wait on multiple futexes.
>>> [...]
>>> I can provide more details on this use case, if you're
>>> interested.
>> Please, I would love to hear more about that.
> Well, I've already described most of it in the LKML post and my previous
> reply. To recap:
> - The futexes are created in shared memory and accessed by multiple
> processes.
> - The futexes are created in threes, one is used as a mutex and the
> other two as condition variables. We use FUTEX_REQUEUE to move blocked
> threads from cond vars to the mutex to wake them up upon releasing the
> mutex.
> - The mutex futex value is divided into sets of bits. Some bits are used
> for ABA counter and some - for PID of the owner to implement robust
> semantics. This is where a larger futex would be useful, as the 32-bit
> putex is not large enough to accommodate a full PID.
> - For cond var futexes, we're using the futex bitset feature to
> selectively wake up threads subscribed to different events. In
> particular, this allows a thread to wake up all other threads within the
> same process but not the threads from other processes, which is useful
> for indicating process-local events (e.g. a request for termination).

Right, so you have a nice custom sync mechanism implementation. If isn't
confidential, can you explain which kind of application/work load are
you dealing with? e.g. it's a database

> Looking at our current implementation, we are missing a
> FUTEX_REQUEUE_BITSET operation, which would act as FUTEX_REQUEUE, but
> only for blocked threads with a matching bitset. Because of that we have
> to use FUTEX_WAKE_BITSET, which may cause a thundering herd effect.
> It is possible to replace the futex bitset functionality with more
> futexes (with the ability to wait on multiple futexes). However, I
> suspect, in some cases such replacement would not be so easy. The
> problem is that with bitsets the notifying thread only has to perform
> one syscall to notify of multiple events (FUTEX_WAKE_BITSET), while with
> multiple futexes one would have to perform one FUTEX_WAKE per futex
> representing an event. I *think*, in our case we can avoid this overhead
> and implement a scheme where notifying (or rather, requeueing from) only
> one futex is enough. But I'm not sure this will be doable in other
> applications.
> Also, I'm not sure if the increased number of futexes would cause more
> overhead on the kernel side. Did you perform benchmarks comparing futex
> bitsets and an equivalent logic with multiple futex2 futexes? I mean,
> have two configs:
> - N threads multiplexed on a single futex using different bits in a bitset.
> - N threads, each waiting on N futex2 futexes.
> And compare the cost of blocking, waking and requeueing a thread in each
> config.
> In any case, I think, futex bitsets are a useful feature, even if our
> application can do without it.

I didn't benchmark that, because I never thought of wait on multiple as
a replacement for futex bitset operations, but I'll have a look on that.
Also, the current API doesn't accommodate bitset for now, so it's good
to know that there are users out there, I thought this feature wasn't
used nowadays.

Boost list run by bdawes at, gregod at, cpdaniel at, john at