|
Boost : |
From: Federico Abrignani (federico.abrignani_at_[hidden])
Date: 2023-10-21 14:11:10
Dear Boost Community,
I am writing to seek feedback on a new C++ library that I have been
developing (https://github.com/Sernior/fairness
<https://github.com/Sernior/fairness>).
This library, named "*Fairness*," is still a work in progress.
Nonetheless, I believe that this is an opportune time to begin gathering
feedback.
*Fairness* is designed to provide fair synchronization primitives. While
the standard C++ library offers various tools for managing concurrency,
it does not provide mechanisms for handling fairness policies.
To introduce the most fundamental component of my library, the
"priority_mutex," I would like to share a simple example:
/#include <thread>/
/#include <chrono>/
/#include <algorithm>/
/#include <BS_thread_pool.hpp>/
/#include <boost/fairness.hpp>/
/
/
/static boost::fairness::priority_mutex<4> pm;/
/
/
/static void busy_wait_nano(int64_t nanoseconds){/
/auto begin = std::chrono::steady_clock::now();/
/for(;std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now()
- begin).count() < nanoseconds;)/
/Â Â Â continue;/
/}/
/
/
/static void thread_function_nano(int p, int preCriticalTime, int
postCriticalTime, int criticalTime){/
/Â busy_wait_nano(preCriticalTime);/
/Â pm.lock(p);/
/Â busy_wait_nano(criticalTime);/
/Â pm.unlock();/
/Â busy_wait_nano(postCriticalTime);/
/}/
/
/
/int main()/
/{/
/Â std::array<int, 8> prios {0, 2, 2, 1, 1, 3, 3, 0};/
/Â std::array<int, 8> preCT {2000, 1500, 2000, 3000, 1000, 500, 500, 2000};/
/Â std::array<int, 8> postCT {5000, 3000, 2000, 2500, 1000, 1500, 1500,
4500};/
/Â int criticalTime = 1000;/
/
/
/Â BS::thread_pool pool(8);/
/
/
/Â for (int i = 0; i < 8; ++i){/
/Â Â Â pool.push_task(thread_function_nano, prios[i], preCT[i],
postCT[i], criticalTime);/
/Â }/
/Â pool.wait_for_tasks();/
/
/
/Â return 0;/
/}
/
/// using BS thread pool https://github.com/bshoshany/thread-pool%c2%a0in
this example./
/
/
In this example priorities are not given randomly, you can see that I
gave 0 priority (the highest priority) to thread 0 which has 2000 + 1000
+ 5000 and thread 7 which has 2000 + 1000 + 4500.
Priority mutexes function much like a `std::array`, using a 0-indexed
based system. For instance, a `priority_mutex<4>` will manage priorities
ranging from 0 to 3, with 0 being the highest priority.
Currently, the library comprises priority mutexes, both recursive and
shared versions, along with RAII wrappers such as `unique_lock` and
`shared_lock`, all of which support priorities. There are, however, some
elements still missing, including priority condvars and timed versions
of the mutexes.
The primary motivation behind creating this library is to enhance
throughput in multi-threaded pipelines.
To illustrate the concept more simply, imagine a pipeline with N tasks,
some of which access a protected resource.
In cases of contention, there exists an optimal order of resource
accesses that minimizes overall execution time,
https://sernior.github.io/fairness/
<https://sernior.github.io/fairness/>Â the images show 2 different ways
of ordering critical sections one of which is much better than the other.
This library aims to empower developers, who have profiled their
software, to manipulate this order with minimal performance overhead and
as simply as possible.
In terms of implementation, much of the development has revolved around
benchmarking and testing. During this process, I've experimented with
over 10 different versions of the priority mutex.
Ultimately, I achieved the best results by using a priority spinlock,
which is also available to library users, to provide internal atomicity
to the rest of the mutexes.
An essential consideration in the implementation is ensuring that the
internal priority spinlock is 128 bytes away from the rest of the memory
used by the mutex especially the memory used to perform the atomic::wait on.
This separation helps avoid false sharing. The choice of 128 bytes
aligns with processor architectures like AMD Zen 4 (the one I am testing
on), as indicated in their optimization manual.
I can attest that without those 128 byte separation the PM_LockUnlock
benchmark was 10X worse and the PM_pipeline_benchmark_fast was worse
than the std counterpart (see below for the benchmarks).
Another aspect to note in the implementation is the use of 4 bytes for
booleans when only a single byte is needed. This decision is related to
the way atomic::wait is implemented. Most standard library
implementations of atomic::wait check if the type size is supported by
the current platform's wait syscall.
For instance, Windows can wait on 1 byte, while Linuxkernel based
systems can only wait on 4 bytes.
These are some benchmark results:
6.4.0-kali3-amd64
AMD Ryzen 5 7600X 6-Core Processor
Now the LockUnlock benchmarks were the most difficult to optimize.
Those are 8 threads simply doing:
m.lock();
m.unlock();
The pipeline benchmarks (with different lenght) are actual pipelines
where each thread has different times before/during/after the critical
section and I try to set the priorities accordingly to finish as soon as
possible.
You can see there is also a mutex called slim_priority_mutex, that is a
priority_mutex implementation that can support up to 7 (15 if boost
atomic is found and your hardware supports the DWCAS) priorities. Those
are light weight versions that only take 128 bits up to 7 priorities and
past that, up to 15, 256 bits.
I am also testing with a custom wait implementations
(https://github.com/Sernior/fairness/blob/main/include/boost/fairness/detail/wait_ops.hpp
<https://github.com/Sernior/fairness/blob/main/include/boost/fairness/detail/wait_ops.hpp>)
which can be tested by defining BOOST_FAIRNESS_USE_EXPERIMENTAL_WAIT_NOTIFY.
At some point I wish to make this wait the default version the lib uses
instead of the std one since users of my lib could configure this
version using my configs macros (number of fast spins - relaxed spins -
yield spins before the wait syscall).
As I said in the beginning other than to present my lib, this email has
the purpose of obtaining feedback. I would like to see if there is
anyone (other than me) interested in what I am doing and, if so,
consider that I am still far from finished and I would love suggestions.
Thank you all for your attention.
Best regards,
Federico Abrignani.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk