I am trying to solve a network flow problem by C++ multithreading.

Given a network (all nodes are connected by arcs, each arc is connected to 2 and only 2 ending nodes, one is input node and another is output node, each node can have multiple input arcs and output arcs), each node needs to do some computing and then exchange computing result data to its connected input and output nodes.

Multiple nodes can be grouped into one task, which is run by one thread. In this way, the whole network computing workload can be partitioned into multiple tasks. All these tasks are pushed into a boost thread pool such that all threads can run the tasks at the same time.

But, if a node (in a thread task) needs to do data exchange with another node (in another thread task), there is a synchronization problem. Data receiver needs to wait for data available in the data buffer of the data sender.

My proram needs to partition the network such that each thread's task workload is assigned as evenly as possible. If all threads share the one-large data buffer structure, the program parallelism is not good because the critical section is too large. Some threads have to wait the the one-large data buffer structure unlocked even though the part of the data structure (which is useful to them ) has been available for read or write.

For example, the one-large data buffer structure has the following buffer cells: cell1 , cell2, cell3 , cell4.
When thread 1 is trying to write cell 1, it must lock the whole data buffer structure so that thread 2 cannot read or write cell 2 and so on.

So, I want to break the one-large data buffer structure into multiple distinct data cells according to the thread number so that each cell holds the data only needed by one thread task.
For example, if we have 2 threads, we create 2 data cells that hold data needed by the 4 thread separately. If we have 4 threads, we create 4 data cells that hold data needed by the 4 thread separately. and so on.

My questions are:

(1) How to design the data cell ? You can see that its size is based on the number of threads.

(2) How to reduce synchronization overhead ? The critical section is small but the overhead of geting and releasing mutex may be very high if the inter-node data exchange frequency is high.

(3) When a node's computing is done and data is written to its cell how to notify the data receiver node such that the notification messgae is only received by the waiting thread that run the receiver node computing task. All other unrelated nodes and threads are not impacted.

The program is very time-sensitive and the latency of message exchange should be controlled very toughly and reduced as much as possible.

Any help is really appreciated.