Boost Users :

Subject: Re: [Boost-users] thread_group::interrupt_all is not reliable
From: Roland Bock (rbock_at_[hidden])
Date: 2009-12-01 03:29:39

Stonewall Ballard wrote:
> I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
> This gave me the clue to try that:
> <>
>> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
> thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp.
> Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_group::interrupt_all() appears to be 100% reliable.
> I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting.
> Of course, this could be a bug in the OS X pthreads implementation as well.
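
For readers following along, the workaround described in the quoted text can be sketched roughly as follows. This is only a standard-library analogue, not Boost code: it uses std::condition_variable and a plain stop flag in place of boost::thread interruption, and the helper name stop_all_workers is made up for the illustration. The point it shows is the caller taking the waiters' mutex before changing the state and broadcasting.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not part of Boost or the original test code):
// starts n workers that wait on one condition variable, then stops them
// all. Returns how many workers observed the stop request.
int stop_all_workers(int n) {
    assert(n >= 0);

    std::mutex m;
    std::condition_variable cv;
    bool stop = false;  // guarded by m
    int woken = 0;      // guarded by m

    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            std::unique_lock<std::mutex> lock(m);
            // The predicate is re-checked under m, so a stop request that
            // arrives before the worker blocks is still seen.
            cv.wait(lock, [&] { return stop; });
            ++woken;
        });
    }

    {
        // Take the same mutex the waiters use, then change the state and
        // broadcast while still holding it.
        std::lock_guard<std::mutex> lock(m);
        stop = true;
        cv.notify_all();
    }

    for (auto& t : workers) t.join();
    return woken;  // safe to read: all workers have been joined
}
```

Calling stop_all_workers(16) is expected to return 16. Whether the same reasoning fully explains the Boost behaviour on OS X is not established here; the sketch only illustrates the locking discipline.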


FWIW, I ran that test of yours several times with varying parameters on
my machine (quad core, 64bit, linux) and it did not show a single
failure. Of course, since it is not a deterministic effect even on your
machine, failure to reproduce does not really mean much, but well, I
thought you might like to hear anyway :-)

And I totally agree: Predictable scheduling should not be required to
wake up all threads, especially since the document also says

The pthread_cond_broadcast() or pthread_cond_signal() functions may be
called by a thread whether or not it currently owns the mutex [...]

As for boiling down the application for others to inspect:
Your debugger showed that the thread is still in wait() after the
interrupt call.

Can you confirm that ALL worker threads are in wait() prior to the interrupt?
   * If yes: There seems to be no connection with the interlocked queue,
     the sleep and so on. It should be possible to get rid of all that
     for a much simpler test program
   * If no: OK, there seems to be a connection between the wait(), the
     interrupt and the sleep and/or mutex.
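
One way to check the "all workers are in wait()" condition is to count the waiters under the same mutex they wait on. The following is a rough sketch under assumptions: it uses std::condition_variable rather than Boost, a stop flag rather than interrupt_all(), and the names waiting_before_wakeup, all_waiting and wake are invented for the example.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: returns the number of workers known to be blocked
// in wait() at the moment the wake-up is prepared.
int waiting_before_wakeup(int n) {
    std::mutex m;
    std::condition_variable all_waiting;  // workers -> main
    std::condition_variable wake;         // main -> workers
    int waiting = 0;    // guarded by m
    bool stop = false;  // guarded by m

    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            std::unique_lock<std::mutex> lock(m);
            ++waiting;
            all_waiting.notify_one();
            // wait() releases m atomically as it blocks, so main can only
            // hold m while this worker is inside wait().
            wake.wait(lock, [&] { return stop; });
        });
    }

    int observed = 0;
    {
        std::unique_lock<std::mutex> lock(m);
        all_waiting.wait(lock, [&] { return waiting == n; });
        // Main holds m and waiting == n: every worker is inside wait().
        observed = waiting;
        stop = true;
    }
    wake.notify_all();

    for (auto& t : workers) t.join();
    return observed;
}
```

If a counter like this reaches the number of workers before the interrupt is sent, the queue and the sleep can probably be dropped from the test program.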

In any case, I would assume that by analysing the situation right before
the interrupt, you should be able to reproduce the problem with much
less code.

Hope that helps in some way.



> - Stoney
>> I've discovered that under circumstances apparently related to timing
>> and load, sending interrupt_all to a thread_group when all the threads
>> are waiting on a boost::condition_variable leaves one thread waiting
>> about 1/3 of the time. This is with boost 1_40_0 running on Mac OS X
>> 10.6.2, with 32-bit boost libraries. Boost uses the posix thread
>> system here.
>> I boiled my app down to some test code that runs as a command-line
>> app. It's a bit longer than I'd like, but this configuration seems to
>> be necessary to invoke the problem. The test uses a queue to pass
>> "tasks" from the main thread to worker threads, and another queue to
>> pass "results" back to the main thread. The problem is most apparent
>> when all the tasks are finished and the queue empties, so that all the
>> worker threads are waiting on the input queue when the main thread
>> sends interrupt_all.
>> I've looked at the waiting thread in a debugger when this happens, and
>> found that it has been interrupted, but is still waiting on the
>> condition. It looks like it just got missed by the interrupt_all. This
>> is more likely to happen when there are a lot of worker threads (16,
>> or one per core in my testing).
>> The test code is parked at <>, 20KB. It's
>> an XCode 3.2 project, but the five source files could be readily
>> compiled and run in any Unix environment.
>> I don't see any errors in the code that could cause these failures.
>> There is a work-around, which is to interrupt the waiting thread
>> again. This required a modified version of thread_group so I could do
>> a timed_join_all on it.
>> I welcome any suggestions about what could be wrong here, or ways to
>> simplify the test to make it more suitable for a bug report.
>> - Stoney
