
Subject: Re: [Boost-users] thread_group::interrupt_all is not reliable
From: Stonewall Ballard (sb.list_at_[hidden])
Date: 2009-11-30 13:36:02


I think I've found the cause of this problem: it seems that the caller of interrupt_all() should be holding the mutex associated with the condition variable on which the threads are waiting.

The POSIX specification of pthread_cond_broadcast() gave me the clue:
<http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadcast.html>
> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().

In pthread/thread.cpp, thread::interrupt() calls pthread_cond_broadcast().

Although "predictable scheduling" doesn't sound as if it should cover a thread failing to wake up at all, locking that mutex around the call to thread_group::interrupt_all() appears to be 100% reliable.

I can patch my app to do that, but I don't think there's a general solution. The documentation should note that thread::interrupt() isn't reliable unless the caller holds the mutex associated with the condition variable on which the interrupted thread is waiting.

Of course, this could also be a bug in the OS X pthreads implementation.

 - Stoney

> I've discovered that under circumstances apparently related to timing
> and load, sending interrupt_all to a thread_group when all the threads
> are waiting on a boost::condition_variable leaves one thread waiting
> about 1/3 of the time. This is with Boost 1.40.0 running on Mac OS X
> 10.6.2, with 32-bit Boost libraries. Boost uses POSIX threads
> (pthreads) here.
> I boiled my app down to some test code that runs as a command-line
> app. It's a bit longer than I'd like, but this configuration seems to
> be necessary to trigger the problem. The test uses a queue to pass
> "tasks" from the main thread to worker threads, and another queue to
> pass "results" back to the main thread. The problem is most apparent
> when all the tasks are finished and the queue empties, so that all the
> worker threads are waiting on the input queue when the main thread
> sends interrupt_all.
>
> I've looked at the waiting thread in a debugger when this happens, and
> found that it has been interrupted, but is still waiting on the
> condition. It looks as if interrupt_all simply missed it. This
> is more likely to happen when there are a lot of worker threads (16,
> or one per core in my testing).
>
> The test code is parked at <http://sb.org/ThreadTest.zip>, 20KB. It's
> an Xcode 3.2 project, but the five source files could be readily
> compiled and run in any Unix environment.
>
> I don't see any errors in the code that could cause these failures.
> There is a work-around, which is to interrupt the waiting thread
> again. This required a modified version of thread_group so I could do
> a timed_join_all on it.
>
> I welcome any suggestions about what could be wrong here, or ways to
> simplify the test to make it more suitable for a bug report.
>
> - Stoney

-- 
Stonewall Ballard 
stoney_at_[hidden]           http://stoney.sb.org/

Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net