Subject: Re: [Boost-users] thread_group::interrupt_all is not reliable
From: Roland Bock (rbock_at_[hidden])
Date: 2009-12-01 03:29:39


Stonewall Ballard wrote:
> I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
>
> This gave me the clue to try that:
> <http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadcast.html>
>> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
>
> thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp.
>
> Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_group::interrupt_all() appears to be 100% reliable.
>
> I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting.
>
> Of course, this could be a bug in the OS X pthreads implementation as well.

Hi,

FWIW, I ran that test of yours several times with varying parameters on
my machine (quad core, 64-bit, Linux) and it did not show a single
failure. Of course, since the effect is not deterministic even on your
machine, my failure to reproduce it does not mean much, but I
thought you might like to hear anyway :-)

And I totally agree: Predictable scheduling should not be required to
wake up all threads, especially since the document also says

<cite>
The pthread_cond_broadcast() or pthread_cond_signal() functions may be
called by a thread whether or not it currently owns the mutex [...]
</cite>

As for boiling down the application for others to inspect:
Your debugger showed that the thread is still in wait() after the
interrupt call.

Can you ensure that ALL worker threads are in wait() prior to the interrupt?
   * If yes: There seems to be no connection with the interlocked queue,
     the sleep and so on. It should be possible to get rid of all that
     for a much simpler test program.
   * If no: OK, there seems to be a connection between the wait(), the
     interrupt and the sleep and/or mutex.

In any case, I would assume that by analysing the situation right before
the interrupt, you should be able to reproduce the problem with much
less code.
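
To make that concrete, here is a minimal sketch of what such a reduced
test could look like. It is written in standard C++11 rather than Boost
so that it stands alone, and it rests on some assumptions: run_trial and
its parameters are hypothetical names, and a stop flag plus notify_all()
stands in for thread_group::interrupt_all(). It therefore shows only the
skeleton (count the waiters under the mutex, then wake them) and will
not by itself reproduce the Boost/OS X behaviour. For that, the workers
would have to be boost::threads in a thread_group waiting on a
boost::condition_variable.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reduced test: start n workers, block until every one of them
// is inside wait(), then wake them all and report how many actually woke up.
// Assumption: a stop flag plus notify_all() stands in for interrupt_all().
int run_trial(int n, bool hold_mutex_while_notifying) {
    std::mutex m;
    std::condition_variable work_cv;   // workers block here
    std::condition_variable ready_cv;  // main blocks here until all workers wait
    int waiting = 0;
    int woken = 0;
    bool stop = false;

    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            std::unique_lock<std::mutex> lock(m);
            ++waiting;
            ready_cv.notify_one();
            // wait() releases m atomically while blocking, so once main holds
            // m and sees waiting == n, every worker is inside wait().
            work_cv.wait(lock, [&] { return stop; });
            ++woken;  // m is held again here
        });
    }

    {
        std::unique_lock<std::mutex> lock(m);
        ready_cv.wait(lock, [&] { return waiting == n; });
        assert(waiting == n);  // all workers are known to be waiting
        stop = true;
        if (hold_mutex_while_notifying) work_cv.notify_all();
    }
    if (!hold_mutex_while_notifying) work_cv.notify_all();

    for (auto& t : workers) t.join();
    return woken;
}
```

Calling run_trial(16, false) in a loop and comparing the result against
16 would be the analogue of your failing case. run_trial(16, true)
corresponds to your workaround of taking the mutex around the call.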

Hope that helps in some way.

Regards,

Roland

>
> - Stoney
>
>> I've discovered that under circumstances apparently related to timing
>> and load, sending interrupt_all to a thread_group when all the threads
>> are waiting on a boost::condition_variable leaves one thread waiting
>> about 1/3 of the time. This is with boost 1_40_0 running on Mac OS X
>> 10.6.2, with 32-bit boost libraries. Boost uses the posix thread
>> system here.
>> I boiled my app down to some test code that runs as a command-line
>> app. It's a bit longer than I'd like, but this configuration seems to
>> be necessary to invoke the problem. The test uses a queue to pass
>> "tasks" from the main thread to worker threads, and another queue to
>> pass "results" back to the main thread. The problem is most apparent
>> when all the tasks are finished and the queue empties, so that all the
>> worker threads are waiting on the input queue when the main thread
>> sends interrupt_all.
>>
>> I've looked at the waiting thread in a debugger when this happens, and
>> found that it has been interrupted, but is still waiting on the
>> condition. It looks like it just got missed by the interrupt_all. This
>> is more likely to happen when there are a lot of worker threads (16,
>> or one per core in my testing).
>>
>> The test code is parked at <http://sb.org/ThreadTest.zip>, 20KB. It's
>> an XCode 3.2 project, but the five source files could be readily
>> compiled and run in any Unix environment.
>>
>> I don't see any errors in the code that could cause these failures.
>> There is a work-around, which is to interrupt the waiting thread
>> again. This required a modified version of thread_group so I could do
>> a timed_join_all on it.
>>
>> I welcome any suggestions about what could be wrong here, or ways to
>> simplify the test to make it more suitable for a bug report.
>>
>> - Stoney

