Subject: Re: [boost] [thread 1.48] Multiple interrupt/timed_join leads to deadlock
From: Gaetano Mendola (mendola_at_[hidden])
Date: 2012-12-05 06:33:15


On 05/12/2012 09.16, Anthony Williams wrote:
> On 04/12/12 18:32, Gaetano Mendola wrote:
>> Hi all,
>> I was investigating a rare deadlock when issuing an interrupt and
>> a timed_join in parallel. I came up with the following code
>> showing the behavior.
>>
>> The deadlock is rare, so sometimes you need to wait a bit.
>>
>> I couldn't try it with boost 1.52 because the code is invalid
>> due to the precondition of "thread joinable" when issuing the
>> timed_join.
>
> That's a hint.
>
>> Is the code not valid or a real bug?
>
> The code is invalid: you keep trying to interrupt and join even after
> the thread has been joined! Once the thread has been joined, the thread
> handle is no longer valid, and you should exit the loop.

I haven't seen this stated in the documentation.
The loop was meant to exercise exactly this, so you are confirming that
interrupting a joined thread is not valid. How do I safely interrupt
a thread, then?
There is no "atomic" check_joinable_then_interrupt; looking at the
interrupt code, it seems that the check is done inside. I'm lost.
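
To make the question concrete, this is roughly the "atomic"
check-then-interrupt I have in mind. It is only a sketch, and it is built
on std::thread rather than Boost.Thread so that it is self-contained:
GuardedThread and interrupt_if_joinable are names I made up, and the stop
flag is a cooperative stand-in for boost::thread::interrupt(), not the
real mechanism.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical wrapper: one mutex serializes every access to the handle,
// so "is it still joinable?" and "interrupt it" happen as a single step.
class GuardedThread {
public:
    // The worker receives the stop flag and is expected to poll it.
    explicit GuardedThread(std::function<void(const std::atomic<bool>&)> body)
        : stop_(false), thread_([this, body] { body(stop_); }) {}

    // Request the stop before joining, otherwise a polling worker never ends.
    ~GuardedThread() {
        interrupt_if_joinable();
        join();
    }

    // Returns false, and does nothing, once the thread has been joined.
    bool interrupt_if_joinable() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!thread_.joinable()) return false;
        stop_.store(true);  // stand-in for boost::thread::interrupt()
        return true;
    }

    // Safe to call more than once; only the first call actually joins.
    // Note: the lock is held while joining, so a concurrent
    // interrupt_if_joinable() waits until the worker has exited.
    void join() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (thread_.joinable()) thread_.join();
    }

private:
    std::atomic<bool> stop_;   // declared before thread_, which reads it
    std::mutex mutex_;
    std::thread thread_;
};
```

With something like this, a late interrupt on an already joined thread
is simply reported as "false" instead of touching a dead handle.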

In order to cope with a bug in 1.40 (an interrupt to a thread could have
been lost) I have implemented my own ThreadGroup:

ThreadGroup::interrupt_all() {
   // pseudocode: the interrupt is re-issued until timed_join() succeeds
   for_each_thread(
     boost::thread::interrupt();
     if ( boost::thread::timed_join() ) {
        move_to_next_thread
     }
   )
}

Also, since boost::thread_group doesn't provide a "join_any" method
with the semantics of issuing an interrupt_all as soon as any of the
threads terminates, I have implemented join_any this way:

ThreadGroup::join_any() {
   // pseudocode: poll every thread; the first successful timed_join()
   // triggers interrupt_all()
   while(true) {
    for_each_thread(
      if ( boost::thread::timed_join() ) {
        interrupt_all();
      } else {
        move_to_next_thread
      }
    )
   }
}
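
For comparison, here is a sketch of the same join_any semantics that
never polls a handle and never touches one after it has been joined. It
is not my actual ThreadGroup and it does not use Boost.Thread:
SimpleThreadGroup, Body and the stop flag are hypothetical names, the
flag stands in for interrupt_all(), and a condition variable replaces the
timed_join polling.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical single-use group. Call add() and join_any() from the
// owning thread only.
class SimpleThreadGroup {
public:
    using Body = std::function<void(const std::atomic<bool>&)>;

    ~SimpleThreadGroup() { stop_and_join_all(); }

    void add(Body body) {
        threads_.emplace_back([this, body] {
            body(stop_);
            {
                std::lock_guard<std::mutex> lock(mutex_);
                ++finished_;            // record that a worker returned
            }
            any_done_.notify_all();
        });
    }

    // Blocks until at least one worker has returned, then stops and
    // joins all of them. Returns the number of handles joined.
    std::size_t join_any() {
        if (threads_.empty()) return 0;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            any_done_.wait(lock, [this] { return finished_ > 0; });
        }
        return stop_and_join_all();
    }

private:
    std::size_t stop_and_join_all() {
        stop_.store(true);  // stand-in for interrupt_all(), issued once
        std::size_t joined = 0;
        for (std::thread& t : threads_) {
            if (t.joinable()) {
                t.join();
                ++joined;
            }
        }
        threads_.clear();   // joined handles are never touched again
        return joined;
    }

    std::atomic<bool> stop_{false};
    std::mutex mutex_;
    std::condition_variable any_done_;
    std::size_t finished_ = 0;
    std::vector<std::thread> threads_;
};
```

The point of the design is that each handle is joined exactly once and
from a single thread, so no interrupt can race with a join.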

This has been working well for 2 years now. Since upgrading to 1.48 I'm
experiencing deadlocks and core dumps. The backtrace shows that
a timed_join crashes if the thread happens to terminate at the same
time. Given that the 1.48 documentation says nothing about not calling
timed_join concurrently with interrupt, and that it specifies no
precondition on the interrupt method, I assumed the above code would be
harmless with 1.48.
I can try to remove the code that keeps issuing an interrupt until the
timed_join exits successfully, trusting that after an interrupt a
boost::thread exits if it is at, or reaches, an interruption point, but
I'm not quite convinced that this will reliably solve the
deadlocks/crashes.
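
The simplified behavior would look roughly like the sketch below: the
stop request is issued a single time, followed by a bounded wait for the
worker to reach its next check. Again this uses only the standard
library as a stand-in, so the atomic flag plays the role of the
interrupt, the future plays the role of timed_join, and
interrupt_once_then_wait and worker are hypothetical names.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <future>
#include <thread>

// Worker that only notices the request when it reaches its check of the
// flag, which plays the role of an interruption point here.
void worker(const std::atomic<bool>& stop) {
    while (!stop.load()) {  // "interruption point"
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
}

// Request the stop once, then wait up to `timeout` for the worker to
// return. Returns true if the worker finished in time.
bool interrupt_once_then_wait(std::atomic<bool>& stop,
                              std::future<void>& done,
                              std::chrono::milliseconds timeout) {
    stop.store(true);  // issued a single time, never repeated
    return done.wait_for(timeout) == std::future_status::ready;
}
```

A caller would start the worker with
std::async(std::launch::async, worker, std::cref(stop)) and treat a
"false" result as a worker that has not reached a check yet, not as a
reason to interrupt again.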

I will remove the "redundant" code from my ThreadGroup and run my
regression tests. I'll be back as soon as I have some hints.

Regards
Gaetano Mendola

