|
Boost Users : |
Subject: Re: [Boost-users] [Asio] io_service destructor never completes
From: OvermindDL1 (overminddl1_at_[hidden])
Date: 2010-07-04 22:41:27
On Fri, Jul 2, 2010 at 9:55 AM, Ragnar Cederlund
<ragnar.cederlund_at_[hidden]> wrote:
> Hi,
>
> I'm having trouble with the io_service destructor never completing. The
> application I am working on uses one worker thread to handle all actual work
> to be done including communication on a socket in handler functions posted
> to the asio queue and another thread to handle the gui of the application.
>
> At a certain point in time the program decides that it has nothing more to
> do and calls the io_service::stop function. Shortly thereafter the thread
> returns from its call to the io_service::run function. The next thing that
> happens is that the io_service object goes out of scope and thus the
> io_service destructor is called. After this the thread signals that it is
> finished and the application exits.
>
> The problem I am seeing is that the destructor never finishes. Or actually,
> a customer sees this problem, but I have not been able to reproduce this so
> all the information I have comes from adding (excessive) logging in the
> boost source.
>
> The io_service destructor loops through all services and calls
> shutdown_service on each one. The win_iocp_io_service handles this call by
> entering a while loop that should complete when the outstanding_work_
> variable becomes zero. Inside the loop, there is a call to
> GetQueuedCompletionStatus to dequeue a completion packet, which is then
> tested to see if it contains a valid operation that is then destroyed and
> the outstanding_work_ variable is decremented.
>
> In all the runs I have made, outstanding_work_ is initially 1 when the
> shutdown_service function is entered. The first and second calls to
> GetQueuedCompletionStatus come back immediately and the second results in
> the outstanding_work_ variable being decremented and the loop exits.
> Everything is fine.
>
> At the customer location however, the second call times out instead of
> completing immediately and it continues to do so for subsequent loop
> iterations until the application is forcibly closed. There is only one of
> our customers who gets this problem. The customer runs Windows XP SP3, but I
> have failed to reproduce this on my own XP test machine.
>
> We have tried both Boost version 1.43 and version 1.40 with the same result.
> In both cases, the customer has the problem but I cannot reproduce it. The
> asio behavior has changed somewhat between the versions and the description
> of the problem above is with version 1.43.
>
> Does anyone have any idea of what may be going on? Any hints or tips are
> appreciated! Could we be using asio wrong in some way that could cause this,
> but we're just lucky most of the time? The customer may be running in a
> virtual machine, could this affect io completion ports in a way that could
> be problematic for asio?
I have been experiencing this *EXACT* same issue, exactly as described
above, when using an external library that uses Boost.ASIO internally
to host a webserver.
It happens even when running the examples, locally.
Here is the copy/paste of my email in their mailing list when I was
trying to figure it out yesterday:
On Sat, Jul 3, 2010 at 8:16 PM, OvermindDL1 <overminddl1_at_[hidden]> wrote:
> On Sat, Jul 3, 2010 at 7:32 PM, OvermindDL1 <overminddl1_at_[hidden]> wrote:
>> On Sat, Jul 3, 2010 at 7:31 PM, OvermindDL1 <overminddl1_at_[hidden]> wrote:
>>> On Sat, Jul 3, 2010 at 5:57 PM, OvermindDL1 <overminddl1_at_[hidden]> wrote:
>>>> On Sat, Jul 3, 2010 at 3:48 PM, OvermindDL1 <overminddl1_at_[hidden]> wrote:
>>>>> On Sat, Jul 3, 2010 at 5:47 AM, Alex Sadovsky <alexeysad_at_[hidden]> wrote:
>>>>>> On Sat, Jul 3, 2010 at 10:46 AM, OvermindDL1 <overminddl1_at_[hidden]> wrote:
>>>>>>>
>>>>>>> On Fri, Jul 2, 2010 at 11:03 PM, OvermindDL1 <overminddl1_at_[hidden]>
>>>>>>> wrote:
>>>>>>> > I am still trying to figure out if it is something with my setup or
>>>>>>> > what, but everytime my server gets under any tiny bit of load from
>>>>>>> > multiple users, it just 'stops', will not respond to localhost, just
>>>>>>> > 'nothing', in both debug and release.
>>>>>>> >
>>>>>>> > If I run it under debug in Visual Studio, VS (or one of my plugins,
>>>>>>> > unsure anymore, it all just melds together over time) reports that the
>>>>>>> > app is deadlocked. This happens with even simple examples that
>>>>>>> > require a session, like the Bob Smith example (which is what I am
>>>>>>> > trying to trace it through currently since it is so simple).
>>>>>>> >
>>>>>>> > Windows XP SP3.
>>>>>>> > 4 gigs of ram (although XP 32-bit only uses ~3gigs).
>>>>>>> > Computer is not under load.
>>>>>>> > Using the simple Bob Smith example that comes with the latest Wt
>>>>>>> > Master/head, all build types, just get two people to connect at once,
>>>>>>> > seems to cause a deadlock (very easy to reproduce when they refresh
>>>>>>> > the page repeatedly). Any thoughts?
>>>>>>>
>>>>>>>
>>>>>>> Interestingly, the browser still initiates a connection to the server,
>>>>>>> but never gets a response, it just hangs waiting for the response...
>>>>>>>
>>>>>>
>>>>>> I can confirm the same behavior on a Win2003r2/32bit sytem. I haven't dug
>>>>>> into this too much though.
>>>>>
>>>>> I am using Visual Studio 2005 Standard for note, and although I have a
>>>>> deadlock detector, I do not have one that can identify them ahead of
>>>>> time nor what causes them. I might try to look at this later if I get
>>>>> a chance.
>>>>
>>>> Actually I have a moment, and did just a quick check to see what the threads do.
>>>>
>>>> Using the bobsmith.wt.exe demo, I launched the debug version (so I can
>>>> see it stack data easily), opened up a browser, and held down refresh
>>>> for a couple of seconds. At that time I saw all the worker threads
>>>> just die, vanish, kaput, there were two threads left at that point,
>>>> one was some wmifree thread, but it disappeared a few seconds later,
>>>> and the last one remaining is the mainCRTStartup thread, the top of
>>>> its stack holding:
>>>> bobsmith.wt.exe!boost::this_thread::interruptible_wait+0x211
>>>> So it is still waiting for the threads to return, but since they all
>>>> died, it does not.
>>>>
>>>> Running it again, I caught the entire stack for the second thread that
>>>> lives for a few seconds after the thing dies, it is:
>>>> ADVAPI32.dll!WmiFreeBuffer+0x24e
>>>> kernel32.dll!GetModuleFileNameA+0x1ba
>>>>
>>>> I ran it again, and set the thread count to 1, and debugged that
>>>> thread, apparently when it seems to 'lock up', the ASIO do_one
>>>> function never returns, it never gets signal from the socket that
>>>> there is incoming data, and then the socket closes, which kills ASIO.
>>>>
>>>> Debugging into the actual GetQueuedCompletionStatus call in ASIO, it
>>>> seems that it always return status code WAIT_TIMEOUT when not serving
>>>> anything, but the last time through, the one that it is killed right
>>>> after, the error code is 0.
>>>>
>>>> Debugging into it and tracing it line by line, it seems that the
>>>> thread stopped flag becomes set, trying to find out why...
>>>>
>>>> Searched for every place that variable is ever set to 1, which is
>>>> apparently only one place, the stop() method, to try to figure out
>>>> what is setting it, and it is hit, so something is indeed setting it,
>>>> tracing back up reveals it is called by:
>>>> bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::work_finished()
>>>> Line 315 C++
>>>> bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::auto_work::~auto_work()
>>>> Line 641 C++
>>>> bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::do_one(bool
>>>> block=true, boost::system::error_code & ec={...}) Line 548 + 0x19
>>>> bytes C++
>>>> bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::run(boost::system::error_code
>>>> & ec={...}) Line 229 + 0xe bytes C++
>>>> bobsmith.wt.exe!boost::asio::io_service::run() Line 68 + 0xf bytes C++
>>>> bobsmith.wt.exe!http::server::Server::run() Line 213 C++
>>>>
>>>> So basically, it is saying that it is running out of work, hmm...
>>>>
>>>> After some more digging, it seems that the on_completion handler for
>>>> handling a new socket is failing for some reason, which is causing the
>>>> work to be decremented, without it ever being incremented, trying to
>>>> figure out what is going on here, but regardless, with the auto-work
>>>> holder is released right after it, it causes the thread to close since
>>>> there is no more work.
>>>>
>>>> Okay, figured out what is closing it all, the IO completion port call
>>>> returns ERROR_SEM_TIMEOUT, so the semaphore is timing out, and it
>>>> seems like it is timing out *before* the timeout call, this generally
>>>> indicates a failure condition in the port, hmm...
>>>
>>> Hmm, based on some other information noticed, it seems there is a bad
>>> WMI state at the time that happens, may or may not be related,
>>> although probably is...
>>
>> Also, this happens on two different computers with two very different
>> networking cards (same OS though).
>
> Decided to watch the network traffic, apparently what kills it pretty
> well is opening up about 8 connections at once to it, most are in
> CLOSE_WAIT status, a few in ESTABLISHED, and the ones that locked up
> are in SYN_RCVD, but the computer never gets another response, and it
> seems that when any of these SYN_RCVD state ports are timed out, the
> program receives the ERROR_SEM_TIMEOUT (which is not related at all to
> the timeout given to the function I found out by reading, that would
> return something different). Technically this error should never be
> thrown from that function according to MSDN, but it is, maybe a bug in
> the Windows Networking Stack pre-Vista, unsure yet, can anyone else
> find out anything?
My test computers are both Windows XP SP3, compiled using Visual
Studio 2005, fully updated. I have been trying my darndest to figure
out what is going on in that handler that calls that one destructor
that sets the reference count low enough that the thread dies on the
next invocation.
I honestly do not know ASIO nor IOCP well enough to know what is going
on, but if anyone does, and they have access to VS 2005, I can setup a
remote debugging platform so they can debug into the problem as it
happens on my computer since I can reliably cause it to happen every
single time.
The above post, Ragnar Cederlund, if you want, I can test your product
as well, if it happens on my computer too then I can let you remote
debug it in real-time (using the remote debugging capabilities of
Visual Studio 2005, or I can setup a quick VNC server for local access
if you prefer that, if you do not want me to have access to your code
then the remote debugging capabilities of your IDE will work, although
I do not have any above 2005, I can install the remote debugging
toolkits from the microsoft site if you need me to).
Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net