Boost logo

Boost-Build :

From: Jurko Gospodnetiæ (jurko.gospodnetic_at_[hidden])
Date: 2008-01-20 09:27:14


[This is a repost since the original post still has not appeared on the
list and I had not gotten the boost mailing list confirmation e-mail in
8 hours.]

   Hi.

   Bjam has a bug with how it waits for its child processes to terminate
on Windows. There are two problems, both caused by the try_wait()
function in the execnt.c module:

   1. Child process returning the value 259 causes bjam never to detect
that it exited and therefore keep running in an endless loop.

   2. In case when not all available process slots (cmdtab array in
execnt.c) are used, bjam goes into an active loop waiting for one of the
running processes to terminate, using up one of the available processors
just for this wait.

   I am attaching a patch solving both of these problems so please
review it and tell me if it is OK to go ahead and commit it. It uses no
Windows APIs other then those already used so there should be little to
no risk of backward compatibility problems with older OSs.

   Here's the patch description:

-------------------------------------------------------------------

   Solved the problem with child process returning the value 259 causing
bjam never to detect that it exited and therefore keep running in an
endless loop. Done by relying on the Windows WaitForMultipleObjects()
function to detect when a process has exited instead of
GetExitCodeProcess(). The later function's MSDN article warns about this
problem.

   Solved the problem with bjam going into an active wait state, hogging
up processor resources, when waiting for one of its child processes to
terminate while not all of its available child process slots are being
used. To see this bug in action, try compiling a simple C++ program on a
2 processor PC with the -j 2 command-line option and watching how much
processor resources bjam uses while linking. Was caused by treating
unused process slots as used in the try_wait() function, causing the
WaitForMultipleObjects() Windows API call to exit instantly with an
error which was then getting incorrectly interpreted as a timeout,
starting the whole cycle anew.

   Refactored the try_wait() function. Extracted all
GetExitCodeProcess() API calls into one location so it no longer gets
called in three separate places.

   Minor comment changes in code touched by previous fixes.

-------------------------------------------------------------------

   Hope this helps.

   Best regards,
     Jurko Gospodnetiæ

Index: execnt.c
===================================================================
--- execnt.c (revision 42875)
+++ execnt.c (working copy)
@@ -499,7 +499,7 @@
         /* the time data for the command */
         record_times(cmdtab[i].pi.hProcess, &time);
 
- /* Clear the temp file */
+ /* clear the temp file */
         if ( cmdtab[i].tempfile_bat )
         {
             unlink( cmdtab[ i ].tempfile_bat );
@@ -507,6 +507,9 @@
             cmdtab[i].tempfile_bat = NULL;
         }
 
+ /* find out the process exit code */
+ GetExitCodeProcess( cmdtab[i].pi.hProcess, &cmdtab[i].exitcode );
+
         /* the dispossition of the command */
         if( intr )
             rstat = EXEC_CMD_INTR;
@@ -884,43 +887,41 @@
     }
 }
 
-/* waits for a single child process command to complete, or the
- timeout, whichever is first. returns the index of the completed
- command, or -1. */
+/* waits for a single child process command to complete, or the timeout,
+ whichever comes first. returns the index of the completed command, or -1. */
 static int try_wait(int timeoutMillis)
 {
- int i, num_active, waiting;
+ int i;
+ int num_active;
+ int wait_api_result;
     HANDLE active_handles[MAXJOBS];
     int active_procs[MAXJOBS];
 
- for ( waiting = 1; waiting; )
+ /* prepare a list of all active processes to wait for */
+ for ( num_active = 0, i = 0; i < globs.jobs; ++i )
     {
- /* find the first completed child process */
- for ( num_active = 0, i = 0; i < globs.jobs; ++i )
+ if ( cmdtab[i].pi.hProcess )
         {
- /* if we have an already dead process, return it. */
- cmdtab[i].exitcode = 0;
- if ( GetExitCodeProcess( cmdtab[i].pi.hProcess, &cmdtab[i].exitcode ) )
- {
- if ( STILL_ACTIVE != cmdtab[i].exitcode )
- {
- return i;
- }
- }
- /* it's running, add it to the list to watch for */
             active_handles[num_active] = cmdtab[i].pi.hProcess;
             active_procs[num_active] = i;
- num_active += 1;
+ ++num_active;
         }
-
- /* wait for a child to complete, or for our timeout window to expire */
- if ( waiting )
- {
- WaitForMultipleObjects( num_active, active_handles, FALSE, timeoutMillis );
- waiting = 0;
- }
     }
-
+
+ /* wait for a child to complete, or for our timeout window to expire */
+ wait_api_result = WaitForMultipleObjects( num_active, active_handles,
+ FALSE, timeoutMillis );
+ if
+ (
+ ( WAIT_OBJECT_0 <= wait_api_result ) &&
+ ( wait_api_result < WAIT_OBJECT_0 + num_active )
+ )
+ {
+ /* terminated process detected - return its index */
+ return active_procs[ wait_api_result - WAIT_OBJECT_0 ];
+ }
+
+ /* timeout */
     return -1;
 }
 
@@ -941,9 +942,7 @@
                 close_alert(cmdtab[i].pi.hProcess);
                 /* we have a "runaway" job, kill it */
                 kill_process_tree(0,cmdtab[i].pi.hProcess);
- /* and return it as complete, with the failure code */
- GetExitCodeProcess( cmdtab[i].pi.hProcess, &cmdtab[i].exitcode );
- /* mark it as a timeout */
+ /* and return it marked as a timeout */
                 cmdtab[i].exit_reason = EXIT_TIMEOUT;
                 return i;
             }


Boost-Build list run by bdawes at acm.org, david.abrahams at rcn.com, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk