
Boost-Build :

From: Bronek Kozicki (brok_at_[hidden])
Date: 2006-06-25 17:47:01


... which actually is addition that I proposed and implemented some time ago.
As some of you might be aware, for the last two weeks or so I have been sending
diffs to the Boost.Build mailing list that claimed to fix some problems. I think it
would be nice to describe what these problems are and how I decided to address
them (at least on my local computer). But first, something more about this new
functionality. As you know, some tests may fail, and often they do. When a test
fails, it sometimes displays a message box reporting a failed assertion, a memory
access violation, etc. While the message box is displayed, the process is blocked
waiting for the user to press the "OK" button. The only solution we had so far was
to wait some predefined time (bjam option -l1234, where 1234 is the time in seconds)
and then kill the whole branch of the process tree. From a tester's perspective this
solution is rather poor: for some predefined period of time the computer is
doing nothing, just waiting for someone to press the "OK" button. If we (testers)
make this period too short, bjam might kill a running program that just takes
longer to complete (e.g. a compiler struggling with complex templates). If it's
too long, then this idle period might be significant, especially when (if)
there are plenty of errors (try vc8 with /clr), which leaves less time to run
meaningful tests. Thus the idea to automatically close all error messages
displayed during regression tests. The first implementation of this idea was the
--monitored switch in regression.py, which started a special process that
automatically closed all windows that looked like error messages. I used it
and was rather unhappy with the results: too many innocent windows were closed,
including error reports from programs that have nothing to do with the Boost
regression tests. Thus I decided to implement a solution that could be incorporated
into bjam, and Rene Rivera kindly integrated my code into execnt.c on June 7th. Only
a little later I bought a big external drive for my laptop and created
another runner (Bronek-v2) for regression tests, this time based on bjam v2
with this update. This allowed me to put it under more stress and (unfortunately)
I found some problems with it.

- the idea was to keep a message box visible only for a short period of time. As the
code looks now, message boxes are closed only after the process times out, which
defeats the goal of keeping the machine busy running tests rather than waiting for
a user to click the OK button. On the other hand, a "review" of all windows is not
free. If it runs every second, bjam consumes about 1% CPU time on my computer
(1.2GHz Pentium M with 2MB cache, which is supposedly equivalent to a 2GHz
Pentium 4) just waiting for its child process to finish. I finally decided to
run it every N iterations of the wait loop (the loop that waits for the child process
to terminate), where N is (currently) hardcoded as 5. This means that an error
message will be visible for at most 5 seconds, and then bjam will close it.
Overhead is further reduced by an additional call to IsWindowVisible. There is
still room for improvement in multiprocessor runs (bjam option -j, IIRC), when
we are waiting for multiple processes to finish. The current solution just
looks over all windows for each direct child process separately, which is
quite a pessimisation.

- there is a memory leak (or actually a handle leak, which is almost the same) in
the function is_parent_child: it returns the result of a recursive call to
is_parent_child without calling CloseHandle beforehand. If the "message box review"
is done only when a process times out (as it is now), this leak is not very severe,
but things look much worse if the "review" is run often. Obviously, I fixed it.

- besides windows displayed by the failing test process (assertion) or its child
(memory violation), there are also message boxes displayed by the Win32 Subsystem
Server, that is, the csrss.exe process. These message boxes look like "abc.exe -
Unable To Locate Component" "This application has failed to start because
boost_foo-bar.dll was not found". Obviously, we want to close these too,
because they just mean that a dynamic library (that the test depends on) failed to
build, which in itself is a valid (although negative) test result.
Unfortunately, I'm unable to detect whether such a message refers to a process
started by bjam or to something else; I just close it.

- now the difficult part. Windows rather aggressively reuses the numbers used as
process identifiers (process ids): once a process is gone, any later process may
receive the same id. The problem is that there might be processes "orphaned"
by the one that just exited, and I know no way to distinguish an orphaned
process from the rest. This means that for the purpose of building the process
tree we cannot be sure whether a parent process id refers to one of the processes
we are waiting for, or to some other process that is long gone. I do not know a
good workaround for this, but I know the problems it caused me: explorer.exe is
almost always an orphaned process - its parent, userinit.exe, runs only for a short
period of time, just when the user logs on. Some concrete numbers: if userinit.exe
had process id 1296, this process id remains in the parent process id field of
explorer.exe. If any of our processes (those started directly or indirectly
by bjam.exe) receives process id 1296, which sooner or later will happen
(because there are thousands of processes created during a typical single
regression test run), then for the purpose of building the process tree it will look
like explorer.exe is a child of bjam.exe, and all message boxes displayed by
children of explorer.exe will now be closed. One might object "but there
usually are no message boxes", but that is true only for a system that has very few
programs started automatically on user logon. Many of these programs (usually
visible only as icons in the system tray area - examples are the Toshiba system
utilities, and my laptop is a Toshiba) actually own some simple message box.
bjam.exe, confused by the reuse of process id numbers in the guts of the operating
system, will close all such programs. This is bad, and the only workaround I have
come up with is to hardcode a rule in execnt.c that "explorer.exe is never a child
of bjam.exe". And, obviously, to check IsWindowVisible.
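
To make the process id reuse scenario from the last item concrete, here is a minimal, portable C sketch. It is an illustration only and does not use the Win32 Toolhelp API: proc_entry, looks_like_descendant, demo_table, and the test.exe and tray.exe entries are hypothetical stand-ins. Only the id 1296 and the "explorer.exe is never our child" rule come from the text above. It uses strcmp where the diff uses the case-insensitive stricmp.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one process snapshot record. */
typedef struct {
    unsigned pid;
    unsigned parent_pid;
    const char *exe;
} proc_entry;

/* userinit.exe (id 1296) is long gone; a test started by bjam.exe later
   received the same id, so explorer.exe now appears to hang below bjam.exe. */
static const proc_entry demo_table[] = {
    { 2000,  500, "bjam.exe"     },
    { 1296, 2000, "test.exe"     },  /* reused id of the dead userinit.exe */
    { 1400, 1296, "explorer.exe" },  /* stale parent id still says 1296    */
    { 1500, 1400, "tray.exe"     }   /* hypothetical tray utility          */
};

/* Walks parent ids upwards from `child` and returns 1 if `parent` is reached.
   When `guard_shell` is non-zero, the walk stops with 0 as soon as it meets
   explorer.exe, mirroring the hardcoded rule described above. */
static int looks_like_descendant(const proc_entry *table, size_t n,
                                 unsigned parent, unsigned child,
                                 int guard_shell)
{
    size_t depth, i;

    assert(table != NULL);

    /* depth is bounded so that a cycle of stale ids cannot loop forever */
    for (depth = 0; depth <= n && child != 0; ++depth) {
        const proc_entry *entry = NULL;

        if (child == parent)
            return 1;

        for (i = 0; i < n; ++i) {
            if (table[i].pid == child) {
                entry = &table[i];
                break;
            }
        }
        if (entry == NULL)
            return 0;   /* process not in the snapshot */

        if (guard_shell && strcmp(entry->exe, "explorer.exe") == 0)
            return 0;   /* never treat the shell as one of ours */

        child = entry->parent_pid;
    }
    return 0;
}
```

With this table, looks_like_descendant(demo_table, 4, 2000, 1500, 0) reports the tray utility as a descendant of bjam.exe, which is the false positive described above. With guard_shell set to 1 the same call returns 0, while the real test process (id 1296) is still recognised as a child.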

The fixes described above are implemented in the attached diff.
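
Going back to the first item, the "review every N iterations of the wait loop" idea can be sketched in portable C as follows. This is a simplified model, not the code from execnt.c: review_due and count_reviews are hypothetical names, and each loop pass stands for one 1-second wait that timed out.

```c
#include <assert.h>

/* Returns non-zero when the window review should run on this pass.
   `iteration` counts passes through the wait loop, starting at 1;
   `period` is N from the text (currently hardcoded as 5 there). */
static int review_due(unsigned iteration, unsigned period)
{
    assert(period > 0);   /* guard against division by zero */
    return iteration % period == 0;
}

/* Simulates `seconds` timed-out passes of the wait loop and returns
   how many times the (expensive) window review would have run. */
static unsigned count_reviews(unsigned seconds, unsigned period)
{
    unsigned iteration, reviews = 0;

    for (iteration = 1; iteration <= seconds; ++iteration) {
        if (review_due(iteration, period))
            ++reviews;
    }
    return reviews;
}
```

For example, count_reviews(60, 5) gives 12 reviews per minute instead of the 60 that a review on every pass would cost.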

B.


--- C:\Documents and Settings\Bronek Kozicki\Desktop\execnt.c Thu Jun 8 02:25:58 2006 UTC
+++ E:\DEVEL\BOOST_RTEST\boost\tools\jam\src\execnt.c Sun Jun 25 18:55:27 2006 UTC
@@ -1001,8 +1001,10 @@
 int is_parent_child(DWORD parent, DWORD child)
 {
     HANDLE process_snapshot_h = INVALID_HANDLE_VALUE;
 
+    if (!child)
+        return 0;
     if (parent == child)
         return 1;
 
     process_snapshot_h = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS,0);
@@ -1015,49 +1017,100 @@
             ok = Process32First(process_snapshot_h, &pinfo);
             ok == TRUE;
             ok = Process32Next(process_snapshot_h, &pinfo) )
         {
-            if (pinfo.th32ProcessID == child && pinfo.th32ParentProcessID)
+            if (pinfo.th32ProcessID == child)
+            {
+                CloseHandle(process_snapshot_h);
+                if (!stricmp(pinfo.szExeFile, "explorer.exe"))
+                {
+                    /* explorer.exe is orphaned and process_id of its parent may
+                       accidentally match process_id of process we are after. We must not
+                       close dialog boxes displayed by children of explorer.exe even
+                       though (thanks to its parent process id) it might appear to be
+                       our child. This is not very reliable - there might be more
+                       orphaned processes or shell might be something else than
+                       explorer.exe, but this is most common and important scenario */
+                    return 0;
+                }
+                if (!stricmp(pinfo.szExeFile, "csrss.exe"))
+                {
+                    /* csrss.exe may display message box like following:
+                         xyz.exe - Unable To Locate Component
+                         This application has failed to start because boost_foo-bar.dll
+                         was not found. Re-installing the application may fix the problem
+                       This actually happens when starting test process that depends on
+                       dynamic library which failed to build. We want to automatically
+                       close these message boxes even though csrss.exe is not our
+                       child process. We may depend on the fact that (in all current
+                       versions of Windows) csrss.exe is indirectly child of System which
+                       always has process id == 4 */
+                    if (is_parent_child(4, pinfo.th32ParentProcessID))
+                        return 1;
+                }
                 return is_parent_child(parent, pinfo.th32ParentProcessID);
+            }
         }
 
         CloseHandle(process_snapshot_h);
     }
 
     return 0;
 }
 
-int related(HANDLE h, DWORD p)
+int related(DWORD d, DWORD p)
 {
-    return is_parent_child(get_process_id(h), p);
+    return is_parent_child(d, p);
 }
 
+typedef struct PROCESS_HANDLE_ID {HANDLE h; DWORD pid;} PROCESS_HANDLE_ID;
+
 BOOL CALLBACK window_enum(HWND hwnd, LPARAM lParam)
 {
-    char buf[10] = {0};
-    HANDLE h = *((HANDLE*) (lParam));
+    char buf[7] = {0};
+    PROCESS_HANDLE_ID p = *((PROCESS_HANDLE_ID*) (lParam));
     DWORD pid = 0;
+    DWORD tid = 0;
+
+    if (!IsWindowVisible(hwnd))
+        return TRUE;
 
-    if (!GetClassNameA(hwnd, buf, 10))
-        return TRUE; // failed to read class name
+    if (!GetClassNameA(hwnd, buf, 7))
+        return TRUE; /* failed to read class name; presume it's not a dialog */
 
     if (strcmp(buf, "#32770"))
-        return TRUE; // not a dialog
+        return TRUE; /* not a dialog */
 
-    GetWindowThreadProcessId(hwnd, &pid);
-    if (related(h, pid))
+    tid = GetWindowThreadProcessId(hwnd, &pid);
+    if (tid && related(p.pid, pid))
     {
-        PostMessage(hwnd, WM_QUIT, 0, 0);
-        // just one window at a time
+        /* ask really nice */
+        PostMessageA(hwnd, WM_CLOSE, 0, 0);
+        /* now wait and see if it worked. If not, insist */
+        if (WaitForSingleObject(p.h, 200) == WAIT_TIMEOUT)
+        {
+            PostThreadMessageA(tid, WM_QUIT, 0, 0);
+            if (WaitForSingleObject(p.h, 500) == WAIT_TIMEOUT)
+            {
+                PostThreadMessageA(tid, WM_QUIT, 0, 0);
+                WaitForSingleObject(p.h, 500);
+            }
+        }
         return FALSE;
     }
 
     return TRUE;
 }
 
 void close_alert(HANDLE process)
 {
-    EnumWindows(&window_enum, (LPARAM) &process);
+    DWORD pid = get_process_id(process);
+    /* If process already exited or we just cannot get its process id, do not go any further */
+    if (pid)
+    {
+        PROCESS_HANDLE_ID p = {process, pid};
+        EnumWindows(&window_enum, (LPARAM) &p);
+    }
 }
 
 static int
 my_wait( int *status )
@@ -1093,23 +1146,29 @@
         }
     
     if ( globs.timeout > 0 )
     {
+        unsigned int alert_wait = 1;
         /* with a timeout we wait for a finish or a timeout, we check every second
          to see if something timed out */
-        for (waitcode = WAIT_TIMEOUT; waitcode == WAIT_TIMEOUT;)
+        for (waitcode = WAIT_TIMEOUT; waitcode == WAIT_TIMEOUT; ++alert_wait)
         {
             waitcode = WaitForMultipleObjects( num_active, active_handles, FALSE, 1*1000 /* 1 second */ );
             if ( waitcode == WAIT_TIMEOUT )
             {
                 /* check if any jobs have surpassed the maximum run time. */
                 for ( i = 0; i < num_active; ++i )
                 {
                     double t = running_time(active_handles[i]);
+
+                    /* periodically (each 5 secs) review and close alert dialogs hanging around */
+                    if ((alert_wait % ((unsigned int) 5)) == 0)
+                        close_alert(active_handles[i]);
+
                     if ( t > (double)globs.timeout )
                     {
                         /* the job may have left an alert dialog around,
- try and get rid of it before killing */
+ try and get rid of it before killing */
                         close_alert(active_handles[i]);
                         /* we have a "runaway" job, kill it */
                         kill_all(0,active_handles[i]);
                         /* indicate the job "finished" so we query its status below */

