From: Gennadiy Rozental (gennadiy.rozental_at_[hidden])
Date: 2003-09-26 06:24:49


"David Abrahams" <dave_at_[hidden]> wrote in message
news:uvfrgprbf.fsf_at_boost-consulting.com...
> "Gennadiy Rozental" <gennadiy.rozental_at_[hidden]> writes:
>
> >> The execution_monitor and family seem a useful facility. I can see
> >> three phases where it can be useful:
> >>
> >> 1- Development
> >> 2- Testing
> >> 3- Release (field deployment)
> >
> > Ok. Let's slow down a little and return to the basics, because I may
> > be missing some important points. Could we answer the following
> > questions with regard to the 3 scenarios above (by development I mean
> > running under a debugger, while testing is a standalone test run)?
> > Please give 3 separate answers to every question but the first:
> >
> > 1. Is an SEH an async exception?
>
> A structured exception is "asynchronous" in the sense that it may
> occur in response to an event that is not expected to throw, from a
> region of code whose correctness relies on the fact that it doesn't
> throw. In other words, nobody expects their code to dereference a
> NULL pointer in the course of normal operation, so with respect to
> program correctness it is as though some other thread or process
> decided to "inject" an exception into the program's execution
> "asynchronously" (on a hypothetical system which supports such
> injection, c.f. Java thread cancellation).

Ok.

> > 2. Does an SEH always signal a nonrecoverable error?
>
> I'm not certain but I *think* the answer is no, because you can
> explicitly raise an SEH. And of course you can get lucky, and have an
> SEH raised somewhere that a regular C++ exception could normally be
> thrown, in response to a condition whose significance disappears
> during unwinding (e.g. a bad non-owning pointer in an object on the
> stack which gets destroyed during unwinding anyway).
>
> But fundamentally, unless you're explicitly raising an SEH, it signals
> that there's something seriously wrong with your program - something
> which may be the result of any unpredictable kind of corruption.
> Since you don't know what caused the problem (or surely you'd have
> eliminated the bug), you have to assume that it's done some damage
> and is not something you can recover from.

Ok. So we are in agreement that in most cases, in all of the above
scenarios, it does not make sense to try to ignore an SEH and continue
*real* work.

> [I believe there are specific exceptions to this rule for certain
> kinds of floating point errors, *if* properly and very explicitly
> managed. See the _fpieee_flt documentation. Eric Niebler knows more
> about this than I do].

In fact there is a series of SEHs that look pretty "recoverable". Most of
them are related to some kind of arithmetic error (float or integer). If
the user is willing to take the risk of a wrong result, it seems perfectly
legitimate to continue.

> I can give an example, in fact. I've been using GNU emacs on NT from
> the CVS, and it's been crashing on me for months, intermittently. It
> crashes deep in emacs' win32-specific display code for BDF fonts.
> Well, as it turns out, I've never used a BDF font - I don't even know
> what one is. Somewhere the bit which says "this is a BDF font" gets
> incorrectly set, and the program goes careening off into never-never
> land, executing lots of code before it actually crashes.

So this is an example of a bad call, where the developer tried to recover
from an error incorrectly or, more probably, just forgot to perform a
required validation. IOW this is a different situation: the developer did
not ignore an SEH (she did not get one); she ignored a logical error in the
application, which led to a crash in some other place. That crash is
actually another error, because the code should have been safeguarded
there as well.

> Now, if
> I get this crash under the debugger I can force emacs to "unwind" out
> of the problem (by changing the PC), but things are unrecoverably
> messed up and the program just crashes again.

I agree with you that in a debugger, in many cases, the best possible way
to handle an SEH is to "freeze" immediately and let the developer analyze
the stack.

> Now, you might say, "why not just *try* to continue? You might get
> lucky". That's probably a bad strategy for a testing program, because
> you also might get _unlucky_. If a bug has caused an SEH (crash), the
> rest of the results are unreliable anyway.

I do not think that one should continue, either in production code or in a
test program, though in some recoverable cases it may be the user's call.
If you get a floating-point underflow, it may be good enough to fail the
current test case and continue with the rest of the testing.
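
To illustrate the "fail the current test case and continue" idea, here is a
minimal sketch. It is an assumption-laden stand-in, not Boost.Test code: it
polls the standard <cfenv> status flags after each case instead of trapping
a structured exception, and the names case_result, run_cases,
underflowing_case and clean_case are hypothetical.

```cpp
#include <cassert>
#include <cfenv>
#include <cfloat>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one test case outcome (not a Boost.Test type).
struct case_result {
    std::string name;
    bool passed;
};

using named_case = std::pair<std::string, std::function<void()>>;

// Runs each case, then polls the floating-point status flags. A case that
// raised underflow, overflow, divide-by-zero or invalid is marked failed,
// and the loop moves on to the next case.
std::vector<case_result> run_cases(const std::vector<named_case>& cases) {
    const int watched = FE_UNDERFLOW | FE_OVERFLOW | FE_DIVBYZERO | FE_INVALID;
    std::vector<case_result> results;
    for (const auto& c : cases) {
        std::feclearexcept(FE_ALL_EXCEPT);
        c.second();
        const bool fp_error = std::fetestexcept(watched) != 0;
        results.push_back({c.first, !fp_error});
    }
    return results;
}

// Sample cases. volatile keeps the compiler from folding the arithmetic
// at compile time, so the operation really happens at run time.
void underflowing_case() {
    volatile double tiny = DBL_MIN;
    volatile double r = tiny * tiny;  // far too small to represent
    (void)r;
}

void clean_case() {
    volatile double a = 1.5;
    volatile double r = a + a;  // exact result, no flag raised
    (void)r;
}
```

With these two cases, the underflowing one is reported as failed and the
clean one still runs and passes. Whether a wrong arithmetic result is
acceptable remains the user's call.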

> > 3. What is wrong with catching, at least for the purpose of reporting
> > and invoking the usual shutdown mechanisms?
>
> a. It causes unwinding code such as destructors to execute before the
> "usual shutdown mechanisms" get a chance to run. That might just
> cause another crash, which might even cause the shutdown mechanisms
> to be bypassed.

While this seems to be an issue for scenarios 1 and 3, for testing it may
not be such a big problem. After all, when you are running a regression
test suite, you don't really care about the actual point where it crashed,
while it may be quite useful to show extra log information about the error
location and, even more important, to show the results of the testing
completed up to that point. After all, the crash could have happened in
the 9th of 10 test cases. Why should we throw out the work done?
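
A rough sketch of that argument, under stated assumptions: the fatal
condition is modeled as a hypothetical fatal_error exception (standing in
for a translated SEH), and report, outcome and run_suite are illustrative
names, not the Boost.Test interface. The runner stops at the fatal case but
keeps everything collected before it.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical exception standing in for a translated SEH or fatal signal.
struct fatal_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class outcome { passed, failed, aborted };

struct report {
    std::vector<std::pair<std::string, outcome>> cases;
    bool stopped_early = false;
};

using test_case = std::pair<std::string, std::function<void()>>;

// Runs cases in order. An ordinary exception fails only the current case;
// a fatal_error ends the run, but the results gathered so far are returned.
report run_suite(const std::vector<test_case>& suite) {
    report r;
    for (const auto& tc : suite) {
        try {
            tc.second();
            r.cases.emplace_back(tc.first, outcome::passed);
        } catch (const fatal_error&) {
            r.cases.emplace_back(tc.first, outcome::aborted);
            r.stopped_early = true;
            break;  // do not trust anything that would run after this
        } catch (const std::exception&) {
            r.cases.emplace_back(tc.first, outcome::failed);
        }
    }
    return r;
}
```

The caller can print the report and then exit with a failure code, so the
outcome of the cases that ran before the crash is not lost.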

> b. Perhaps more importantly, it causes unwinding code such as
> destructors to execute before JIT debugging is invoked, which
> interferes with the programmers' ability to inspect the stack trace
> and program state in its condition at the point of the crash.

This is the case for scenario 1. For testing this is not true: we are not
going to invoke the debugger. For production code it is probably also
preferable to generate a core rather than try to invoke shutdown
procedures. But this is not definite. The user may have external knowledge
about the code being monitored that allows her to make the decision that
fits best. Moreover, in many cases, even if one wants to generate a core,
some release code may still need to be invoked, for example to free used
resources. Several times I have been hit by an application that does not
remove some kind of lock when it crashes and does not restart until I go
to some remote location and clean up some files.

> > 4. How does the technique described in Dave A.'s article help to
> > resolve the problem discussed in item 3?
>
> It causes JIT debugging to be invoked at the point of the crash,
> instead of in the outermost catch block which catches and rethrows the
> SEH.

It does help to force an immediate "freeze", but only if that is what we
want.

> > 5. What would be an ideal behavior?
>
> For people running batch regression tests, report the crash *at the
> point of the crash* (e.g. in the SE translator), possibly print a
> symbolic stack backtrace, and exit.

I forgot to mention that in some recoverable cases we may try to continue.
Also, with this approach you completely throw out the assembled results.
In most cases I am willing to take the risk and resort to the Boost.Test
shutdown procedures, which will show the results report. Though if one
wants it, the second option should also be available.

> For people debugging their programs, use the technique in my article
> and invoke JIT debugging at the point of the crash.

This is probably true. Though sometimes you just want to run the rest and
see the results first. Then, if you get a fatal error, switch to
non-catching mode and analyze the stack in a debugger.

For production code the user should be able to make her own decision on
how to deal with fatal errors in monitored code (though generating a core
should be the recommendation). In addition, it should be possible to
customize the immediate shutdown procedures so that required cleanup code
can be injected.
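
One possible shape for such customizable shutdown procedures is sketched
below. The shutdown_hooks class is hypothetical and not part of Boost.Test.
Keep in mind the objection raised earlier in the thread: after a real
crash, any code that runs, including these hooks, may itself fail, so the
hooks should be small and independent of each other.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical registry of user-supplied cleanup actions.
class shutdown_hooks {
public:
    void add(std::function<void()> hook) { hooks_.push_back(std::move(hook)); }

    // Runs hooks in reverse registration order. A hook that throws is
    // skipped so that the remaining hooks still get their chance.
    // Returns the number of hooks that completed.
    int run() noexcept {
        int completed = 0;
        for (auto it = hooks_.rbegin(); it != hooks_.rend(); ++it) {
            try {
                (*it)();
                ++completed;
            } catch (...) {
                // Ignore: the process is going down anyway.
            }
        }
        hooks_.clear();
        return completed;
    }

private:
    std::vector<std::function<void()>> hooks_;
};
```

A user could register a hook that removes a lock file, call run() from the
fatal-error path, and then call std::abort() so that a core is still
generated.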

> --
> Dave Abrahams

Gennadiy.

P.S. I want to emphasize that this discussion should also be applicable to
fatal signals caught in a signal-capable environment.
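
For the signal case, a minimal POSIX-only sketch of catching a fatal signal
around a monitored function follows. It assumes sigaction, sigsetjmp and
siglongjmp are available; run_monitored and on_fatal_signal are
hypothetical names. Jumping out of the handler skips the destructors of
every frame in between, so this is suitable only for reporting and shutting
down, not for resuming real work.

```cpp
#include <cassert>
#include <setjmp.h>  // sigsetjmp, siglongjmp (POSIX)
#include <signal.h>  // sigaction, raise (POSIX)

static sigjmp_buf g_resume_point;
static volatile sig_atomic_t g_caught_signal = 0;

// Records the signal number and jumps back into run_monitored.
extern "C" void on_fatal_signal(int sig) {
    g_caught_signal = sig;
    siglongjmp(g_resume_point, 1);
}

// Returns 0 if fn returned normally, otherwise the caught signal number.
int run_monitored(void (*fn)()) {
    struct sigaction action = {};
    action.sa_handler = on_fatal_signal;
    sigemptyset(&action.sa_mask);

    struct sigaction old_fpe, old_segv;
    sigaction(SIGFPE, &action, &old_fpe);
    sigaction(SIGSEGV, &action, &old_segv);

    g_caught_signal = 0;
    // Second argument 1: save the signal mask so siglongjmp restores it.
    if (sigsetjmp(g_resume_point, 1) == 0) {
        fn();
    }

    // Put the previous handlers back before returning to the caller.
    sigaction(SIGFPE, &old_fpe, nullptr);
    sigaction(SIGSEGV, &old_segv, nullptr);
    return g_caught_signal;
}
```

Calling run_monitored with a function that does raise(SIGFPE) returns
SIGFPE to the caller, which can then report and run its shutdown
procedures.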


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk