From: David Abrahams (dave_at_[hidden])
Date: 2003-09-26 07:40:38


"Gennadiy Rozental" <gennadiy.rozental_at_[hidden]> writes:

>> [I believe there are specific exceptions to this rule for certain
>> kinds of floating point errors, *if* properly and very explicitly
>> managed. See the _fpieee_flt documentation. Eric Niebler knows more
>> about this than I do].
>
> In fact there is a series of SEHs that look pretty "recoverable". Most of
> them are related to some kind of arithmetic error (float or integer). If
> the user is willing to take the risk of a wrong result, it seems perfectly
> legitimate to continue.

Careful. That can only work if you have continuation-model EH. If
you have termination-model EH (like in C++), there will be trouble if
the arithmetic error occurs in a region that's expected to be
non-throwing. Furthermore, because of FP pipelining, FP exceptions
may occur some random number of instructions after the actual error,
so controlling where the "recoverable" exceptions arise is well-nigh
impossible.

<snip>

>> Now, you might say, "why not just *try* to continue? You might get
>> lucky". That's probably a bad strategy for a testing program, because
>> you also might get _unlucky_. If a bug has caused an SEH (crash), the
>> rest of the results are unreliable anyway.
>
> I do not think that one should continue, either in production code
> or in a test program. Though in some recoverable cases it may be the
> user's call. If you got a floating-point underflow, it may be good
> enough to fail the current test case and continue with the rest of
> the testing.

Only if you can be sure that the underflow doesn't represent some
deeper logical incoherency in the program, and only if you can turn
the result into a NaN (or something) and continue. If you unwind
inappropriately it could be a disaster.
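
As a rough illustration of "turn the result into a NaN and continue": the sketch below does not use SEH or trapping at all. It polls the floating-point status flags from <cfenv> after the operation. The helper name multiply_or_nan is hypothetical, and the sketch assumes the compiler honors floating-point status flags around the volatile temporaries, which is not guaranteed on every toolchain.

```cpp
#include <cfenv>
#include <cmath>
#include <limits>

// Hypothetical helper (not from this thread): multiply two doubles and
// return a quiet NaN if the multiplication raised FE_UNDERFLOW.
// Assumption: the compiler respects floating-point status flags here; the
// volatile temporaries keep the multiply from being folded at compile time.
double multiply_or_nan(double a, double b)
{
    std::feclearexcept(FE_ALL_EXCEPT);   // start from a clean flag state
    volatile double x = a;
    volatile double y = b;
    volatile double product = x * y;     // may set FE_UNDERFLOW
    if (std::fetestexcept(FE_UNDERFLOW)) {
        std::feclearexcept(FE_UNDERFLOW);
        return std::numeric_limits<double>::quiet_NaN();
    }
    return product;
}
```

Nothing is thrown and nothing unwinds, so the caller decides locally what a NaN means, for instance by failing only the current test case.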

>> > 3. What is wrong with catching? At least for the purpose of
>> > reporting and invoking the usual shutdown mechanisms
>>
>> a. It causes unwinding code such as destructors to execute before the
>> "usual shutdown mechanisms" get a chance to run. That might just
>> cause another crash, which might even cause the shutdown mechanisms
>> to be bypassed.
>
> While this seems to be an issue for scenarios 1 and 3, for testing it may
> not be such a big problem. After all, when you are running a regression test
> suite, you don't really care about the actual point where it
> crashed,

I disagree. The next thing that usually happens after a crash during
testing is debugging. In fact, if the developer doesn't have access
to the platform on which it crashed, the actual point where it
crashed might be crucial information.

> while it may be quite useful to try to show extra log information about
> the error location and, even more important, to show the results of the
> testing completed by this point. After all, the crash could have happened
> in the 9th of 10 test cases. Why should we throw out the work done?

Who said anything about throwing out the work done? I was just
saying you probably shouldn't try to do any *more* work.

>> b. Perhaps more importantly, it causes unwinding code such as
>> destructors to execute before JIT debugging is invoked, which
>> interferes with the programmers' ability to inspect the stack trace
>> and program state in its condition at the point of the crash.
>
> This is the case for scenario 1. For testing this is not true - we are not
> going to invoke the debugger.

But hopefully you're going to report *some* useful information about
the crash, and unwinding can easily interfere with your ability to do
so.

> For production code it is probably also preferable to generate a core
> rather than try to invoke shutdown procedures.

Often it's better to do some shutdown (like saving an intermediate
work recovery file) before dumping core.

> But this is not definite. The user may have external knowledge about the
> code being monitored that allows one to make the decision that fits
> best. Moreover, in many cases, even if one wants to generate a core,
> some release code may still need to be invoked, for example to free
> used resources. Several times I was hit by an application that does
> not remove some kind of lock when it crashes and does not restart until
> I go to some remote location and clean up some files.

There are better ways to handle that than by doing it with
unwinding. You can keep a record of the extant locks and explicitly
release them as part of the shutdown procedure.
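
A minimal sketch of what such a record might look like, assuming the locks are lock files identified by path. The lock_registry namespace and its function names are hypothetical, not part of any library. The registry is a fixed-size static table, so the shutdown procedure can walk it without allocating or relying on any destructor having run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical registry of lock files currently held by the process.
namespace lock_registry {

const std::size_t max_locks = 16;
const char* held[max_locks] = {};   // a null entry means "slot free"

// Record a lock file as held. The string must outlive the registration.
bool note_acquired(const char* path)
{
    assert(path != nullptr);
    for (std::size_t i = 0; i < max_locks; ++i) {
        if (held[i] == nullptr) { held[i] = path; return true; }
    }
    return false;                   // registry full
}

// Forget a lock that was released through the normal code path.
void note_released(const char* path)
{
    assert(path != nullptr);
    for (std::size_t i = 0; i < max_locks; ++i) {
        if (held[i] != nullptr && std::strcmp(held[i], path) == 0) {
            held[i] = nullptr;
            return;
        }
    }
}

// Called explicitly from the shutdown procedure: delete every lock file
// still on record. Returns how many files were actually removed.
std::size_t release_all()
{
    std::size_t removed = 0;
    for (std::size_t i = 0; i < max_locks; ++i) {
        if (held[i] != nullptr) {
            if (std::remove(held[i]) == 0) ++removed;
            held[i] = nullptr;
        }
    }
    return removed;
}

} // namespace lock_registry
```

The point of the design is that release_all() is an ordinary function call made by the shutdown code, so cleaning up the locks does not depend on unwinding the stack of a program that has already crashed.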

>> > 4. How does the technique described in Dave A.'s article help to
>> > resolve the problem discussed in item 3?
>>
>> It causes JIT debugging to be invoked at the point of the crash,
>> instead of in the outermost catch block which catches and rethrows
>> the SEH.
>
> It does help to force an immediate "freeze". But only if we want it.
>
>> > 5. What would be an ideal behavior?
>>
>> For people running batch regression tests, report the crash *at the
>> point of the crash* (e.g. in the SE translator), possibly print a
>> symbolic stack backtrace, and exit.
>
> I forgot to mention that in some recoverable cases we may try to
> continue.

You have not done any of the required legwork to make sure that you
actually have recoverability from SEHes, and you can't -- it requires
explicit and painstaking cooperation from the program dropped into
your testing framework.

> Also you completely throw out the assembled results.

Why do you keep saying that?

> In most cases I am willing to take the risk and resort to the Boost.Test
> shutdown procedures, which will show the results report.

There's no reason you can't show the results report without unwinding.
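
Roughly, and only as a sketch: the runner can keep its tallies in plain data, print them from wherever the fatal error is detected, and leave through std::_Exit, which terminates without running destructors or atexit handlers. The names test_results, format_report and report_and_exit are hypothetical and are not Boost.Test APIs. A real SE translator or signal handler is far more restricted in what it may safely call than this sketch suggests.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical tally kept by a test runner as plain data.
struct test_results { int passed; int failed; };

// Format the report into a caller-supplied buffer (no allocation).
// Returns the snprintf result: the number of characters the full
// report needs, excluding the terminating null.
int format_report(const test_results& r, const char* crashed_in,
                  char* out, std::size_t size)
{
    return std::snprintf(out, size,
                         "passed: %d, failed: %d, fatal error in: %s\n",
                         r.passed, r.failed, crashed_in);
}

// Print the report at the point where the fatal error was detected,
// then terminate immediately. std::_Exit does not unwind the stack.
[[noreturn]] void report_and_exit(const test_results& r,
                                  const char* crashed_in)
{
    char line[256];
    format_report(r, crashed_in, line, sizeof line);
    std::fputs(line, stderr);
    std::fflush(stderr);
    std::_Exit(EXIT_FAILURE);
}
```

Calling report_and_exit(results, "test_case_9") from the point of failure prints the work done so far and stops, without attempting any more of it.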

> Though if one wants it, the second option should also be available.

You seem convinced that trying to recover from SEHes is a reasonable
default behavior. I think you've got your defaults backwards, but I
think I've spent too many keystrokes writing about this subject over
the years and if I haven't convinced you now, I think I'll just stop.

>> For people debugging their programs, use the technique in my article
>> and invoke JIT debugging at the point of the crash.
>
> This is probably true. Though sometimes you just want to run the rest and
> see the results first. Then, if you got a fatal error, switch to non-catching
> mode and analyze the stack in a debugger.
>
> For production code the user should be able to make her own decision about
> how to deal with fatal errors in monitored code (though it should be
> recommended to generate a core).

Not before trying to save a copy of the work done so far.

> In addition, it should be possible to customize the immediate shutdown
> procedures so that the required cleanup code can be injected.
>
>> --
>> Dave Abrahams
>
> Gennadiy.
>
> P.S. I want to emphasize that this discussion should also be applicable to
> fatal signals caught in a signal-capable environment.

I don't believe so. Signals are truly asynchronous and can occur during
nothrow regions such as destructors. Nothing warrants unwinding at
that point.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
