[FoRK] A thought on Design and Quality

rst at ai.mit.edu
Wed Oct 7 06:14:45 PDT 2009


Dave Long writes:
 > On the other hand, what about a system that would reboot quickly (and  
 > persistently?) enough that rebooting could be the default error  
 > recovery?  (I've even had occasion, on AVR, to have code where  
 > rebooting was part of the normal control flow.  Never resetting the  
 > watchdog timer allows one to have confidence in the recovery  
 > path...)  Has the recovery-oriented computing crowd come up with  
 > anything recently?


Google the phrase "crash-only software", for a fairly recent Stanford
research project.  Part of the rationale, IIRC, is that the way most
stuff is built, you need to provide for recovery from unexpected
crashes, but that's a complicated code path which is rarely exercised,
to the point that you may not be able to trust it when you need it.
Making it the ordinary restart sequence (since a hard crash is the
*only* way to shut it down!) gives it a lot more exercise.
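A minimal Python sketch of the idea follows. It assumes a toy counter that persists every update to an append-only log. The class name `CrashOnlyCounter` and its JSON-lines log format are hypothetical and are not taken from the Stanford work. There is deliberately no shutdown method, so the only way to stop it is to kill the process, and every start runs the same recovery code.

```python
import json
import os


class CrashOnlyCounter:
    """Hypothetical crash-only store: starting up *is* recovery,
    and there is no clean-shutdown path at all."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.counts = {}
        self._recover()  # runs on every start, not just after a crash
        self._log = open(log_path, "a", encoding="utf-8")

    def _apply(self, rec):
        self.counts[rec["key"]] = self.counts.get(rec["key"], 0) + rec["delta"]

    def _recover(self):
        # Replay the append-only log up to the first torn or corrupt record.
        if not os.path.exists(self.log_path):
            return
        good = 0  # byte length of the valid prefix
        with open(self.log_path, "rb") as f:
            for line in f:
                if not line.endswith(b"\n"):
                    break  # write was cut off mid-record by a crash
                try:
                    rec = json.loads(line)
                except ValueError:
                    break  # corrupt record: keep only what came before it
                self._apply(rec)
                good += len(line)
        # Drop the torn tail so new appends start on a clean line.
        with open(self.log_path, "r+b") as f:
            f.truncate(good)

    def incr(self, key, delta=1):
        # Make the record durable before acknowledging the update.
        self._log.write(json.dumps({"key": key, "delta": delta}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._apply({"key": key, "delta": delta})
        return self.counts[key]
```

Because the recovery code runs on every ordinary start, a bug in it surfaces the first time the program is restarted rather than after some rare, unplanned crash.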

Though there were, of course, software packages written this way
before the Stanford guys.  The most famous example might be the
control software for the fly-by-wire Apollo LEM.  The infamous 1202
and 1201 alarms that it was spitting up all the way to touchdown on
the first lunar landing were both variations on the theme "internal
overflow --- rebooting now".  The software was crashing continually,
but a quick built-in reboot sequence got it running again before it
could crash the hardware.

rst

