[FoRK] How Complex Systems Fail
Ken Ganshirt @ Yahoo
ken_ganshirt at yahoo.ca
Fri Nov 6 15:59:18 PST 2009
--- On Fri, 11/6/09, Bill Stoddard <wgstoddard at gmail.com> wrote:
> "7) Post-accident attribution to a ‘root
> cause’ is fundamentally wrong. "
> I have a problem with this assertion. It could be that a
> 'root cause' analysis is inconclusive, but to simply wave a
> hand and say 'this is a complex system so we're not going to
> bother trying to understand what happened' is fundamentally
> wrong. Maybe I misunderstand the meaning of 'root cause' in
> this assertion?
Hmmm... Did we read the same paper?
He did not say nor imply anything like that in the paper.
The implication is that, in complex system "accidents", the usual witch hunt for a single "root cause" of the outage is doomed to failure, by definition. So if the post hoc analysis team is genuinely interested in finding out what really happened, they should start the analysis by assuming that a single "root cause" is just a red herring and that the outage will be the result of multiple overlapping failures.
Perhaps you do misunderstand "root cause". Here's Wikipedia's intro:
"Root cause analysis (RCA) is a class of problem solving methods aimed at identifying the root causes of problems or events. The practice of RCA is predicated on the belief that problems are best solved by attempting to correct or eliminate root causes, as opposed to merely addressing the immediately obvious symptoms. By directing corrective measures at root causes, it is hoped that the likelihood of problem recurrence will be minimized. However, it is recognized that complete prevention of recurrence by a single intervention is not always possible. Thus, RCA is often considered to be an iterative process, and is frequently viewed as a tool of continuous improvement."
Note the second and third statements. The implication, and I can confirm from over thirty years of professional experience that it is also the actual practice, is that there can be a single cause at the "root" of the failure event, and that you can hunt it down and take corrective action on that cause.
The good Doctor's point is that this is not only a waste of time, it's harmful, because there's never a single cause. There might have been some sort of trigger event, but the critical failure that was triggered was a series of failures, some of which were already occurring at the time and some of which were allowed to happen because of the trigger action.
The point is that identifying and stopping that particular trigger action does not reduce the probability of a critical failure occurring again, because it does nothing to reduce or eliminate all those existing and potential failure points in the system. And if the wrong remediation action is taken, it increases rather than decreases the probability of future critical failure.
In addition to the usefulness of the individual points, their totality creates a context. It was not a context of blame-shifting. It was one of trying to understand how complex systems fail and, in so understanding, to be better able to properly support such systems and those responsible for operating them.
That's what I read, anyway.