Software failure (and reliability)


For those of you interested in such things, there’s an interesting thread running over on the Safety Critical Mail List at Bielefeld on software failure, sparked off by Peter Ladkin’s post over on Abnormal Distribution on the same subject. Whether software can be said to fail, and whether you can use the term reliability to describe it, is one of those strange attractors about which the list tends to orbit. An interesting discussion, although at times I did think we were playing a variant of Wittgenstein’s definition game.

And my opinion? Glad you asked.

Yes, of course software fails. That its failure is not the same as the pseudo-random failure that we attribute to hardware components is neither here nor there.

Compare and contrast. In the Ariane 501 accident the mission software received an unexpected input (trajectory) that caused it to fail (operand error), leading in turn to the loss of the vehicle. Now turn to the Challenger accident. Here the SRB flexible joint design was exposed to an unexpected input (low temperature on launch) that caused it to fail (allowing exhaust gas to blow through the joint). In neither case was the cause some inherent internal pseudo-random failure process, but rather the interaction of the design with its environment. No one I know would dispute that the shuttle flexible joint design failed, so why would one dispute that the Ariane 501 software likewise failed?

So can we apply reliability techniques to software, in the same way that we do to hardware?

OK, so here’s a dirty little secret of reliability engineering. Hardware reliability engineering for anything other than simple mechanisms really doesn’t work. In essence we make a series of simplifying assumptions about the environment, failure behaviour and failure causes (predominantly of simple components), all so we can get to a random failure rate and simplify our analytical life. The ‘reliability data’ we base this upon is of dubious validity and uncertain applicability. Out of said simple components and their attached failure data we then build a model (fault tree, Markov, FMEA, you pick), which again has a series of assumptions built in. Then we crank the handle of our analytical engine and voila, we get a number. If the claimed level of reliability is sufficiently high there’s no way we can demonstrate the validity of such a number before we field our system, at which point the question becomes somewhat academic. One can, as reliability engineers like to joke, fit any curve with log/log graph paper and a magic marker. 🙂
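To put a rough number on the "no way we can demonstrate it" point, here is a minimal sketch. It assumes the textbook constant failure rate (exponential) model, which is itself one of the simplifying assumptions criticised above, and asks how many failure-free test hours you would need before you could claim a given failure rate at a given confidence. The function name and the example rates are mine, chosen purely for illustration.

```python
import math


def zero_failure_test_hours(failure_rate_per_hour: float, confidence: float) -> float:
    """Failure-free test hours needed to claim a failure rate at a given confidence.

    Assumes a constant failure rate, so the chance of surviving T hours with
    no failure is exp(-rate * T). We need that chance to be no more than
    (1 - confidence), which gives T >= -ln(1 - confidence) / rate.
    """
    if failure_rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / failure_rate_per_hour


HOURS_PER_YEAR = 24 * 365

# Illustrative (hypothetical) target rates, from modest to ultra-high reliability.
for rate in (1e-3, 1e-6, 1e-9):
    hours = zero_failure_test_hours(rate, confidence=0.90)
    years = hours / HOURS_PER_YEAR
    print(f"{rate:.0e} per hour: {hours:.2e} failure-free hours (about {years:,.0f} unit-years)")
```

For a target of one failure in a billion hours at 90% confidence, the answer comes out at roughly 2.3 billion failure-free hours. You can spread that across many test units, but only by assuming they all behave identically and independently, which is yet another assumption to defend.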

The Can Do Principle bids the individual cognitive agent to try to tailor the advancement of his agendas to principles he is already at home with and to problem-solving techniques over which he already has attained a certain mastery.

J. Woods, Eight Theses Reflecting on Toulmin, Chapter 25

The problem as I see it lies in what Woods calls the Can-Do/Make-Do trap, which tempts analysts to always take the tools that they are familiar with and bend them to the problem at hand. Sometimes this works, and sometimes that bending distorts the tools' conceptual integrity to the point of breaking. In the case at hand, that’s taking reliability analysis techniques that were just adequate to analyse the performance of the simple, small-scale systems of the 1950s and applying them to the large-scale complex systems of the current era. We have, it seems, slid from ‘Can Do’ to ‘Make Do’ all too easily. Woods also makes the point that while we should show caution in such endeavours, it’s difficult to do so from inside the program, so to speak, especially as such models have a terribly seductive ‘truth in the model’ normative appeal.

Does this mean that we should throw out all methods based on the use of statistical methods?

My answer to that is hell no. You work with the information that you have, and in some circumstances it may be useful to apply such techniques to the behaviour of a system or its components, be they hardware or software. However, if and when you do so you should remember that ‘can do’ can all too easily become ‘make do’, with all the attendant hazards. Knowing your statistical onions must therefore be a firm prerequisite to the application of such techniques. Unfortunately, in my experience most engineers, and even a good proportion of the reliability and safety community, do not qualify.