What does ‘That’s not a credible failure mode.’ really mean?

27/10/2012

The following is an extract from Kevin Driscoll’s Murphy Was an Optimist presentation at SAFECOMP 2010. Here Kevin does the maths to show how a lack of exposure to failures over a small sample of operating hours leads to a normalcy bias amongst designers and a rejection of proposed failure modes as ‘not credible’. The reason I find it of especial interest is that it gives, at least in part, an empirical argument as to why designers find it difficult to anticipate the system accidents of Charles Perrow’s Normal Accident Theory. Kevin’s argument also supports John Downer’s (2010) concept of epistemic accidents. John defines epistemic accidents as those that occur because of an erroneous technological assumption, even though there were good reasons to hold that assumption before the accident. Kevin’s argument illustrates that engineers, as technological actors, must make decisions in which their knowledge is inherently limited, and so their design choices will exhibit bounded rationality.

In effect, the higher the dependability of a system, the greater the mismatch between designer experience and system operational hours, and therefore the tighter the bounds on the rationality of design choices and their underpinning assumptions. The tighter the bounds, the greater the effect cognitive biases will have, such as falling prey to the normalcy bias. Of course there are other reasons for such bounded rationality; see Logic, Mathematics and Science are Not Enough for a discussion of these.

[Start] A typical designer’s total hands-on system experience time is almost non-existent compared to typical system requirements and fleet exposures.

  • Safety-critical systems usually require the probability of failure to be less than 1/10^7 to 1/10^9 for a one-hour exposure (= 10^7 to 10^9 hours MTBF, sort of)
  • A typical designer (20 years into a 40 year career)
    • Has less than 5,000 hours of real hands-on system experience… and almost none of this is in the system’s real environment (When was the last time you saw a designer riding in an electronics bay of an aircraft?)
    • Sees a negligible number of failed units returned from the field:
      • 90% are transient and not returned: retest OK (RTOK), cannot duplicate (CND)
      • Most units are repaired (or scrapped) without being returned to the manufacturer
      • Manufacturers’ failure analysis teams are not the designers

So, when a designer says that the failure can’t happen, this means that it hasn’t been seen in less than 5,000 hours of observation:

  • But … 5,000 is minuscule compared to 10,000,000 or 1,000,000,000. And, when compared to total fleet exposure times, this is less than:
    • A few days of flight time for any popular aircraft type (e.g. B737, A320)
    • One day of drive time for any popular automobile type

We cannot rely on our experience-based intuition to determine whether a failure can happen within required probability limits. [End]
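
To make the scale of the mismatch concrete, here is a minimal back-of-envelope sketch in Python of the arithmetic behind Kevin’s point. The constant-rate (exponential) failure model, the fleet size and the daily utilisation figures are my own illustrative assumptions, not numbers from the presentation.

    import math

    # Driscoll's figures: a designer's hands-on exposure vs. required failure rates.
    designer_hours = 5_000          # hands-on system experience (roughly half a career)
    failure_rates = [1e-7, 1e-9]    # required probability of failure per hour of exposure

    # Assuming a constant-rate (exponential) failure model, the chance that the
    # designer ever witnesses such a failure during their hands-on hours is:
    for rate in failure_rates:
        p_seen = 1 - math.exp(-rate * designer_hours)
        print(f"P(see a {rate:.0e}/hr failure in {designer_hours:,} hrs) = {p_seen:.1e}")

    # Fleet comparison: how quickly a popular aircraft fleet accumulates the same
    # exposure. Fleet size and daily utilisation are illustrative assumptions.
    fleet_size = 5_000                 # aircraft of one popular type in service (assumed)
    hours_per_aircraft_per_day = 8     # average daily utilisation (assumed)

    fleet_hours_per_day = fleet_size * hours_per_aircraft_per_day
    days_to_match = designer_hours / fleet_hours_per_day
    print(f"Fleet logs {fleet_hours_per_day:,} hrs/day; the designer's entire "
          f"hands-on exposure is matched in {days_to_match:.2f} days")

Under these assumptions, even against the stringent 10^-9 per hour requirement a whole career’s worth of hands-on time gives only about a five-in-a-million chance of ever witnessing the failure, which is why ‘I’ve never seen it happen’ carries essentially no evidential weight at these probability levels.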

References

Driscoll, K., Murphy Was an Optimist, SAFECOMP 2010.

Downer, J., Anatomy of a Disaster: Why Some Accidents are Unavoidable, March 2010.
