Simply put, it is possible to have convenience if you want to tolerate insecurity, but if you want security, you must be prepared for inconvenience.

Gen. Benjamin Chidlaw (1954)

It is highly questionable whether total system safety is always enhanced by allocating functions to automatic devices rather than human operators, and there is some reason to believe that flight-deck automation may have already passed its optimum point.

Earl Wiener (1980)

If you want to know where Crew Resource Management as a discipline started, you need to read NASA Technical Memorandum 78482, “A Simulator Study of the Interaction of Pilot Workload With Errors, Vigilance, and Decisions”, by H.P. Ruffell Smith, the British-born physician and pilot. Before this study it was hours in the seat and line seniority that mattered when things went to hell. After it, the aviation industry started to realise that crews rose or fell on the basis of how well they worked together, and that a good captain got the best out of his team. Today, whether crews get it right, as they did on QF72, or terribly wrong, as they did on AF447, the lens through which we view their performance has been irrevocably shaped by the work of Ruffell Smith. From little seeds great oaks grow indeed.

When you look at the safety performance of industries with a consistent focus on safety as part of their social permit to operate, nuclear power and aviation being the canonical examples, you see that over time the gains in safety tend to plateau. This looks like some form of learning curve, but what is the mechanism, or mechanisms, that actually drives the process?

There are actually two factors at play here: firstly, the increasing marginal cost of improvement, and secondly, the problem of learning from rare events. In the first case, increasing marginal cost is simply an economist’s way of saying that the next increment of performance costs more to achieve than the last. For example, airbags are more expensive than seat belts by roughly an order of magnitude (based on replacement costs), yet airbags deliver only an 8% reduction in mortality when used in conjunction with seat belts, see Crandall (2001). As a result each successive increment of safety takes longer and costs more (1).
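To make the diminishing returns concrete, here’s a rough back-of-the-envelope sketch. The dollar figures are placeholders chosen only to reflect the order-of-magnitude cost gap noted above, and the 45% figure for belts alone is a commonly cited estimate rather than something taken from Crandall (2001); only the incremental 8% figure comes from that study.

```python
# Back-of-the-envelope cost per unit of mortality reduction.
# Cost figures are placeholders reflecting the rough order-of-magnitude
# gap mentioned in the text; 45% for belts alone is an assumed,
# commonly cited figure, while the incremental 8% for airbags
# (given belts) is from Crandall et al. (2001).

seat_belt_cost = 100.0        # assumed replacement cost ($)
airbag_cost = 1_000.0         # assumed replacement cost, ~10x the belt ($)

seat_belt_benefit = 45.0      # % mortality reduction, belts alone (assumed)
airbag_extra_benefit = 8.0    # additional % reduction, airbags given belts

print(f"Seat belt: ${seat_belt_cost / seat_belt_benefit:,.0f} "
      f"per percentage point of mortality reduction")
print(f"Airbag:    ${airbag_cost / airbag_extra_benefit:,.0f} "
      f"per percentage point of mortality reduction")
# The second increment of safety costs far more per unit of benefit,
# which is the increasing-marginal-cost effect described above.
```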

The second factor is a more subtle version of the first. As we reduce accident rates, accidents become rarer. One of the traditional ways in which safety improves is by studying accidents when they occur and then eliminating or mitigating the identified causal factors. Obviously, as the accident rate decreases, so too does the opportunity for improvement. When accidents do occur we have a further problem, because (definitionally) the cause will comprise a highly unlikely combination of factors, the combination needed to defeat the existing safety measures. Corrective actions for such rare combinations of events are therefore highly specific to the context of that event and conversely have far less universal applicability. For example, the lessons about metal fatigue learned from the Comet airliner disaster have had universal applicability to every aircraft design since. But the QF72 automation upset off Learmonth? Those lessons, relating to the specific fault-tolerance architecture of the A330, are much harder to generalise and therefore have less epistemic strength.

In summary, not only does each increment of safety cost more than the last, but our opportunity to learn from accidents steadily diminishes as both their arrival rate and their individual epistemic value (2) decline.

Notes

1. In some circumstances we may also introduce other risks, see for example the deaths and severe injuries caused to small children by air bag deployments.

2. In a Popperian sense.

References

1. Crandall, C.S., Olson, L.M., Sklar, D.P., Mortality Reduction with Air Bag and Seat Belt Use in Head-on Passenger Car Collisions, American Journal of Epidemiology, Volume 153, Issue 3, 1 February 2001, Pages 219–224, https://doi.org/10.1093/aje/153.3.219.

For a long while, possibly forever, the assertion that it is impossible to test for the reliability of a system meeting the mythical 10^-9 failures per hour has dominated thinking about high integrity systems. But maybe this is just not true. You see, this statement relies on a critical assumption about the distribution of failure inter-arrival times. Traditional reliability theory, from which this idea comes, is based on the assumption of a memoryless process in which the likelihood of failure is constant. However this is just an assumption, and we have no a priori reason to believe it is true. Yes, it makes the maths easier, but so what? That doesn’t make it actually ‘true’, even if it is convenient.
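To see why the traditional position follows from that assumption, here is a sketch of the textbook zero-failure test argument under a constant failure rate (this is the standard reasoning, not any particular certification calculation):

```python
import math

# Under the memoryless (exponential) assumption the probability of
# surviving t hours of failure-free testing is exp(-rate * t).
# To claim rate <= target at a given confidence from a zero-failure
# test you need t >= -ln(1 - confidence) / target.

target_rate = 1e-9      # failures per hour (the 'mythical' 10^-9)
confidence = 0.99       # confidence required in the claim

required_hours = -math.log(1 - confidence) / target_rate
print(f"Failure-free test time required: {required_hours:.2e} hours "
      f"({required_hours / 8760:.0f} years of continuous testing)")
# ~4.6e9 hours, i.e. over half a million years of testing -- the usual
# basis for saying such rates cannot be demonstrated by test.
```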

What if instead we posit that the distribution of the inter-arrival times of failures is actually heavy tailed? Then as time passes failure becomes less likely, and instead of the hazard rate remaining constant (or even increasing) as our exposure increases, it actually decreases. The canonical example is waiting for a response to an email: a reply is quite likely in the first minutes and hours, but as time passes it becomes less and less likely. Train delays are similar: we expect a train to arrive within a couple of minutes, but the longer the wait drags on, the less likely its imminent arrival becomes. Technically this is called a heavy tail explosion, and as a model it may well fit the occurrence distribution of latent design faults better.
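As a sketch of what a decreasing hazard looks like, the snippet below uses the Pareto purely as an illustrative heavy-tailed distribution; the choice of distribution and parameters is mine, not something argued for above.

```python
# Hazard rate h(t) = f(t) / S(t): the instantaneous likelihood of failure
# given survival to time t. For the exponential it is constant; for a
# heavy-tailed Pareto it falls away as survival time grows.

def exponential_hazard(t, rate):
    return rate                      # memoryless: constant hazard

def pareto_hazard(t, alpha, x_min=1.0):
    # Pareto (Type I) with shape alpha, valid for t >= x_min:
    # h(t) = f(t)/S(t) = alpha / t, which decreases in t.
    return alpha / t

rate, alpha = 1e-3, 1.5
for t in (1.0, 10.0, 100.0, 1000.0):
    print(f"t={t:7.0f}h  exponential h(t)={exponential_hazard(t, rate):.4g}"
          f"  Pareto h(t)={pareto_hazard(t, alpha):.4g}")
```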

If failure inter-arrival times (we’re focusing on design faults in particular) follow a heavy tailed distribution, then guess what? It turns out that testing could actually be competitive with other assurance strategies in terms of cost per pound of assurance. This of course also throws into doubt a major justification for the integrity-level, design-assurance industry. All because of a simple assumption about failure distributions.
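And a sketch of why that matters for testing: under the same illustrative Pareto model, every additional hour of failure-free exposure makes near-term failure look less likely, whereas under the memoryless model past testing tells you nothing about the future. Again, the distribution and parameters here are assumptions made for illustration only.

```python
import math

# Probability of failing in the next 1,000 hours given failure-free
# operation up to time t, under the two competing assumptions.

def exp_cond_fail(t, window, rate):
    # Memoryless: past survival tells you nothing, so this is constant.
    return 1 - math.exp(-rate * window)

def pareto_cond_fail(t, window, alpha):
    # Pareto Type I: P(T <= t + window | T > t) = 1 - (t / (t + window))**alpha.
    return 1 - (t / (t + window)) ** alpha

rate, alpha, window = 1e-3, 1.5, 1000.0
for t in (1e3, 1e4, 1e5, 1e6):
    print(f"survived {t:>9.0f}h: "
          f"exponential {exp_cond_fail(t, window, rate):.3f}  "
          f"heavy-tailed {pareto_cond_fail(t, window, alpha):.3f}")
# Under the heavy-tailed model each additional hour of failure-free
# testing makes near-term failure less likely, so test evidence keeps
# buying assurance instead of flat-lining.
```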

AI winter

Update to the MH-370 hidden lesson post just published, in which I go into a little more detail on what I think could be done to prevent another such tragedy.