The continuum of uncertainty
What do an eighteenth-century mathematician and a twentieth-century US Secretary of Defense have to do with engineering and risk? The answer is that both thought about uncertainty and risk, and the differing definitions they arrived at neatly illustrate that there is more to the concept of risk than just likelihood multiplied by consequence, which in turn has significant implications for engineering risk management.
Editorial note: I've pretty much completely revised this post since the original; I hope you like it.
The concept of risk allows us to make decisions in an uncertain world where we cannot perfectly predict future outcomes. Yet when we talk about risk, is it always the same thing? Consider asking someone to predict the outcome of a coin toss: a natural first response is to say that heads and tails are equally likely. This interpretation of uncertainty as probability is certainly the one that de Moivre made in defining his famous risk equation. But what if the coin has been biased, or is thrown where we can't see it? What can we really say about the probability in those circumstances? In fact our initial response rested on an assumption about the world, which may or may not be valid; but valid or not, we are forced to make an assumption because that is where our knowledge stops.
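As a minimal sketch, de Moivre's formulation treats risk as expected loss, likelihood multiplied by consequence. The hazard names and figures below are invented purely for illustration:

```python
# De Moivre's classic formulation: risk as likelihood multiplied by consequence.
# The hazard names, probabilities and costs below are purely illustrative.
def risk(likelihood, consequence):
    """Expected loss per unit of exposure."""
    return likelihood * consequence

hazards = {
    "component failure": (1e-3, 1_000_000),  # (probability per year, cost of loss)
    "operator error":    (1e-2, 50_000),
}
for name, (p, c) in hazards.items():
    print(f"{name}: expected loss = {risk(p, c):.0f} per year")
```

The calculation is trivial, which is rather the point: everything interesting about risk lies in how defensible the likelihood figure is, which is what the rest of this post is about.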
Which leads us to the point that Secretary Rumsfeld was making: that uncertainty is a continuum, ranging from that which we are certain of at one end through to that which we are completely ignorant of at the other. For the purposes of discussion we can usefully divide this continuum into three broad categories: aleatory, epistemic and ontological uncertainty.
At one end of the spectrum of knowing we have those things that we know, and are confident that we know. For example, there might be some parameter of a process that we are interested in; we collect the evidence and on that basis are confident that we know 'all' about it. But our process may not be deterministic: we may in fact be looking at a stochastic or random process, where there is some irresolvable indeterminacy about how the process will evolve over time. We can still collect data about this randomness, or aleatory (statistical) uncertainty, and express it as a probability distribution and its moments.
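Characterising an aleatory process by the moments of observed data can be sketched as follows (the samples here are simulated from an assumed Gaussian process, not real measurements):

```python
import random
import statistics

# Characterising aleatory uncertainty: simulate observations of a noisy
# process and summarise them by the first two moments of the data.
# The 'true' process parameters (mean 10, sigma 2) are invented.
random.seed(0)
samples = [random.gauss(mu=10.0, sigma=2.0) for _ in range(10_000)]

mean = statistics.fmean(samples)
std = statistics.stdev(samples)
print(f"estimated mean {mean:.2f}, standard deviation {std:.2f}")
```

With enough observations the estimated moments converge on the process's true values, which is exactly the sense in which we can come to know 'all' about a random process while still being unable to predict its next outcome.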
To use the coin toss example, having thrown the coin a thousand times we could express with great confidence, via a probability distribution, the probability of a head or tail occurring. Paradoxically, that is also all we can say about the next coin toss. So while we can characterise aleatory uncertainty very well, it also represents an irreducible boundary to our knowledge.
From a safety management perspective the obvious conclusion is that if a system carries an aleatory risk and is operated for long enough (or in enough copies), then eventually there will be an accident. Risk acceptance in such circumstances is really about whether the accident rate is acceptable over some defined duration of exposure.
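The point about exposure can be made concrete. Assuming a fixed per-mission accident probability (the rate below is invented for illustration), the chance of at least one accident grows inexorably with exposure:

```python
# With any non-zero aleatory accident rate, the probability of at least one
# accident over n exposures tends to 1: P = 1 - (1 - p)**n.
# The per-mission rate here is purely illustrative.
p = 1e-4  # assumed probability of an accident per mission

for n in (100, 1_000, 10_000, 100_000):
    p_accident = 1 - (1 - p) ** n
    print(f"{n:>7} missions: P(at least one accident) = {p_accident:.3f}")
```

At 10,000 missions the cumulative probability is already about 0.63, despite the 1-in-10,000 per-mission rate, which is why risk acceptance only makes sense relative to a stated duration of exposure.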
Epistemic uncertainty, on the other hand, represents a more general and resolvable lack of knowledge. We lack knowledge, but are aware of the lack, so the opportunity presents itself to reduce the uncertainty. Taking our coin toss example, we don't know at the outset whether the coin is true, but as soon as we start to toss it we get a feel for its trueness; the longer we toss the coin, the better our feel for the 'true' of the coin and the greater the reduction in epistemic uncertainty.
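This narrowing of epistemic uncertainty can be sketched with a textbook Bayesian update, placing a Beta prior over the coin's unknown bias (the toss counts below are invented):

```python
# Bayesian updating of belief about a coin's bias: a Beta(a, b) prior is
# conjugate to the binomial likelihood, so each batch of tosses simply adds
# the observed heads/tails counts to the parameters. The posterior variance
# shrinks as evidence accumulates -- epistemic uncertainty is reducible.
def update(a, b, heads, tails):
    return a + heads, b + tails

def beta_mean_var(a, b):
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

a, b = 1, 1  # uniform prior: no idea whether the coin is true
for heads, tails in [(6, 4), (55, 45), (480, 520)]:  # illustrative batches
    a, b = update(a, b, heads, tails)
    mean, var = beta_mean_var(a, b)
    print(f"after {a + b - 2} tosses: mean bias {mean:.3f}, variance {var:.6f}")
```

Each batch of tosses tightens the posterior, which is the formal counterpart of 'getting a feel for the true of the coin'; contrast this with the aleatory uncertainty of the next toss, which no amount of data removes.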
If we choose instead to make an assumption, in this case that the coin is true, we accept a level of epistemic uncertainty (2). We may make that assumption because we lack data, or to simplify a problem, but in either case the uncertainty we introduce also carries risk. The first risk is that the assumption is plain wrong; the second, and more subtle, is that assumptions often reflect some specific context, and if the context changes the assumption no longer holds.
Epistemic uncertainty is a significant problem for safety analyses because while some data may be 'hard', such as the failure rates of specific components, other data is more uncertain, for example human error rates or the distribution of environmental effects. Epistemic uncertainty becomes even more dominant when we are asked to evaluate the likelihood of a rare event for which little or no empirical data exists, and for which we become more reliant on subjective 'expert estimation', with all the potential for bias that this entails. Neglecting such limitations and biases can in turn lead to results such as 'tail truncation', where the likelihood of extreme events is significantly underestimated, or discounted completely. In such circumstances, considering the future from a possibilistic rather than probabilistic standpoint may turn out to be the wiser course.
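A toy simulation illustrates tail truncation: when an event's true rate is well below one over the sample size, a purely empirical estimate will usually report exactly zero, and the tail simply vanishes from the analysis (the rate and sample size below are invented):

```python
import random

# Tail truncation in miniature: estimating a rare event's frequency from a
# limited sample. With a true rate of 1-in-10,000 and only 1,000 observations,
# the empirical estimate is usually exactly zero.
random.seed(1)
true_rate = 1e-4
sample_size = 1_000

observed = sum(random.random() < true_rate for _ in range(sample_size))
estimate = observed / sample_size
print(f"true rate {true_rate}, empirical estimate {estimate}")
```

An analyst who takes the zero at face value has discounted the extreme event completely, even though the data is entirely consistent with the true rate.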
Ontological uncertainty lies at the far end of our continuum and represents a state of complete ignorance. Not only do we not know, but we don't even know what we don't know. While the truth is out there, we cannot access it because we simply don't know where to look in the first instance. Here my coin toss analogy breaks down, so we may have to turn to a hypothetical game of cards to illustrate. Say you sit down to play a game of cards with a number of other players. You don't know the rules or the suits of cards in play, but as you play you start to get a feel for both the rules and the value of the cards. Unknown to you, however, the dealer holds a single 'black swan' card, which he can play against a specific, although rare, turn of hands. You think you know all the cards and have correctly inferred the rules; certainly the long run of hands you have seen supports your 'theory of the game', but the dealer can still play the black swan card. This illustrates the classic decision-making quandary of 'we don't know what we don't know'. Can we peer into the darkness and determine the unknowable risk? Maybe, maybe not; but if we can't, then we might take heed of Collingridge and try to pursue corrigibility in our decisions.
Where we do things for the first time and operate in a state of ignorance (NASA calls it operating outside the experience base) we create a situation ripe with ontological uncertainty and risk. A good example of how such uncertainties can come home to roost is the Tacoma Narrows disaster: in designing the bridge the designers were unaware of the aerodynamic flutter effect to which the bridge's span-to-width ratio (much greater than any previous bridge's) made it vulnerable. In general, when we introduce new technologies we are working with concepts, principles and techniques with which we are fundamentally unfamiliar, and these carry a higher degree of ontological uncertainty than precedented, mature technologies. In such circumstances, applying a precautionary principle may be advisable.
A subtler problem with assumptions is that they are often implicit rather than explicit, so they become an invisible and unquestioned part of the landscape: the unknown 'knowns' of Don Rumsfeld's risk model. Invisible, that is, until the system's context changes and the assumptions become invalid; see the Ariane 501 mission as a poster child for this effect. Being implicit, they are not easily accessible to us, and it is this lack of visibility that transforms epistemic uncertainty (an original assumption) into ontological uncertainty (a hidden assumption). Undetected design errors in what is assumed to be a correct design are another specific example of ontological uncertainty. To tackle the risk that assumptions pose we might apply the better wrong than vague principle.
Where knowledge is available 'somewhere' but we have no way to access it, such failures of communication breed ignorance, ontological uncertainty and therefore increased risk. An example is the breakdown of communication between engineers and managers in the months leading up to the Challenger launch decision.
Perhaps the greatest challenge in safety management is the insoluble problem of determining whether we have completely identified all the hazards in a system. This is in reality a specific instance of Hume's problem of induction: as Hume pointed out, there is no way to logically prove the general case from any limited set of observations. The completeness problem is in turn compounded by a cognitive weakness called the omission neglect effect; we tend to focus on what we see and fail to think about what is unseen. Couple this with the problem of computing probabilities for rare events, such as truly catastrophic accidents, as well as the narrative fallacy, and we end up with events that, when they occur, are both unexpected and disproportionate in effect. Nassim Taleb spends a lot of time considering the risk of these Black Swan events in their various guises.
The role of complexity and scale
Complexity in and of itself is not a direct cause of accidents, but what complexity does do is breed epistemic and ontological uncertainty. Complex systems are difficult to completely understand and model (4), usually contain more assumptions (which are usually also implicit), and are more likely to contain unidentified design errors. See the 'Toyota spaghetti monster disaster' for an example of the nightmares unbounded complexity can breed. All of which leads to greater epistemic and ontological risk, so simplifying a system design potentially offers more 'bang for the buck' in terms of enhancing safety. Allied to the problem of complexity is that of scale: the larger the set of things we have to consider, the greater the effort, so to make such analyses tractable we should strive for economy of method. Pursuing both simplicity and economy of course has an obvious synergistic effect.
Dealing with the uncertainties
Because we express aleatory uncertainty as process variability over a series of trials, aleatory risk is always expressed in relation to a duration of exposure. The classical response to such variability is to build in redundancy, such as backup components or additional design margins, that over the duration of exposure reduces risk to an acceptable level. But with aleatory risk we also hit a fundamental limit of control: while we can reduce the risk exposure by, for example, introducing redundancy, if we keep playing the game long enough eventually we'll lose.
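Both the benefit and the limit of redundancy can be sketched in a few lines. Duplicating a channel sharply reduces the per-mission failure probability, yet the long-run probability of at least one failure still tends to one (the component reliability figures are invented):

```python
# Parallel redundancy: the system fails only if every one of k independent
# channels fails, so per-mission failure probability drops from p to p**k.
# Over enough missions, though, P(at least one system failure) still -> 1.
# The per-channel rate and mission count are purely illustrative.
p = 1e-2  # assumed per-mission failure probability of one channel
missions = 1_000_000

for k in (1, 2, 3):            # number of redundant channels
    p_sys = p ** k             # per-mission system failure probability
    p_ever = 1 - (1 - p_sys) ** missions
    print(f"{k} channel(s): per-mission {p_sys:.0e}, "
          f"P(failure within {missions:,} missions) = {p_ever:.4f}")
```

Triplication buys four orders of magnitude per mission, but over a million missions the cumulative failure probability is still around 0.63: redundancy reshapes the exposure curve, it does not escape it.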
If epistemic and ontological uncertainty represent our lack of knowledge, then reducing such risks requires us to improve our knowledge of the system of interest, or to avoid situations that increase these types of uncertainty. In reducing such uncertainty we are in essence seeking to reduce uncertainty in our model of system behaviour (ontological) or in the model's parameters (epistemic) (3). Looking at these classes of uncertainty, we see that reducing their associated risk provides a theoretical justification for some well-worn principles of safety engineering, e.g.:
- focus on reducing the severity to reduce the impact of uncertainty over likelihood (E) and unknown causes (O),
- use multiple and diverse models or risk assessments (E,O),
- avoid complexity in order to reduce epistemic (E) and ontological uncertainty (O),
- adopt economy of method, to reduce the amount of things we have to consider (O),
- use components with well characterised performance and failure modes (E,O) (5),
- use well understood and rigorous design methods to minimise the introduction of error (O),
- adopt a better wrong than vague approach (O),
- explicitly document design assumptions so that their interaction with changing contexts can be considered (O),
- use frameworks to identify possible holes in our models (O),
- cultivate safety imagination as an aid to overcome organisational framing effects (O),
- use small accidents and near misses to tell us where the edge of the safety envelope is (E,O),
- if it’s reasonably practicable to take a precaution, take it (E,O),
- engage in horizon scanning and alertness to small changes as indicators of gaps in our knowledge (O), or
- recognise where we are breaking new ground and pursue corrigibility of design decisions (O).
Some tentative conclusions
The difference between these types of risk goes some way towards explaining both our early successes in improving safety and the challenges that currently face us. Initially we focused upon the aleatory risk presented by the random failure of system components. Through the improvement of reliability, the use of redundant components and increased design margins to handle environmental variation, significant gains in safety were made. But as the safety of systems improved through the reduction of aleatory risk, epistemic and ontological risks have become proportionally greater causes of system accidents. At the same time our systems have grown in both complexity and size to meet demands for increased capability, thereby increasing the potential sources of ontological uncertainty (6).
1. Uncertainty can be over the parameters of a model (e.g. a failure probability) or the incompleteness of the model itself (e.g. neglecting common cause effects in a fault tree).
2. This set of assumptions about the world forms the system’s ‘context’ of use. If we change the context of use we may invalidate the assumptions.
3. Model uncertainty is, in Don Rumsfeld’s terms, what we ‘don’t know we don’t know’ about the process (if we did, we’d include it in the model), while parameter uncertainty represents what we ‘know we don’t know’ about a process.
4. They belong to a class of systems exhibiting organised complexity. Because of the combination of their complexity and low numbers we can neither apply simplifying assumptions nor traditional statistical methods to establish the risk of operation. To understand systems of organised complexity we therefore end up constructing models of the system and relying on simulation to predict their behaviour.
5. For example, in some safety applications the use of electro-mechanical switches rather than solid state circuits is preferred because signal connections and potential failure modes are more visible.
6. For example the introduction of active redundancy or fault tolerance schemes to meet a dependability target carries with it additional complexity and therefore epistemic and/or ontological risk.