The Risk of losing any sum is the reverse of Expectation; and the true measure of it is, the product of the Sum adventured multiplied by the Probability of the Loss.
Abraham de Moivre, De Mensura Sortis, 1711 in the Ph. Trans. of the Royal Society
One of the perennial challenges of system safety is that for new systems we generally do not have statistical data on accidents. High consequence events are, we hope, quite rare leaving us with a paucity of information. So we usually end up basing any risk assessment upon low base rate data, and having to fall back upon some form of subjective (and qualitative) method of risk assessment. Risk matrices were developed to guide such qualitative risk assessments and decision making, and the form of these matrices is based on a mix of classical decision and risk theory. The matrix is widely described in safety and risk literature and has become one of the less questioned staples of risk management. Yet despite it’s long history you still see plenty of poorly constructed risk matrices out there, in both the literature and standards. So this post attempts to establish some basic principles of construction as an aid to improving the state of practice and understanding.
The iso-risk contour
Based on de Moivre’s definition we can define a series of curves that represent the same risk level (termed iso-risk contours) on a two dimensional surface. While this is mathematically correct it’s difficult for people to use in making qualitative evaluations. So decision theorists took the iso-risk contours and zoned them into judgementally tractable cells (or bins) to form the risk matrix. In this risk matrix each cell notionally represent a point on the iso-risk curve and steps in the matrix define the edges of ‘risk zones’. We can also usually plot the curve using log log axes which provide straight line contours which gives us a matrix that looks like the one below.
This binning is intended to make qualitative decisions as to severity or likelihood more tractable as human beings find it easier to select qualitative values from amongst such bins. But unfortunately binning also introduces ambiguity into the risk assessment, if you look at the example given above you’ll see that the iso-risk contour runs through the diagonal of the cell, so ‘off diagonal’ the cell risk is lesser or greater depending on which side of the contour your looking at. So be aware that when you bin a continuous risk contour you pay for easier decision making with by increasing the uncertainty of the underlying assigned risk.
Scaling likelihood and severity
The next problem that faces us in developing a risk matrix is assigning a scale to the severity and likelihood bins. In the example above we have a linear (log log) plot so the value of each succeeding bin’s median point should go up by an order of magnitude. In qualitative terms this defines an ordinal scale ‘probable’ in the example above to be an order of magnitude more likely as ‘remote’, while ‘catastrophic’ another order of magnitude greater in severity than ‘critical’ and so on. There are two good reasons to adopt such a scaling. The first is that it’s an established technique to avoid qualitative under-estimation of values, people generally finding it easier to discriminate between values separated by an order of magnitude than by a linear scale. The second is that if we have a linear iso-risk matrix then (by definition) the scales should also be logarithmic to comply with De Moivre’s equation. Unfortunately, you’ll also find plenty of example linear risk contour matrices that unwittingly violate de Moivre’s equation. Australian Standard AS 4360 as an example. While such matrices may reflect a conscious decision making strategy, for example sensitivity to extreme severities, they don’t reflect De Moivre’s theorem and the classical definition of risk so where you depart you need to be clear as to why you are (dealing with non-ergodic risks is a good one).
Cell numbers and the 4 to 7 rule
Another decision with designing a risk matrices is how many cells to have, too few and the evaluation is too granular, too many and the decision making becomes bogged down in detailed discrimination (which as noted above is hardly warranted). The usual happy compromise sits at around 4 to 7 bins on the vertical and horizontal. Adopting a logarithmic scale gives a wide field of discrimination for a 4 to 7 matrix, but again there’s a trade-off, this time between sizing the bins in order to reduce cognitive workload for the assessor and the information lost by doing so.
The semiotics of colour
The problem with a matrix is that it can only represent two dimensions of information on one graph. Thus a simple matrix may allow us to communicate risk as a function of frequency (F) and severity (S) but we still need a way to graphically associate decision rules with the identified risk zones. The traditional method adopted to do this is to use colour to designate the specific risk zones and a colour key the associated action required. As the example above illustrates the various decision zones of risk are colour coded, with the highest risks being given the most alarming colours. While colour has inherently a strong semiotic meaning, and one intuitively understood by both expert and lay person alike, there is also a potential trap in that by setting the priorities using such a strong code we are subtly forcing an imperative. In these circumstances a colourised matrix can become a tool of persuasion rather than one of communications (Roth 2012). One should therefore carefully consider what form of communication the matrix is intended to support.
Ensure consistency of ordering
A properly formed matrices risk zones should also exhibit what is called ‘weak consistency’, that is the qualitative ordering of risk as defined by the various (coloured) zones and their cells ranks various risks (form high to low) in roughly the same way that a quantitative analysis would do so (Cox 2008). In practical terms what this means is that if you find there are areas of the matrix where the same effort will produce a greater or lesser improvement of risk when compared to another area (Clements 96) you have a problem of construction. You should also (in principal) never be able to jump across two risk decision zones in one movement. For example if a mitigation only reduces the likelihood of occurrence of a hazard only we wouldn’t expect the risk reduction to change as the severity of the risk changed.
Dealing with boundary conditions
In some standards, such as MIL-STD-882, an upper arbitrary bound may be placed on severity, in which case what happens when a severity exceeds the ‘allowable’ threshold? For example, should we discriminate between a risk to more than one person versus one to a single person? For specific domains this may not be a problem, but for others where mass casualty events are a possibility it may well be of concern. In this case if we don’t wish to add columns to the matrix we may define a sub-cell within our matrix to reflect that this is a higher than we thought level of risk. Alternatively we could define the severity for the ‘I’ column as defining a range of severities whose values include at the median point the mandated severity. So for example the catastrophic hazard bin range would be from 1 fatality to 10 fatalities. Looking at likelihood one should also include a likelihood of ‘incredible’ for risks that have been removed, and where there is no credible likelihood of their occurrence, can be recorded rather than deleted. After all just because we’ve retired a hazard today, doesn’t mean a subsequent design change or design error won’t resurrect the hazard to haunt us.
Calibrating the risk matrix
Risk matrices are used to make decisions about risk, and their acceptability. But how do we calibrate the risk matrix to represent an understandably acceptable risk? One way is to pick a specific bin and establish a calibrating risk scenario for it, usually drawn from the real world, for which we can argue the risk is considered broadly acceptable by society (Clements 96). So in the matrix above we could pick cell IE and equate that to an acceptable real world risk that could result in the death of a person (a catastrophic loss). For example, ‘the risk of death in an motor vehicle accident on the way from and to your work on main arterial roads under all weather conditions cumulatively over a 25 year working career‘. This establishes the edge of the acceptable risk zone by definition and allows use to define other risk zones. In general it’s always a good idea to provide a description of what each qualitative bin means so that people understand the meaning. If you need to one can also include numerical ranges for likelihood and severity, such as the loss values in dollars, numbers of injuries sustained and so on.
Define your exposure
One should also consider and define the number of units, people or systems exposed, clearly there is a difference between the cumulative risk posed by say one aircraft and a fleet of one hundred aircraft in service or between one bus and a thousand cars. What may be acceptable at an individual level (for example road accident) may not be acceptable at an aggregated or societal level and risk curves may need to be adjusted accordingly. MIL-STD-882C offers a simple example of this approach.
And then define your duration
Finally and perhaps most importantly you always need to define the duration of exposure for likelihood. Without it the statement is at best meaningless and at worst misleading as different readers will have different interpretations. A 1 in 100 probability of loss over 25 years of life is a very different risk to a 1 in 100 over a single eight hour flight.
A risk matrix is about making decisions so it needs to support the user in that regard, but, it’s use as part of a risk assessment should not be seen as a means of acquitting a duty of care. The principle of ‘so far as is reasonable practicable’ cares very little about risk in the first instance, asking only whether it would be considered reasonable to acquit the hazard. Risk assessments belong at the back end of a program when, despite our efforts we have residual risks to consider as part of evaluating our efforts in achieving a reasonably practicable level of safety. A fact that modern decision makers should keep firmly in mind.
Clements, P. Sverdrup System Safety Course Notes, 1996.
Cox, L.A. Jr., ‘What’s Wrong with Risk Matrices?’, Risk Analysis, Vol. 28, No. 2, 2008.
Cox, S., Tait, R., Safety, Reliability & Risk Management, 2nd Ed., Butterworth, Heinemann, 1998.
MIL-STD-882, System Safety Program Requirements.
Leveson, N., System Safety and Computers – A Guide to Preventing Accidents and Losses Caused by Technology, Addison Wesley, 1995.
Roth, Florian., Focal Report 9: Risk Analysis Visualizing Risk: The Use of Graphical Elements in Risk Analysis and Communications, Risk and Resilience Research Group Center for Security Studies (CSS), Zürich 2012.