With the NSW Rural Fire Service fighting more than 50 fires across the state, and the unprecedented hellish conditions set to deteriorate even further with the arrival of strong winds, the question of the day is: exactly how bad could this get? The answer is, unfortunately, a whole lot worse. That’s because we as human beings have difficulty thinking about and dealing with extreme events… to quote from this post, written in the aftermath of the 2009 Victorian Black Saturday fires.
So how unthinkable could it get? The likelihood of a fire versus its severity can be credibly modelled as a power law, a particular type of heavy-tailed distribution (Clauset et al. 2007). This means that extreme events in the tail of the distribution are far more likely than a Gaussian (the classic bell curve) distribution would predict. So while a mega-fire ten times the size of the Black Saturday fires is far less likely, it is not nearly as improbable as our intuitive availability heuristic would indicate. In fact it’s much worse than we might think: with heavy-tailed distributions you need to apply what’s called the mean excess function, which really translates to the next worst event almost always being much worse…
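To make the contrast concrete, here’s a small sketch comparing the tail of a power law (Pareto) distribution with a Gaussian, and the mean excess function that drives the ‘next worst is much worse’ effect. The parameter values (`x_min`, `alpha`, a unit-variance Gaussian) are illustrative assumptions, not fitted to any fire data.

```python
import math

def pareto_tail(x, x_min=1.0, alpha=2.0):
    """Survival function P(X > x) for a Pareto (power law) distribution."""
    return (x_min / x) ** alpha if x > x_min else 1.0

def gaussian_tail(x, mu=0.0, sigma=1.0):
    """Survival function P(X > x) for a Gaussian distribution."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

def pareto_mean_excess(u, alpha=2.0):
    """E[X - u | X > u] for a Pareto with alpha > 1: the expected amount by
    which the next exceedance overshoots the threshold u grows linearly
    with u, i.e. the next worst event keeps getting worse."""
    return u / (alpha - 1)

# An event 10 'standard' units out: vanishingly rare under the bell curve,
# merely uncommon under the power law.
print(f"Gaussian P(X > 10): {gaussian_tail(10.0):.2e}")
print(f"Pareto   P(X > 10): {pareto_tail(10.0):.2e}")
print(f"Mean excess beyond a 100-unit event: {pareto_mean_excess(100.0):.0f}")
```

The Gaussian tail at that distance is smaller than the Pareto tail by more than twenty orders of magnitude, which is why intuition calibrated on bell curves fails so badly here.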
So how did we get to this? Simply put, the extreme weather we’ve been experiencing is a tangible, present-day effect of climate change. Climate change is not something we can leave to our children to worry about, it’s happening now. That half a degree rise in global temperature? Well, it turns out it supercharges the occurrence rate of extremely dry conditions and the heavy tail of bushfire severity. Yes, we’ve been twisting the dragon’s tail and now it’s woken up…
2019 Postscript: Monday 11 November 2019 – NSW
And here we are in 2019, two years down the track from the fires of 2017, and tomorrow looks like being a beyond-catastrophic fire day. Firestorms are predicted.
To err is human, but to really screw it up takes a team of humans and computers…
How did a state-of-the-art cruiser operated by one of the world’s superpowers end up shooting down an innocent passenger aircraft? To answer that question (at least in part) here’s a case study, part of the system safety course I teach, that looks at some of the causal factors in the incident.
In the immediate aftermath of this disaster there was a lot of reflection, and work done, on how humans and complex systems interact. However, one question that has so far gone unasked is simply this: what if the crew of the USS Vincennes had just used the combat system as it was intended? What would have happened if they’d implemented a doctrinal ruleset that reflected the rules of engagement they were operating under and simply let the system do its job? After all, it was not the software that confused altitude with range on the display, or misused the IFF system, or was confused by track IDs being recycled… no, that was the crew.
If you’re interested in observation selection effects, Nick Bostrom’s classic on the subject is (I now find out) available online here. A classic example of such effects is Wald’s work on aircraft survivability in WWII. A naive observer would seek to protect those parts of the returning aircraft that were most damaged; Wald’s insight was that these were in fact the least critical areas of the aircraft, and that the areas not damaged were the ones that should be reinforced.
I’ve just finished reading an interesting post by Andrew Rae on the missing aspects of engineering education (Mind the Feynman gap) which parallels my more specific concerns, and possibly unkinder comments, about the lack of professionalism in the software community.
One of the tropes I’ve noticed in design projects over the years is the tendency of engineers to instinctively jump from need to a single conceptual solution. Unfortunately that initial solution rarely stands the test of time, and inevitably at some crisis point there’s a recognition that it will not work, and the engineers go back to change the concept, often junking it completely.
Gregory (Scotland Yard detective): “Is there any other point to which you would wish to draw my attention?” Holmes: “To the curious incident of the dog in the night-time.” Gregory: “The dog did nothing in the night-time.” Holmes: “That was the curious incident.”
What you pay attention to dictates what you’ll miss
The point the great detective was making was that the absence of something was the evidence the Scotland Yard detective had overlooked. Holmes, of course, using imagination and intuition, identified that this absence was in fact the vital clue. Such a plot device works marvellously well because almost all of us, like Detective Gregory, fail to recognise that such an absence is actually ‘there’ in a sense, let alone that it’s important.
In a slight segue, I was reading Bruce Schneier’s blog on security and came across this post on the psychology behind fraud. Bruce points to this post on why, yes I know, ‘good people do bad things’. The explanation that researchers such as Ann Tenbrunsel of Notre Dame offer is that in the same way we are boundedly rational in other aspects of decision making, so too are we in our ethical decision making.
In particular, the way in which decision problems are framed seems to have a great impact upon how we make decisions. Basically, if a problem is framed without an ethical dimension then decision makers are much less likely to consider that aspect.
In addition to framing effects, researchers studying collusion in fraud cases found that most people seem to act from an honest desire simply to help others, regardless of any attendant ethical issues.
What fascinates me is how closely such research parallels the work on system safety and human error. Clearly, if management works within a frame based upon performance and efficiency, they are simply going to overlook the downside completely, while in a desire to be helpful everyone else ‘goes along for the ride’.
There is, as I see it, a concrete recommendation that comes out of this research that we can apply to safety: fundamentally, safety management systems need to be designed to take account of our weaknesses as boundedly rational actors.
One of the perennial issues in regulating the safety of technological systems is how prescriptively one should write the regulations. At one end of the spectrum is a rule based approach, where very specific norms are imposed and at least in theory there is little ambiguity in either their interpretation or application. At the other end you have performance standards, which are much more open-ended, allowing a regulator to make circumstance specific determinations as to whether the standard has been met. Continue Reading…
The NTSB has released its interim report on the Boeing 787 JAL battery fire, and it appears that Boeing’s initial safety assessment had concluded that the only way in which a battery fire would eventuate was through overcharging. Continue Reading…
Presumably the use of the crew cab as an escape pod was not actually high on the list of design goals for the 4000 and 4100 class locomotives, and thankfully the locomotives involved in the recent derailment at Ambrose were unmanned.
Earlier reports on the initial development of the Boeing 787 lithium battery indicated that Boeing engineers had conducted tests to confirm that a single cell failure would not lead to a cascading thermal runaway amongst the remaining cells. According to these reports their tests were successful, so what went wrong?
One of the things that’s concerned me for a while is the potentially malign narrative power of a published safety case. For those unfamiliar with the term, a safety case can be defined as a structured argument supported by a body of evidence that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment. And I have not yet read a safety case that didn’t purport to be exactly that.
The following is an extract from Kevin Driscoll’s Murphy Was an Optimist presentation at SAFECOMP 2010. Here Kevin does the maths to show how a lack of exposure to failures over a small sample size of operating hours leads to a normalcy bias amongst designers and a rejection of proposed failure modes as ‘not credible’. The reason I find it of especial interest is that it gives, at least in part, an empirical argument to why designers find it difficult to anticipate the system accidents of Charles Perrow’s Normal Accident Theory. Kevin’s argument also supports John Downer’s (2010) concept of Epistemic accidents. John defines epistemic accidents as those that occur because of an erroneous technological assumption, even though there were good reasons to hold that assumption before the accident. Kevin’s argument illustrates that engineers as technological actors must make decisions in which their knowledge is inherently limited and so their design choices will exhibit bounded rationality.
In effect, the higher the dependability of a system the greater the mismatch between designer experience and system operational hours, and therefore the tighter the bounds on the rationality of design choices and their underpinning assumptions. The tighter the bounds, the greater the effect cognitive biases will have, e.g. falling prey to the normalcy bias. Of course there are other reasons for such bounded rationality; see Logic, Mathematics and Science are Not Enough for a discussion of these.
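A back-of-the-envelope sketch of the mismatch, in the spirit of Kevin’s argument. The failure rate and exposure figures below are assumptions chosen purely to illustrate the orders of magnitude involved, assuming a constant-rate (Poisson) failure process.

```python
import math

def p_at_least_one(rate_per_hour, exposure_hours):
    """Probability of observing at least one failure over the given exposure,
    assuming failures arrive as a Poisson process with a constant rate."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

RATE = 1e-7            # assumed failure rate, per operating hour
DESIGNER_HOURS = 2e4   # a rough career's worth of test/observation hours (assumed)
FLEET_HOURS = 1e9      # cumulative fleet exposure of a dependable system (assumed)

# The designer almost certainly never sees the failure; the fleet almost
# certainly does. Hence 'not credible' to the designer, inevitable in service.
print(f"Designer ever sees it: {p_at_least_one(RATE, DESIGNER_HOURS):.4f}")
print(f"Fleet ever sees it:    {p_at_least_one(RATE, FLEET_HOURS):.4f}")
```

With these (assumed) numbers the designer has roughly a 0.2% chance of ever personally witnessing the failure mode, while across the fleet’s operational life its occurrence is a near certainty.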
The ‘Oh #%*!’ moment captured above definitely qualifies for the vigorous application of the rule that when the fire’s too hot, the water’s too deep or the smoke’s too thick, leave. 🙂
But in fact in this incident the pilot actually had to convince the navigator that he needed to leave ‘right now!’. The navigator, it turned out, was so fixated on shutting down the aircraft’s avionics systems that he didn’t realise how bad things were, nor recognise that immediate evacuation was the correct response.
The Risk of losing any sum is the reverse of Expectation; and the true measure of it is, the product of the Sum adventured multiplied by the Probability of the Loss.
Abraham de Moivre, De Mensura Sortis, 1711 in the Ph. Trans. of the Royal Society
One of the perennial challenges of system safety is that for new systems we generally do not have statistical data on accidents. High consequence events are, we hope, quite rare, leaving us with a paucity of information. So we usually end up basing any risk assessment upon low base rate data, and having to fall back upon some form of subjective (and qualitative) method of risk assessment. Risk matrices were developed to guide such qualitative risk assessments and decision making, and the form of these matrices is based on a mix of classical decision and risk theory. The matrix is widely described in the safety and risk literature and has become one of the less questioned staples of risk management. Yet despite its long history you still see plenty of poorly constructed risk matrices out there, in both the literature and standards. So this post attempts to establish some basic principles of construction, as an aid to improving the state of practice and understanding.
The iso-risk contour
Based on de Moivre’s definition we can define a series of curves that represent the same risk level (termed iso-risk contours) on a two-dimensional surface. While this is mathematically correct, it’s difficult for people to use in making qualitative evaluations. So decision theorists took the iso-risk contours and zoned them into judgementally tractable cells (or bins) to form the risk matrix. In this matrix each cell notionally represents a point on an iso-risk curve, and steps in the matrix define the edges of ‘risk zones’. Plotting the curves on log-log axes gives straight-line contours, which in turn gives us a matrix like the one below.
This binning is intended to make qualitative decisions as to severity or likelihood more tractable, as human beings find it easier to select qualitative values from amongst such bins. Unfortunately binning also introduces ambiguity into the risk assessment: if you look at the example given above you’ll see that the iso-risk contour runs through the diagonal of each cell, so ‘off diagonal’ the cell risk is lesser or greater depending on which side of the contour you’re looking at. So be aware that when you bin a continuous risk contour you pay for easier decision making with increased uncertainty in the underlying assigned risk.
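The cost of binning can be shown numerically. The sketch below uses assumed order-of-magnitude bins (the medians and units are illustrative, not from any particular standard) and demonstrates that two hazards landing in the same matrix cell can nonetheless differ widely in their underlying de Moivre risk.

```python
import math

def cell_index(value, bin_medians):
    """Map a continuous value to the nearest order-of-magnitude bin,
    measured by distance in log space."""
    return min(range(len(bin_medians)),
               key=lambda i: abs(math.log10(value) - math.log10(bin_medians[i])))

# Assumed bin medians: likelihood (per hour) and severity (loss units)
LIKELIHOOD_BINS = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
SEVERITY_BINS = [1e1, 1e2, 1e3, 1e4, 1e5]

def risk(p, s):
    """de Moivre: risk = probability of the loss x size of the loss."""
    return p * s

# Two hazards near opposite corners of the same cell...
p_lo, s_lo = 5e-7, 50.0
p_hi, s_hi = 3e-6, 300.0

same_cell = (cell_index(p_lo, LIKELIHOOD_BINS) == cell_index(p_hi, LIKELIHOOD_BINS)
             and cell_index(s_lo, SEVERITY_BINS) == cell_index(s_hi, SEVERITY_BINS))
print(same_cell)  # both map to the same matrix cell
print(risk(p_hi, s_hi) / risk(p_lo, s_lo))  # yet their risks differ by a factor of 36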
Scaling likelihood and severity
The next problem that faces us in developing a risk matrix is assigning a scale to the severity and likelihood bins. In the example above we have a linear (log-log) plot, so the value of each succeeding bin’s median point should go up by an order of magnitude. In qualitative terms this defines an ordinal scale: ‘probable’ in the example above is an order of magnitude more likely than ‘remote’, while ‘catastrophic’ is an order of magnitude greater in severity than ‘critical’, and so on. There are two good reasons to adopt such a scaling. The first is that it’s an established technique to avoid qualitative under-estimation of values, people generally finding it easier to discriminate between values separated by an order of magnitude than on a linear scale. The second is that if we have a linear iso-risk matrix then (by definition) the scales should also be logarithmic to comply with de Moivre’s equation. Unfortunately, you’ll also find plenty of example linear risk contour matrices that unwittingly violate de Moivre’s equation, Australian Standard AS 4360 being one example. While such matrices may reflect a conscious decision making strategy, for example sensitivity to extreme severities, they don’t reflect de Moivre’s theorem and the classical definition of risk, so where you do depart you need to be clear as to why (dealing with non-ergodic risks is a good reason).
Cell numbers and the 4 to 7 rule
Another decision in designing a risk matrix is how many cells to have: too few and the evaluation is too coarse, too many and the decision making becomes bogged down in detailed discrimination (which as noted above is hardly warranted). The usual happy compromise sits at around 4 to 7 bins on each of the vertical and horizontal axes. Adopting a logarithmic scale gives a wide field of discrimination for a 4-to-7 matrix, but again there’s a trade-off, this time between sizing the bins to reduce cognitive workload for the assessor and the information lost by doing so.
The semiotics of colour
The problem with a matrix is that it can only represent two dimensions of information on one graph. Thus a simple matrix may allow us to communicate risk as a function of frequency (F) and severity (S), but we still need a way to graphically associate decision rules with the identified risk zones. The traditional method is to use colour to designate the specific risk zones, with a colour key for the associated action required. As the example above illustrates, the various decision zones of risk are colour coded, with the highest risks being given the most alarming colours. While colour inherently has a strong semiotic meaning, one intuitively understood by expert and lay person alike, there is also a potential trap: by setting the priorities using such a strong code we are subtly forcing an imperative. In these circumstances a colourised matrix can become a tool of persuasion rather than one of communication (Roth 2012). One should therefore carefully consider what form of communication the matrix is intended to support.
Ensure consistency of ordering
A properly formed matrix’s risk zones should also exhibit what is called ‘weak consistency’; that is, the qualitative ordering of risk as defined by the various (coloured) zones and their cells should rank risks (from high to low) in roughly the same way that a quantitative analysis would (Cox 2008). In practical terms, this means that if you find there are areas of the matrix where the same effort will produce a greater or lesser improvement of risk when compared to another area (Clements 96), you have a problem of construction. You should also (in principle) never be able to jump across two risk decision zones in one movement: for example, if a mitigation reduces only the likelihood of occurrence of a hazard, we wouldn’t expect the resulting risk reduction to change as the severity of the risk changed.
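Weak consistency can be checked mechanically. The sketch below encodes a small example matrix (the zone layout and bin medians are assumed, purely for illustration) and tests the core of Cox’s condition: no cell in a strictly higher zone may carry strictly lower quantitative risk than a cell in a lower zone.

```python
import itertools

# A small 3x3 example matrix (assumed): zone 2 = high, 1 = medium, 0 = low.
ZONES = [
    [1, 2, 2],   # most likely row
    [0, 1, 2],
    [0, 0, 1],   # least likely row
]
LIKELIHOODS = [1e-3, 1e-4, 1e-5]  # row medians, per hour (assumed)
SEVERITIES = [1e1, 1e2, 1e3]      # column medians, loss units (assumed)

def weakly_consistent(zones, likelihoods, severities):
    """Core of Cox's weak consistency: no cell in a strictly higher zone may
    have strictly lower quantitative (de Moivre) risk than a cell in a
    strictly lower zone."""
    cells = [(zones[i][j], likelihoods[i] * severities[j])
             for i in range(len(likelihoods)) for j in range(len(severities))]
    return all(not (za > zb and ra < rb)
               for (za, ra), (zb, rb) in itertools.product(cells, repeat=2))

print(weakly_consistent(ZONES, LIKELIHOODS, SEVERITIES))
```

Flip one corner cell of `ZONES` into the high zone and the check fails, flagging a matrix whose colours would mislead a quantitative reader.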
Dealing with boundary conditions
In some standards, such as MIL-STD-882, an arbitrary upper bound may be placed on severity, in which case what happens when a severity exceeds the ‘allowable’ threshold? For example, should we discriminate between a risk to more than one person and a risk to a single person? For specific domains this may not be a problem, but for others where mass casualty events are a possibility it may well be of concern. In this case, if we don’t wish to add columns to the matrix, we may define a sub-cell within the matrix to reflect that this is a higher than anticipated level of risk. Alternatively we could define the severity for the ‘I’ column as a range of severities whose median point is the mandated severity; so, for example, the catastrophic hazard bin would range from 1 fatality to 10 fatalities. Looking at likelihood, one should also include an ‘incredible’ likelihood bin so that risks that have been retired, and for which there is no credible likelihood of occurrence, can be recorded rather than deleted. After all, just because we’ve retired a hazard today doesn’t mean a subsequent design change or design error won’t resurrect it to haunt us.
Calibrating the risk matrix
Risk matrices are used to make decisions about risk and its acceptability. But how do we calibrate the matrix to represent an understandably acceptable risk? One way is to pick a specific cell and establish a calibrating risk scenario for it, usually drawn from the real world, for which we can argue the risk is considered broadly acceptable by society (Clements 96). So in the matrix above we could pick cell IE and equate it to an acceptable real world risk that could result in the death of a person (a catastrophic loss); for example, ‘the risk of death in a motor vehicle accident travelling to and from work on main arterial roads, under all weather conditions, cumulatively over a 25 year working career‘. This establishes the edge of the acceptable risk zone by definition and allows us to define the other risk zones. In general it’s always a good idea to provide a description of what each qualitative bin means so that people understand its meaning. If needed, one can also include numerical ranges for likelihood and severity, such as loss values in dollars, numbers of injuries sustained and so on.
Define your exposure
One should also consider and define the number of units, people or systems exposed: clearly there is a difference between the cumulative risk posed by, say, one aircraft and a fleet of one hundred aircraft in service, or between one bus and a thousand cars. What may be acceptable at an individual level (for example a road accident) may not be acceptable at an aggregated or societal level, and risk curves may need to be adjusted accordingly. MIL-STD-882C offers a simple example of this approach.
And then define your duration
Finally, and perhaps most importantly, you always need to define the duration of exposure for likelihood. Without it the statement is at best meaningless and at worst misleading, as different readers will have different interpretations. A 1 in 100 probability of loss over 25 years of life is a very different risk to 1 in 100 over a single eight hour flight.
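The difference is easy to quantify. The sketch below backs out the implied constant hourly rate from each statement, assuming an exponential (memoryless) model and an assumed ~2000 exposure hours per year for the 25-year case; both assumptions are purely illustrative.

```python
import math

def rate_from_interval_prob(p, hours):
    """Back out a constant hourly loss rate from 'probability p over N hours',
    assuming an exponential (memoryless) model."""
    return -math.log(1.0 - p) / hours

# 'A 1-in-100 chance over a 25-year working life' (assume ~2000 exposure hours/year)
career_rate = rate_from_interval_prob(0.01, 25 * 2000)

# ...versus '1 in 100 over a single eight hour flight'
flight_rate = rate_from_interval_prob(0.01, 8)

print(f"career hourly rate: {career_rate:.2e}")   # roughly 2e-7 per hour
print(f"flight hourly rate: {flight_rate:.2e}")   # roughly 1.3e-3 per hour
```

Under these assumptions the same ‘1 in 100’ phrase hides hourly rates more than three orders of magnitude apart, which is exactly why the exposure duration must be stated.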
A risk matrix is about making decisions, so it needs to support the user in that regard; but its use as part of a risk assessment should not be seen as a means of acquitting a duty of care. The principle of ‘so far as is reasonably practicable’ cares very little about risk in the first instance, asking only whether it would be considered reasonable to acquit the hazard. Risk assessments belong at the back end of a program when, despite our efforts, we have residual risks to consider as part of evaluating our efforts in achieving a reasonably practicable level of safety. A fact that modern decision makers should keep firmly in mind.
Clements, P. Sverdrup System Safety Course Notes, 1996.
What the economic theory of sunk costs tells us about plan continuation bias
Plan continuation bias is a recognised and subtle cognitive bias that tends to force the continuation of an existing plan or course of action, even in the face of changing conditions. In the field of aerospace it has been recognised as a significant causal factor in accidents, with a 2004 NASA study finding that in 9 of the 19 accidents studied, aircrew exhibited this behavioural bias. One explanation of this behaviour may be a version of the well known ‘sunk cost‘ economic heuristic.
What the Cry Wolf effect tells us about pilot’s problems with unreliable air data
In a recurring series of incidents air crew have consistently demonstrated difficulty in firstly identifying and then subsequently dealing with unreliable air data and warnings. To me figuring out why this difficulty occurs is essential to addressing what has become a significant issue in air safety. Continue Reading…
A near disaster in space 40 years ago serves as a salutary lesson on Common Cause Failure (CCF)
Two days after the launch of Apollo 13 an oxygen tank ruptured, crippling the Apollo service module upon which the astronauts depended for survival and precipitating a desperate life-or-death struggle. But leaving aside what was possibly NASA’s finest hour, the causes of this near disaster provide important lessons for designing damage-resistant architectures.
The Washington Post has discovered that concerns about the vulnerability of the Fukushima Daiichi plant to potential tsunami events were brushed aside at a review of nuclear plant safety conducted in the aftermath of the Kobe earthquake. Yet at other plants the Japanese Nuclear and Industrial Safety Agency (NISA) had directed the panel of engineers and geologists to consider tsunami events.
Oftentimes we make decisions as part of a group, and in that environment there is a strong possibility that the cohesiveness of the group leads members to minimise interpersonal conflict and reach a consensus at the expense of critically evaluating and testing ideas.
In a series of occasional posts on this blog, I’ve discussed some of the pitfalls of heuristics based decision making as well as the risks associated with decision making on incomplete information or in an environment of time pressure. As an aid to the reader I’ve provided a consolidated list here.
People tend to seek out and interpret information that reinforces their beliefs, especially in circumstances of uncertainty or when there is emotion attached to the issue. This bias is known as confirmatory or ‘myside’ bias. So what can you do to guard against the internal ‘yes man’ that is echoing back your own beliefs?
As the Latin root of the word risk indicates, an integral part of risk taking is the benefit we achieve. However, oftentimes decision makers do not have a clear understanding of the upside or payoff.
Disappointingly, the Black Saturday royal commission report makes no mention of the effect of cognitive biases upon making a ‘stay or go’ decision, instead assuming that such decisions are made in a completely rational fashion. As Black Saturday and other disasters show, this is rarely the case.
So, a year on from the Black Saturday fires, the royal commission established in their aftermath is working its way to a conclusion. While the commission has certainly been busy, I guess you could say that I was left unsatisfied by the recommendations.
We live under a government of men and morning newspapers.
While Smith and Madden’s argument turns out to be the usual denialist slumgullion it does serve as a useful jump off point for a discussion of the role of the media in propagating such pernicious memes (1) and more broadly in communicating risk. Continue Reading…
Fire has been an integral part of the Australian ecosystem for tens of thousands of years. Both the landscape and its native inhabitants have adapted to this periodic cycle of fire and regeneration. These fires are not bolts from the blue; they occur regularly and predictably, yet modern Australians seem to have difficulty understanding that their land will burn, regularly, and sometimes catastrophically.
So why do we studiously avoid serious consideration of the hazards of living in a country that regularly produces firestorms? Why, in the time of fire, do we go through the same cycle of shock, recrimination, exhortations to do better, diminishing interest and finally forgetfulness?
With a Bachelor’s in Mechanical Engineering and a Master’s in Systems Engineering, Matthew Squair is a principal consultant with Jacobs Australia. His professional practice is the assurance of safety, software and cyber-security, and he writes, teaches and consults on these subjects. He can be contacted at firstname.lastname@example.org