The aircraft has gone down in an area which is the undersea equivalent of the eastern slopes of the Rockies, well before anyone mapped them. Add to that a search area of thousands of square kilometres in about as isolated a spot as you can imagine, and a search zone interpolated from satellite pings, and you can see that it’s going to be tough.
Just received a text from my gas and electricity supplier. Good news! My gas and electricity bills will come down by about 4% and 8% respectively due to the repeal of the carbon tax in Australia. Of course we had to doom the planetary ecosystem and condemn our children to runaway climate change, but hey, think of the $550 we get back per year. And, how can it get any better, now we’re also seen as a nation of environmental wreckers. I think I’ll go and invest the money in that AGL Hunter coal seam gas project, y’know, thinking global, acting local. Thanks Prime Minister Abbott, thanks!
One of the underlying and unquestioned aspects of modern western society is that the power of the state is derived from a rational-legal authority, that is, in the Weberian sense of a purposive or instrumental rationality in pursuing some end. But what if it isn’t? What if the decisions of the state are based more on belief in how people ought to behave and how things ought to be than on reality? What, in other words, if the lunatics really are running the asylum?
As I was asked a question on risk homeostasis at the course I’m teaching, here without further ado is John Adams’s tour de force on The failure of seat belt legislation. Collectively, the group of countries that had not passed seat belt laws experienced a greater decrease than the group that had passed laws. Now John doesn’t directly draw the conclusion, but I will: seat belt laws kill more people than they save.
And it gets worse. In 1989 the British Government made seat belt wearing compulsory for children under 14 years old in the rear seats of cars. The result? In the year after, there was an increase of almost 10% in the number of children killed in rear seats, and of almost 12% in the number injured (both above background increases). Had the legislation not been enacted there would be young adults walking around today enjoying their lives, but of course it was passed and we have to live with the consequences.
Now I could forgive the well intentioned who passed these laws if, when it became apparent that they were having a completely contrary effect, they had repealed them. But what I can’t forgive is the blind persistence in practices that clearly kill more than they save. What can we make of this depraved indifference, other than that people and organisations will sacrifice almost anything and anyone rather than admit they’re wrong?
As Weick pointed out, to manage the unexpected we need to be reliably mindful, not reliably mindless. Obvious as that truism may be, those who invest heavily in plans, procedures, process and policy also end up perpetuating and reinforcing a whole raft of expectations about how the world is, thus investing in an organisational culture of mindlessness rather than mindfulness. Understanding that process inherently slides towards a state of organisational mindlessness, we can see that a process oriented risk management standard such as ISO 31000 perversely cultivates a climate of inattentiveness, right where we should be most attentive and mindful. Nor am I alone in my assessment of ISO 31000; see for example John Adams’s criticism of the standard as not fit for purpose, or Kaplan and Mikes’s assessment of ISO 31000 as essentially ‘not relevant’. Process is no substitute for paying attention.
Don’t get me wrong, there’s nothing inherently wrong with a small dollop of process, it’s just that its place is not centre stage in an international standard that purports to be about risk, not if you’re looking for an effective outcome. In real life it’s the unexpected, those black swans of Nassim Taleb’s flying in the dark skies of ignorance, that have the most effect, and about which ISO 31000 has nothing to say.
Also, the application of ISO 31000’s classical risk management to workplace health and safety may actually be illegal in some jurisdictions (like Australia) where legislation is based on a backwards looking principle of due diligence, rather than a prospective risk based approach.
John Adams has an interesting take on the bureaucratic approach to risk management in his post ‘reducing zero risk’.
The problem is that each decision to further reduce an already acceptably low risk is always defended as being ‘cheap’, but when you add up the increments it’s the death of a thousand cuts, because no one ever considers the aggregated opportunity cost.
This remorseless slide of our public and private institutions into a hysteria of risk aversion seems to me to be due to an inherent societal psychosis that nations sharing the English common law tradition are prone to. At best we end up with pointless safety theatre, at worst we end up bankrupting our culture.
I guess we’re all aware of the wave of texting while driving legislation, as well as recent moves in a number of jurisdictions to make the penalties more draconian. And it seems like a reasonable supposition that such legislation would reduce the incidence of accidents, doesn’t it?
Interestingly the study is circa 2011, but I’ve seen no reflection in Australia on the uncomfortable fact that it found, i.e. that all we are doing with such schemes is shifting the death rate to an older cohort. Of course all the adults can sit back and congratulate themselves on a job well done, except it simply doesn’t work, and worse yet it sucks resources and attention away from searching for more effective remedies.
In essence we’ve done nothing as a society to address teenage driving related deaths, safety theatre of the worst sort…
The testimony of Michael Barr, in the recent Oklahoma Toyota court case highlighted problems with the design of Toyota’s watchdog timer for their Camry ETCS-i throttle control system, amongst other things, which got me thinking about the pervasive role that watchdogs play in safety critical systems. The great strength of watchdogs is of course that they provide a safety mechanism which resides outside the state machine, which gives them fundamental design independence from what’s going on inside. By their nature they’re also simple and small scale beasts, thereby satisfying the economy of mechanism principle.
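As a sketch of the pattern, here's a watchdog in miniature. This is not Toyota's implementation, and it's Python rather than embedded C for brevity; the timeout value and safe-state action are caller-supplied assumptions. The point is the structure: the monitor runs outside the task it guards, and a hung task simply stops kicking it.

```python
import threading
import time

class Watchdog:
    """Minimal watchdog sketch: a monitor that runs outside the task it
    guards and forces a safe state if it isn't kicked within 'timeout'
    seconds."""

    def __init__(self, timeout, on_starve):
        self.timeout = timeout
        self.on_starve = on_starve          # the safe-state action
        self._last_kick = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._monitor, daemon=True)
        self._thread.start()

    def kick(self):
        """Called periodically by the healthy task to reset the timer."""
        with self._lock:
            self._last_kick = time.monotonic()

    def _monitor(self):
        while not self._stop.is_set():
            with self._lock:
                starved = time.monotonic() - self._last_kick > self.timeout
            if starved:
                self.on_starve()            # task has hung: go safe
                return
            time.sleep(self.timeout / 10)

    def stop(self):
        self._stop.set()
```

A real embedded watchdog is of course a hardware timer rather than a thread, and as the Barr testimony illustrates, where the watchdog resides and what failures it can actually detect is the crux of the design.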
An interesting post by Ross Anderson on the problems of risk communication, in the wake of the savage storm that the UK has just experienced. Doubly interesting to compare the UK’s disaster communication during this storm to that of the NSW government during our recent bushfires.
Or ‘On the breakdown of Bayesian techniques in the presence of knowledge singularities’
One of the abiding problems of safety critical ‘first of’ systems is that you face, as David Collingridge observed, a double bind dilemma:
Initially an information problem because ‘real’ safety issues (hazards) and their risk cannot be easily identified or quantified until the system is deployed, but
By the time the system is deployed you now face a power (inertia) problem, that is, control or change is difficult once the system is deployed or delivered. Eliminating a hazard is usually very difficult and we can only mitigate it in some fashion.
The name of the model made me smile, but this article, The Ethics of Uncertainty by Tannert, Elvers and Jandrig, argues that where uncertainty exists research should be considered as part of an ethical approach to managing risk.
Why saying the wrong thing at the wrong time is sometimes necessary
The Greens senator Adam Bandt has kicked up a storm of controversy amongst the running dogs of the press after pointing out in this Guardian article that climate change means a greater frequency of bad heat waves, which in turn means a greater frequency of bad bush fires. Read the article if you have a moment; I especially liked his invoking the shade of Ronald Reagan to judge the current government.
The consensus project: Yes there is one on climate change
Despite what you may see in the media, yes there is an overwhelming consensus on climate change (it’s happening), on what the cause is (our use of fossil fuels) and on what we can do about it (a whole bunch of things with today’s tech). Here’s the link to the project’s web page, neat infographics… enjoy.
Oh and if like me you live in Australia I’d start getting used to the increasing frequency of extreme weather events and bush-fires, the only uncertainty left is whether we can put the brakes on in time to prevent a complete catastrophe.
Taboo transactions and the safety dilemma

Again my thanks go to Ross Anderson over on the Light Blue Touchpaper blog for the reference, this time to a paper by Alan Fiske, an anthropologist, and Philip Tetlock, a social psychologist, on what they term taboo transactions. What they point out is that there are domains of sharing in society which each work on different rules: communal versus reciprocal obligations for example, or authority versus market. And within each domain we socially ‘transact’ trade-offs between equivalent social goods.
I was reading a post by Ross Anderson on his dismal experiences at John Lewis and ran across the term security theatre. I’d actually heard the term before, it was originally coined by Bruce Schneier, but this time it got me thinking about how much activity in the safety field is really nothing more than theatrical devices that give the appearance of achieving safety, but not the reality. From zero harm initiatives to hi-vis vests, from the stylised playbook of public consultation to the use of safety integrity levels that purport to show a system is safe, how much of this adds any real value? Worse yet, and as with security theatre, an entire industry has grown up around this culture of risk, which in reality amounts to a culture of risk aversion in western society. As I see it risk as a cultural concept is like fire, a dangerous tool and an even more terrible master.
In 1996 the European Space Agency lost their brand new Ariane 5 launcher on its first flight. Here’s a recently updated annotated version of that report. I’d also note that the software that faulted was written in Ada, a ‘strongly typed’ language, which does point to a few small problems with the use of such languages.
A point that Fred Brooks makes in his recent work The Design of Design is that it’s wiser to explicitly make specific assumptions, even if that entails guessing the values, rather than leave the assumption unstated and vague because ‘we just don’t know’. Brooks notes that while specific and explicit assumptions may be questioned, implicit and vague ones definitely won’t be. If a critical aspect of your design rests upon such fuzzy unarticulated assumptions, then the results can be dire.
Insist on using R = F x C in your assessment. This will panic HR (People go into HR to avoid nasty things like multiplication.)
Put “end of universe” as risk number 1 (Rationale: R = F x C. Since the end of the universe has an infinite consequence C, then no matter how small the frequency F, the Risk is also infinite)
Ignore all other risks as insignificant
Wait for call from HR…
A humorous note, amongst many, in an excellent presentation on the fell effect that bureaucracies can have upon the development of safety critical systems. I would add my own small corollary that when you see warning notes on microwaves and hot water services the risk assessment lunatics have taken over the asylum…
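For the record, the gag only works because of de Moivre's arithmetic; a throwaway sketch:

```python
import math

def risk(frequency, consequence):
    """Classical risk: R = F x C (de Moivre)."""
    return frequency * consequence

# An everyday entry in the register: small but finite.
assert risk(0.5, 10.0) == 5.0

# The 'end of universe' entry: an infinite consequence C swamps any
# finite frequency F, so the computed risk is infinite too.
assert math.isinf(risk(1e-30, math.inf))
```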
Presumably the use of the crew cab as an escape pod was not actually high on the list of design goals for the 4000 and 4100 class locomotives, and thankfully the locomotives involved in the recent derailment at Ambrose were unmanned.
One of the things that’s concerned me for a while is the potentially malign narrative power of a published safety case. For those unfamiliar with the term, a safety case can be defined as a structured argument supported by a body of evidence that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment. And I have not yet read a safety case that didn’t purport to be exactly that.
I was thinking about how the dubious concept of ‘safety integrity levels’ continues to persist in spite of protracted criticism. In essence, if the flaws in the concept of SILs are so obvious why do they still persist?
Another in the occasional series of posts on systems engineering, here’s a guide to evaluating technical risk, based on the degree of technical maturity of the solution.
The idea of using technical maturity as an analog for technical risk first appears (to my knowledge) in the 1983 Systems Engineering Management Guide produced by the Defense Systems Management College (1).
Using such analogs is not unusual in engineering; you usually find it practiced where measuring the actual parameter is too difficult. For example architects use floor area as an analog for cost during concept design because collecting detailed cost data at that point is not really feasible.
While you can introduce other analogs, such as complexity and interdependence, as a first pass assessment of inherent feasibility I’ve found the basic question of ‘have we done this before?’ to be a powerful one.
1. The 1983 edition is IMO the best of all the guides, with subsequent editions of the DSMC guide rather more ‘theoretic’ and not as useful, possibly because the 1983 edition was produced by the Lockheed Missile and Space Company’s Systems Engineering Directorate. Or to put it another way, it was produced by people who wrote about how they actually did their job… 🙂
I’ve just finished up the working week with a day long Safety Conversations and Observations course conducted by Dr Robert Long of Human Dymensions. A good, actually very good, course with an excellent balance between the theory of risk psychology and the practicalities of successfully carrying out safety conversations. I’d recommend it to any organisation that’s seeking to take their safety culture beyond systems and paperwork. Although he’s not a great fan of engineers. 🙂
While I’m on the subject of visualising risk the Understanding Uncertainty site run by the University of Cambridge’s Winton Group gives some good examples of how visualisation techniques can present risk.
Resilience and common cause considered in the wake of hurricane Sandy
One of the fairly obvious lessons from Hurricane Sandy is the vulnerability of underground infrastructure such as subways, road tunnels and below grade service equipment to flooding events.
“The New York City subway system is 108 years old, but it has never faced a disaster as devastating as what we experienced last night”
NYC transport director Joseph Lhota
Yet despite the obviousness of the risk we still insist on placing such services and infrastructure below grade level. Considering actual rises in mean sea level, e.g. a 1 foot increase at Battery Park NYC since 1900, and those projected to occur this century, perhaps now is the time to recompute the likelihood and risk of storm surges overtopping defensive barriers.
While the safety community has developed a comprehensive suite of analysis and management techniques for system development, the techniques available to ensure the safe modification of systems are rather less prolific.
Which is odd when one considers that most systems spend the majority of their life in operation rather than development…
The following is an extract from Kevin Driscoll’s Murphy Was an Optimist presentation at SAFECOMP 2010. Here Kevin does the maths to show how a lack of exposure to failures over a small sample size of operating hours leads to a normalcy bias amongst designers and a rejection of proposed failure modes as ‘not credible’. The reason I find it of especial interest is that it gives, at least in part, an empirical argument to why designers find it difficult to anticipate the system accidents of Charles Perrow’s Normal Accident Theory. Kevin’s argument also supports John Downer’s (2010) concept of Epistemic accidents. John defines epistemic accidents as those that occur because of an erroneous technological assumption, even though there were good reasons to hold that assumption before the accident. Kevin’s argument illustrates that engineers as technological actors must make decisions in which their knowledge is inherently limited and so their design choices will exhibit bounded rationality.
In effect, the higher the dependability of a system the greater the mismatch between designer experience and system operational hours, and therefore the tighter the bounds on the rationality of design choices and their underpinning assumptions. The tighter the bounds, the greater the effect cognitive biases will have, e.g. falling prey to the normalcy bias. Of course there are other reasons for such bounded rationality; see Logic, Mathematics and Science are Not Enough for a discussion of these.
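Kevin's point can be roughed out with a simple Poisson model. The rate and exposure figures below are illustrative assumptions of mine, not his numbers: a failure mode rare enough that a designer will almost certainly never see it in a career is still near certain to appear across a fleet's operational life.

```python
import math

def p_at_least_one(rate_per_hour, exposure_hours):
    """Poisson model: probability of observing one or more failures,
    P = 1 - exp(-lambda * t)."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

RATE = 1e-7            # hypothetical failure mode rate, per operating hour
DESIGNER_HOURS = 1e4   # assumed career exposure to test and lab operation
FLEET_HOURS = 1e8      # assumed fleet-wide operational exposure

# The designer almost never sees it, so it's judged 'not credible'...
p_designer = p_at_least_one(RATE, DESIGNER_HOURS)  # ~0.001

# ...while the fleet is near certain to experience it.
p_fleet = p_at_least_one(RATE, FLEET_HOURS)        # ~0.99995
```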
The development of safety cases for complex safety critical systems
So what is a safety case? The term has achieved an almost quasi-religious status amongst safety practitioners, with its fair share of true believers and heretics. But if you’ve been given the job of preparing or reviewing a safety case, what’s the next step?
In June of 2011 the Australian Safety Critical Systems Association (ASCSA) published a short discussion paper on what they believed to be the philosophical principles necessary to successfully guide the development of a safety critical system. The paper identified eight management and eight technical principles, but do these principles do justice to the purported purpose of the paper?
The Risk of losing any sum is the reverse of Expectation; and the true measure of it is, the product of the Sum adventured multiplied by the Probability of the Loss.
Abraham de Moivre, De Mensura Sortis, 1711 in the Ph. Trans. of the Royal Society
One of the perennial challenges of system safety is that for new systems we generally do not have statistical data on accidents. High consequence events are, we hope, quite rare, leaving us with a paucity of information. So we usually end up basing any risk assessment upon low base rate data, and having to fall back upon some form of subjective (and qualitative) method of risk assessment. Risk matrices were developed to guide such qualitative risk assessments and decision making, and the form of these matrices is based on a mix of classical decision and risk theory. The matrix is widely described in safety and risk literature and has become one of the less questioned staples of risk management. Yet despite its long history you still see plenty of poorly constructed risk matrices out there, in both the literature and standards. So this post attempts to establish some basic principles of construction as an aid to improving the state of practice and understanding.
The iso-risk contour
Based on de Moivre’s definition we can define a series of curves that represent the same risk level (termed iso-risk contours) on a two dimensional surface. While this is mathematically correct, it’s difficult for people to use when making qualitative evaluations. So decision theorists took the iso-risk contours and zoned them into judgementally tractable cells (or bins) to form the risk matrix. In this risk matrix each cell notionally represents a point on the iso-risk curve, and steps in the matrix define the edges of ‘risk zones’. We can also plot the curve using log-log axes, which render the contours as straight lines, giving us a matrix that looks like the one below.
This binning is intended to make qualitative decisions as to severity or likelihood more tractable, as human beings find it easier to select qualitative values from amongst such bins. But unfortunately binning also introduces ambiguity into the risk assessment. If you look at the example given above you’ll see that the iso-risk contour runs through the diagonal of the cell, so ‘off diagonal’ the cell risk is lesser or greater depending on which side of the contour you’re looking at. So be aware that when you bin a continuous risk contour you pay for easier decision making with increased uncertainty in the underlying assigned risk.
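To see the ambiguity concretely, here's a sketch of decade binning. The matrix spans and bin widths are assumed for illustration: two hazards can land in the very same cell while their underlying risks differ twentyfold.

```python
import math

def bin_index(value, lo_exp, hi_exp):
    """Map a value to a decade (order of magnitude) bin, clamped to the
    matrix's assumed range of exponents."""
    e = math.floor(math.log10(value))
    return int(max(lo_exp, min(hi_exp, e)) - lo_exp)

def cell(frequency, severity):
    # Assumed spans: likelihood 1e-6..1e-1, severity 1e0..1e5.
    return (bin_index(frequency, -6, -1), bin_index(severity, 0, 5))

# Two hazards that land in the SAME cell...
assert cell(2e-4, 2e2) == cell(9e-4, 9e2)

# ...whose underlying risks R = F x C nonetheless differ by ~20x.
assert (9e-4 * 9e2) / (2e-4 * 2e2) > 20
```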
Scaling likelihood and severity
The next problem that faces us in developing a risk matrix is assigning a scale to the severity and likelihood bins. In the example above we have a linear (log-log) plot, so the value of each succeeding bin’s median point should go up by an order of magnitude. In qualitative terms this defines an ordinal scale: ‘probable’ in the example above is an order of magnitude more likely than ‘remote’, ‘catastrophic’ is an order of magnitude greater in severity than ‘critical’, and so on. There are two good reasons to adopt such a scaling. The first is that it’s an established technique to avoid qualitative under-estimation of values, people generally finding it easier to discriminate between values separated by an order of magnitude than on a linear scale. The second is that if we have a linear iso-risk matrix then (by definition) the scales should also be logarithmic to comply with de Moivre’s equation. Unfortunately, you’ll also find plenty of example linear risk contour matrices that unwittingly violate de Moivre’s equation, Australian Standard AS 4360 being one. While such matrices may reflect a conscious decision making strategy, for example sensitivity to extreme severities, they don’t reflect de Moivre’s theorem and the classical definition of risk, so where you depart from it you need to be clear as to why (dealing with non-ergodic risks is a good reason).
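A quick check of why the scaling matters. With order-of-magnitude bin medians (the values below are assumed for illustration), the cells along an anti-diagonal of the matrix all carry the same risk R = F × C, i.e. they sit on a single iso-risk contour, exactly as de Moivre's equation requires:

```python
# Assumed bin medians, one order of magnitude apart as argued above.
likelihood_medians = [1e-6, 1e-5, 1e-4, 1e-3]  # 'remote' ... 'probable'
severity_medians = [1e0, 1e1, 1e2, 1e3]        # 'negligible' ... 'catastrophic'

# Cells along an anti-diagonal share the same risk R = F x C,
# i.e. they all lie on one iso-risk contour.
diagonal = [likelihood_medians[i] * severity_medians[3 - i] for i in range(4)]
assert all(abs(r - 1e-3) < 1e-12 for r in diagonal)
```

Use a linear likelihood scale instead and the anti-diagonal no longer holds constant risk, which is the quiet flaw in the linearly scaled matrices mentioned above.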
Cell numbers and the 4 to 7 rule
Another decision in designing a risk matrix is how many cells to have: too few and the evaluation is too coarse, too many and the decision making becomes bogged down in detailed discrimination (which as noted above is hardly warranted). The usual happy compromise sits at around 4 to 7 bins on the vertical and horizontal. Adopting a logarithmic scale gives a wide field of discrimination for a 4 to 7 matrix, but again there’s a trade-off, this time between sizing the bins to reduce cognitive workload for the assessor and the information lost by doing so.
The semiotics of colour
The problem with a matrix is that it can only represent two dimensions of information on one graph. Thus a simple matrix may allow us to communicate risk as a function of frequency (F) and severity (S), but we still need a way to graphically associate decision rules with the identified risk zones. The traditional method is to use colour to designate the specific risk zones, and a colour key for the associated actions required. As the example above illustrates, the various decision zones of risk are colour coded, with the highest risks being given the most alarming colours. While colour has an inherently strong semiotic meaning, one intuitively understood by expert and lay person alike, there is also a potential trap: by setting the priorities using such a strong code we are subtly forcing an imperative. In these circumstances a colourised matrix can become a tool of persuasion rather than one of communication (Roth 2012). One should therefore carefully consider what form of communication the matrix is intended to support.
Ensure consistency of ordering
A properly formed matrix’s risk zones should also exhibit what is called ‘weak consistency’, that is, the qualitative ordering of risk as defined by the various (coloured) zones and their cells ranks risks (from high to low) in roughly the same way that a quantitative analysis would (Cox 2008). In practical terms what this means is that if you find there are areas of the matrix where the same effort will produce a greater or lesser improvement of risk when compared to another area (Clements 96) you have a problem of construction. You should also (in principle) never be able to jump across two risk decision zones in one movement. For example, if a mitigation only reduces the likelihood of occurrence of a hazard, we wouldn’t expect the risk reduction to change as the severity of the risk changed.
Dealing with boundary conditions
In some standards, such as MIL-STD-882, an arbitrary upper bound may be placed on severity, in which case what happens when a severity exceeds the ‘allowable’ threshold? For example, should we discriminate between a risk to more than one person and one to a single person? For specific domains this may not be a problem, but for others where mass casualty events are a possibility it may well be of concern. In this case, if we don’t wish to add columns to the matrix, we may define a sub-cell within our matrix to reflect that this is a higher level of risk than we’d allowed for. Alternatively we could define the severity for the ‘I’ column as a range of severities whose median point is the mandated severity; so for example the catastrophic hazard bin would range from 1 fatality to 10 fatalities. Looking at likelihood, one should also include a likelihood of ‘incredible’ so that risks that have been retired, and for which there is no credible likelihood of occurrence, can be recorded rather than deleted. After all, just because we’ve retired a hazard today doesn’t mean a subsequent design change or design error won’t resurrect it to haunt us.
Calibrating the risk matrix
Risk matrices are used to make decisions about risks and their acceptability. But how do we calibrate the risk matrix to represent an understandably acceptable risk? One way is to pick a specific bin and establish a calibrating risk scenario for it, usually drawn from the real world, for which we can argue the risk is considered broadly acceptable by society (Clements 96). So in the matrix above we could pick cell IE and equate it to an acceptable real world risk that could result in the death of a person (a catastrophic loss). For example, ‘the risk of death in a motor vehicle accident on the way to and from work on main arterial roads under all weather conditions, cumulatively over a 25 year working career‘. This establishes the edge of the acceptable risk zone by definition and allows us to define the other risk zones. In general it’s always a good idea to provide a description of what each qualitative bin means so that people understand the meaning. If you need to, you can also include numerical ranges for likelihood and severity, such as loss values in dollars, numbers of injuries sustained and so on.
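By way of a sketch of the calibration arithmetic, every figure below is an illustrative assumption of mine, not actuarial data; the point is only how a real-world scenario is turned into a number that anchors one cell of the matrix:

```python
# All figures are illustrative assumptions, not actuarial data.
risk_per_km = 5e-9        # assumed fatality risk per kilometre driven
km_per_day = 40           # assumed round-trip commute distance
days_per_year = 230       # assumed working days per year
years = 25                # working career, per the scenario above

lifetime_km = km_per_day * days_per_year * years   # 230,000 km
calibration_risk = risk_per_km * lifetime_km       # ~1.2e-3 over a career

# Cell IE is then declared to represent this cumulative probability of a
# catastrophic (fatal) outcome, and the other risk zones scale from it.
```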
Define your exposure
One should also consider and define the number of units, people or systems exposed; clearly there is a difference between the cumulative risk posed by, say, one aircraft and a fleet of one hundred aircraft in service, or between one bus and a thousand cars. What may be acceptable at an individual level (for example a road accident) may not be acceptable at an aggregated or societal level, and risk curves may need to be adjusted accordingly. MIL-STD-882C offers a simple example of this approach.
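A sketch of the aggregation effect, with an assumed loss rate and utilisation (both figures hypothetical): a risk that is invisible for one aircraft in one year becomes odds-on for a fleet over its service life.

```python
p_loss_per_flight_hour = 1e-7   # assumed individual aircraft loss rate
hours_per_year = 3000           # assumed annual utilisation per aircraft

def expected_losses(fleet_size, years):
    """Expected number of losses across a fleet over a period."""
    return p_loss_per_flight_hour * hours_per_year * fleet_size * years

# One aircraft over one year: ~0.0003 expected losses, invisible.
single = expected_losses(1, 1)

# A fleet of 100 over 25 years: 0.75 expected losses, i.e. society has
# a decent chance of seeing an accident from the 'same' individual risk.
fleet = expected_losses(100, 25)
```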
And then define your duration
Finally, and perhaps most importantly, you always need to define the duration of exposure for likelihood. Without it the statement is at best meaningless and at worst misleading, as different readers will have different interpretations. A 1 in 100 probability of loss over 25 years of life is a very different risk to a 1 in 100 over a single eight hour flight.
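If you need to compare such statements, convert each to a constant hazard rate. A sketch, assuming an exponential survival model (the model choice is itself an assumption):

```python
import math

def hourly_rate(p_loss, exposure_hours):
    """Back out a constant hazard rate from a probability of loss over an
    exposure period: p = 1 - exp(-lambda * t), so lambda = -ln(1 - p) / t."""
    return -math.log(1.0 - p_loss) / exposure_hours

# '1 in 100 over 25 years of life' (~219,000 hours)...
lam_life = hourly_rate(0.01, 25 * 365 * 24)

# ...versus '1 in 100 over a single eight hour flight'.
lam_flight = hourly_rate(0.01, 8)

# The same headline '1 in 100' hides hazard rates ~27,000 times apart.
assert lam_flight / lam_life > 20000
```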
A risk matrix is about making decisions, so it needs to support the user in that regard, but its use as part of a risk assessment should not be seen as a means of acquitting a duty of care. The principle of ‘so far as is reasonably practicable’ cares very little about risk in the first instance, asking only whether it would be considered reasonable to acquit the hazard. Risk assessments belong at the back end of a program when, despite our efforts, we have residual risks to consider as part of evaluating our efforts in achieving a reasonably practicable level of safety. A fact that modern decision makers should keep firmly in mind.
Clements, P. Sverdrup System Safety Course Notes, 1996.
One of the canonical design principles of the nuclear weapons safety community is to base the behaviour of safety devices upon fundamental physical principles. For example a nuclear weapon firing circuit might include capacitors that, in the event of a fire, will always fail open circuit, thereby safing the weapon. The safety of the weapon in this instance is assured by devices whose performance is based on well understood and predictable material properties.
I’ve recently been reading John Downer on what he terms the myth of mechanical objectivity. To summarise John’s argument, he points out that once the risk of an extreme event has been ‘formally’ assessed as being so low as to be acceptable, it becomes very hard for society and its institutions to justify preparing for it (Downer 2011).
Did the designers of the Japanese seawalls consider all the factors?
In an eerie parallel with the Blayais nuclear power plant flooding incident it appears that the designers of tsunami protection for the Japanese coastal cities and infrastructure hit by the 2011 earthquake did not consider all the combinations of environmental factors that go to set the height of a tsunami.
The Mississippi River’s Old River Control Structure, a National Single Point of Failure?
Given the recent events in Fukushima and our subsequent western cultural obsession with the radiological consequences, perhaps it’s appropriate to reflect on other non-nuclear vulnerabilities. A case in point is the Old River Control Structure erected by those busy chaps the US Army Corps of Engineers to control the path of the Mississippi to the sea. Well, as it turns out, trapping the Mississippi wasn’t really such a good idea…
Why more information does not automatically reduce risk
I recently re-read the article Risks and Riddles by Gregory Treverton on the difference between a puzzle and a mystery. Treverton’s thesis, taken up by Malcolm Gladwell in Open Secrets, is that there is a significant difference between puzzles, in which the answer hinges on a known missing piece, and mysteries, in which the answer is contingent upon information that may be ambiguous or even in conflict.
In a previous post I discussed that in HOT systems the operator will inherently be asked to intervene in situations that are unplanned for by the designer. As such situations are inherently not ‘handled’ by the system this has strong implications for the design of the human machine interface.
With a Bachelor’s in Mechanical Engineering and a Master’s in Systems Engineering, Matthew Squair is a principal consultant with Jacobs Australia. His professional practice is the assurance of safety, software and cyber-security, and he writes, teaches and consults on these subjects. He can be contacted at firstname.lastname@example.org