It is highly questionable whether total system safety is always enhanced by allocating functions to automatic devices rather than human operators, and there is some reason to believe that flight-deck automation may have already passed its optimum point.
Archives For Psychology
Human psychology and the role it plays in decision making under uncertainty.
If you want to know where Crew Resource Management as a discipline started, you need to read NASA Technical Memorandum 78482, “A Simulator Study of the Interaction of Pilot Workload With Errors, Vigilance, and Decisions”, by H.P. Ruffell Smith, the British-born physician and pilot. Before this study it was hours in the seat and line seniority that mattered when things went to hell. After it, the aviation industry started to realise that crews rose or fell on the basis of how well they worked together, and that a good captain got the best out of his team. Today, whether crews get it right, as they did on QF72, or terribly wrong, as they did on AF447, the lens through which we view their performance has been irrevocably shaped by Ruffell Smith’s work. From little seeds great oaks grow indeed.
The search for MH370 will end next Tuesday with the question of its fate no closer to resolution. There is perhaps one lesson we can glean from this mystery: when we have a two-man crew behind a terrorist-proof door there is a real possibility that disaster is check-riding the flight. As Kenedi et al. note in a 2016 study, five of the six recorded murder-suicide events by pilots of commercial airliners occurred after they were left alone in the cockpit; in the case of both Germanwings 9525 and LAM 470 this was enabled by one crew member being able to lock the other out of the cockpit. So while we don’t know exactly what happened onboard MH370, we do know that the aircraft was flown deliberately to some point in the Indian Ocean, and on the balance of probabilities that was done by one of the crew, with the other crew member unable to intervene, probably because they were dead.
As I’ve written before, the combination of small crew sizes to reduce costs and a secure cockpit to reduce hijacking risk increases the probability of one crew member being able to successfully disable the other and then do exactly as they like. Thus the increased hijacking security measures act as a perverse incentive for pilot murder-suicide, and may over the long run turn out to kill more people than the terrorism they guard against (1). Or to put it more brutally, murder and suicide are much more likely to be successful with small crew sizes, so these scenarios, however dark they may be, need to be guarded against in an effective fashion (2).
One way to guard against such common mode failures of the human is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to our affect-driven processing, with a sufficient grasp of theory of mind and the subtleties of human psychology and group dynamics to make usefully accurate predictions of what the crew will do next. With that insight goes the requirement for autonomy in vetoing illogical and patently hazardous crew actions, e.g. “I’m sorry Captain, but I’m afraid I can’t let you reduce the cabin air pressure to hazardous levels”. The really difficult problem is of course building something sophisticated enough to understand ‘hinky’ behaviour and then intervene. There are however other scenarios where some form of lesser AI would be of use. The Helios Airways depressurisation is a good example of an incident in which both flight crew were incapacitated, so a system that does the equivalent of “Dave! Dave! We’re depressurising, unless you intervene in 5 seconds I’m descending!” would be useful. Then there’s the good old scenario of both pilots falling asleep, as likely happened at Minneapolis, where something like “Hello Dave, I can’t help but notice that your breathing indicates that you and Frank are both asleep, so WAKE UP!” would be helpful. Oh, and someone to punch out a quick “Mayday” while the pilots are otherwise engaged would also help tremendously, as aircraft going down without a single squawk recurs again and again.
I guess I’ve slowly come to the conclusion that two-man crews, while optimised for cost, are distinctly sub-optimal when it comes to dealing with a number of human factors issues, and likewise sub-optimal when it comes to dealing with major ‘left field’ emergencies that aren’t in the QRH. Fundamentally, a dual redundant design pattern for people doesn’t really address the likelihood of what we might call common mode failures. While we probably can’t get another human crew member back in the cockpit, working to make the cockpit automation more collaborative and less ‘strong but silent’ would be a good start. And of course, if the aviation industry wants to keep making improvements in aviation safety then these are the sort of issues it’s going to have to tackle. Where is a good AI, or even an un-interruptible autopilot, when you really need one?
1. Kenedi (2016) found that from 1999 to 2015 there had been 18 cases of homicide-suicide involving 732 deaths.
2. No go alone rules are unfortunately only partially effective.
Kenedi, C., Friedman, S.H., Watson, D., Preitner, C., Suicide and Murder-Suicide Involving Aircraft, Aerospace Medicine and Human Performance, Aerospace Medical Association, 2016.
The Sydney Morning Herald published an article this morning that recounts the QF72 midair accident from the point of view of the crew and passengers; you can find the story at this link. I’ve previously covered the technical aspects of the accident here, the underlying integrative architecture program that brought us to this point here, and the consequences here. So it was interesting to reflect on the event from the human perspective. Karl Weick points out in his influential paper on the Mann Gulch fire disaster that small organisations, for example the crew of an airliner, are vulnerable to what he termed a cosmology episode, that is, a moment when one suddenly and deeply feels that the universe is no longer a rational, orderly system. In the case of QF72 this was initiated by the simultaneous stall and overspeed warnings, followed by the abrupt pitch over of the aircraft as the flight protection laws engaged for no reason.
Weick further posits that what makes such an episode so shattering is that both the sense of what is occurring and the means to rebuild that sense collapse together. In the Mann Gulch blaze the fire team’s organisation attenuated and finally broke down as the situation eroded, until at the end they could not comprehend the one action that would have saved their lives: to build an escape fire. In the case of air crew, they implicitly rely on the aircraft’s systems to ‘make sense’ of the situation, so a significant failure such as occurred on QF72 denies them both an understanding of what is happening and the ability to rebuild that understanding. Weick also noted that in such crises organisations are important as they help people to find order and meaning in ill-defined and uncertain circumstances, which has interesting implications when we look at the automation in the cockpit as another member of the team.
“The plane is not communicating with me. It’s in meltdown. The systems are all vying for attention but they are not telling me anything…It’s high-risk and I don’t know what’s going to happen.”
Capt. Kevin Sullivan (QF72 flight)
From this Weickian viewpoint we see the aircraft’s automation as both part of the situation (‘what is happening?’) and as a member of the crew (‘why is it doing that, can I trust it?’). Thus the crew of QF72 were faced with both a vu jàdé moment and the allied disintegration of the human-machine partnership that could help them make sense of the situation. The challenge the QF72 crew faced was not to form a decision based on clear data and well-rehearsed procedures from the flight manual; instead they faced a much more unnerving loss of meaning as the situation outstripped their past experience.
“Damn it! We’re going to crash. It can’t be true!” (copilot #1)
“But, what’s happening?” (copilot #2)
AF447 CVR transcript (final words)
Nor was this an isolated incident. One study of other such ‘unreliable airspeed’ events found that errors in understanding were both far more likely to occur than other error types and, when they did occur, much more likely to end in a fatal accident. In fact the study found that all accidents with a fatal outcome involved an error in detection or understanding, with the majority being errors of understanding. From Weick’s perspective, then, the collapse of sensemaking is the knock-out blow in such scenarios, as the last words of the Air France AF447 crew so grimly illustrate. Luckily, in the case of QF72 the aircrew were able to contain this collapse and rebuild their sense of the situation; in the case of other such failures, such as AF447, they were not.
With the NSW Rural Fire Service fighting more than 50 fires across the state, and the unprecedented hellish conditions set to deteriorate even further with the arrival of strong winds, the question of the day is: exactly how bad could this get? The answer is, unfortunately, a whole lot worse. That’s because as human beings we have difficulty thinking about and dealing with extreme events… to quote from this post, written in the aftermath of the 2009 Victorian Black Saturday fires.
So how unthinkable could it get? The likelihood of a fire versus its severity can be credibly modelled as a power law, a particular type of heavy-tailed distribution (Clauset et al. 2007). This means that extreme events in the tail of the distribution are far more likely than a Gaussian (the classic bell curve) distribution would predict. So while a mega-fire ten times the size of the Black Saturday fires is far less likely, it is not nearly as improbable as our intuitive availability heuristic would indicate. In fact it’s much worse than we might think: with heavy-tailed distributions you need to apply what’s called the mean excess function, which really translates to ‘the next worst event is almost always going to be much worse’…
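To see how much fatter a power-law tail is than a Gaussian one, here’s a minimal sketch. All the parameters (the Pareto exponent, scales, and the ‘typical event’ sizes) are invented for illustration, not fitted to any real fire data:

```python
import math

def pareto_survival(x, alpha=2.0, xmin=1.0):
    """P(X > x) for a Pareto (power law) distribution."""
    return (xmin / x) ** alpha if x >= xmin else 1.0

def gaussian_survival(x, mu=1.0, sigma=1.0):
    """P(X > x) for a Gaussian distribution."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

# How likely is an event ten times, or fifty times, the 'typical' size?
for scale in (10, 50):
    print(f"{scale}x event: power law {pareto_survival(scale):.2e}, "
          f"Gaussian {gaussian_survival(scale):.2e}")
```

Under these toy numbers the power law still gives a ten-times event a probability of about one in a hundred, while the Gaussian calls it essentially impossible, which is exactly the gap our intuition falls into.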
So how did we get to this? Simply put, the extreme weather we’ve been experiencing is a tangible, present-day effect of climate change. Climate change is not something we can leave to our children to worry about, it’s happening now. That half a degree rise in global temperature? Well, it turns out it supercharges the occurrence rate of extremely dry conditions and fattens the heavy tail of bushfire severity. Yes, we’ve been twisting the dragon’s tail and now it’s woken up…
2019 Postscript: Monday 11 November 2019 – NSW
And here we are in 2019, two years down the track from the fires of 2017, and tomorrow looks like being a beyond-catastrophic fire day. Firestorms are predicted.
The first fatality involving the use of Tesla’s autopilot* occurred last May. The Guardian reported that the autopilot sensors on the Model S failed to distinguish a white tractor-trailer crossing the highway against a bright sky, and the car promptly tried to drive under the trailer, with decapitating results. What’s emerged is that the driver had a history of driving at speed and of using the automation beyond the maker’s intent, e.g. operating the vehicle hands-off rather than hands-on, as the screen grab above indicates. Indeed, recent reports indicate that immediately prior to the accident he was travelling fast (maybe too fast) whilst watching a Harry Potter DVD. There also appears to be a community of like-minded spirits out there intent on seeing how far they can push the automation… sigh.
A requirements checklist that can be used to evaluate the adequacy of HMI designs in safety critical applications. Based on the work of Nancy Leveson and Matthew Jaffe.
Deconstructing a tail strike incident
On August 1 last year a Qantas 737-838 (VH-VZR) suffered a tail-strike while taking off from Sydney airport, and this week the ATSB released its report on the incident. The ATSB narrative is essentially that when working out the plane’s Takeoff Weight (TOW) on a notepad, the captain forgot to carry the ‘1’, which resulted in an erroneous weight of 66,400 kg rather than 76,400 kg. Subsequently the co-pilot made a transposition error when carrying out the same calculation on the Qantas iPad-resident On-board Performance Tool (OPT), in this case transposing 6 for 7 in the fuel weight, again resulting in 66,400 kg being entered into the OPT. A cross-check of the OPT-calculated Vref40 speed against that calculated by the FMC (which uses the aircraft Zero Fuel Weight (ZFW) rather than TOW to calculate Vref40) would have picked the error up, but the crew misinterpreted the check and so it was not performed correctly.
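The logic of that cross-check can be sketched in a few lines. Note the speed function below is a hypothetical linear stand-in, real Vref40 values come from aircraft performance tables, and the ZFW and fuel figures are illustrative, chosen only so they sum to the correct 76,400 kg:

```python
def vref40(weight_kg):
    """Hypothetical reference-speed lookup, monotonic in weight (knots)."""
    return 100 + weight_kg / 2000  # illustrative stand-in only

ZFW_KG = 62_000    # zero fuel weight entered into the FMC (illustrative)
FUEL_KG = 14_400   # fuel on board (illustrative)

tow_correct = ZFW_KG + FUEL_KG  # 76,400 kg, derived independently
tow_entered = 66_400            # the erroneous hand-entered OPT weight

fmc_speed = vref40(tow_correct)  # FMC path: weight from ZFW + fuel
opt_speed = vref40(tow_entered)  # OPT path: hand-entered TOW

# The cross-check: a mismatch beyond a small tolerance flags the error.
TOLERANCE_KT = 2.0
if abs(fmc_speed - opt_speed) > TOLERANCE_KT:
    print("Cross-check failed: re-verify take-off weight")
```

The value of the check lies in the two speeds being derived from independent inputs; a common error in both entries would of course defeat it, which is the common mode problem all over again.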
To err is human, but to really screw it up takes a team of humans and computers…
How did a state of the art cruiser operated by one of the world’s superpowers end up shooting down an innocent passenger aircraft? To answer that question (at least in part) here’s a case study, part of the system safety course I teach, that looks at some of the causal factors in the incident.
In the immediate aftermath of this disaster there was a lot of reflection, and work done, on how humans and complex systems interact. However one question that has so far gone unasked is simply this. What if the crew of the USS Vincennes had just used the combat system as it was intended? What would have happened if they’d implemented a doctrinal ruleset that reflected the rules of engagement that they were operating under and simply let the system do its job? After all it was not the software that confused altitude with range on the display, or misused the IFF system, or was confused by track IDs being recycled… no, that was the crew.
The law of unintended consequences
There are some significant consequences to the principle of reasonable practicability enshrined within the Australian WHS Act. The act is particularly problematic for risk-based software assurance standards, where risk is used to determine the degree of effort that should be applied. In part one of this three-part post I’ll discuss the implications of the act for the process industries’ functional safety standard IEC 61508; in the second part I’ll look at aerospace and its software assurance standard DO-178C; then finally I’ll try to piece together a software assurance strategy that is compliant with the Act.
Ladies and gentlemen you need to leave, like leave your luggage!
This has been another moment of aircraft evacuation Zen.
If you’re interested in observation selection effects, Nick Bostrom’s classic on the subject is (I now find out) available online here. A classic example of such an effect is Wald’s work on aircraft survivability in WWII. A naive observer would seek to protect those parts of the returning aircraft that were most damaged; Wald’s insight was that these were in fact the least critical areas of the aircraft, and that the areas not damaged should actually be the ones reinforced, because aircraft hit there never made it home.
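Wald’s selection effect is easy to reproduce in a toy simulation. The areas and survival probabilities below are entirely made up for illustration; the point is only the mechanism:

```python
import random

# Hits land uniformly over the airframe, but aircraft hit in critical
# areas rarely return, so the sample of returners shows damage
# concentrated in the NON-critical areas.
random.seed(1)

AREAS = ["engine", "cockpit", "fuselage", "wingtips"]
SURVIVAL = {"engine": 0.2, "cockpit": 0.1, "fuselage": 0.9, "wingtips": 0.95}

observed = {a: 0 for a in AREAS}
for _ in range(10_000):
    hit = random.choice(AREAS)            # hits are uniform
    if random.random() < SURVIVAL[hit]:   # only survivors are observed
        observed[hit] += 1

# The naive observer reinforces the most-damaged areas of the returners;
# Wald reinforces the least-damaged, where the hits were fatal.
print(observed)
```

Run it and the fuselage and wingtips dominate the observed damage even though every area was hit equally often, which is exactly the inference trap the naive observer falls into.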
The NTSB have released their final report on the Boeing 787 Dreamliner Li-Ion battery fires. The report makes interesting reading, but for me the most telling point is summarised in conclusion seven, which I quote below.
Conclusion 7. Boeing’s electrical power system safety assessment did not consider the most severe effects of a cell internal short circuit and include requirements to mitigate related risks, and the review of the assessment by Boeing authorized representatives and Federal Aviation Administration certification engineers did not reveal this deficiency.
NTSB/AIR-14/01 (p. 78)
In other words, Boeing got themselves into a position with their safety assessment where their ‘assumed worst case’ was far less severe than the reality. This failure to imagine the worst ensured that when they aggressively weight-optimised the battery design, instead of thermally optimising it, the risks they were actually running were unwittingly much higher.
The first principle is that you must not fool yourself, and you are the easiest person to fool.
Richard P. Feynman
I’m also thinking that the behaviour of Boeing is consistent with what McDermid et al. call probative blindness. That is, the safety activities that were conducted were intended to comply with regulatory requirements rather than to actually determine what hazards existed and what their risk was.
… there is a high level of corporate confidence in the safety of the [Nimrod aircraft]. However, the lack of structured evidence to support this confidence clearly requires rectifying, in order to meet forthcoming legislation and to achieve compliance.
Nimrod Safety Management Plan 2002 (1)
As the quote from the Nimrod program deftly illustrates, often (2) safety analyses are conducted simply to confirm what we already ‘know’, that the system is safe; non-probative, if you will. In these circumstances the objective is compliance with the regulations rather than the generation of evidence that our system might be unsafe. In such circumstances doing more or better safety analysis is unlikely to prevent an accident, because the evidence will not cause beliefs to change. Belief, it seems, is a powerful thing.
The Boeing battery saga also illustrates how much regulators like the FAA actually rely on the technical competence of those being regulated, and how fragile that regulatory relationship is when it comes to dealing with the safety of emerging technologies.
1. As quoted in Probative Blindness: How Safety Activity Can Fail to Update Beliefs About Safety, A.J. Rae, J.A. McDermid, R.D. Alexander, M. Nicholson (IET SSCS Conference 2014).
2. Actually in aerospace I’d assert that it’s normal practice to carry out hazard analyses simply to comply with a regulatory requirement. As far as the organisation commissioning them is concerned the results are going to tell them what they know already, that the system is safe.
Finding MH370 is going to be a bitch
The aircraft has gone down in an area which is the undersea equivalent of the eastern slopes of the Rockies, well before anyone mapped them. Add to that a search area of thousands of square kilometres in about as isolated a spot as you can imagine, a search zone interpolated from satellite pings, and you can see that it’s going to be tough.
As I was asked a question on risk homeostasis at the course I’m teaching, here, without further ado, is John Adams’ tour de force on the failure of seat belt legislation: “Collectively, the group of countries that had not passed seat belt laws experienced a greater decrease than the group that had passed laws.” Now John doesn’t directly draw the conclusion, but I will: the seat belt laws kill more people than they save.
And it gets worse. In 1989 the British Government made seat belt wearing compulsory for children under 14 years old in the rear seats of cars; the result? In the year after, there was an increase of almost 10% in the number of children killed in rear seats, and of almost 12% in the number injured (both above background increases). Had the legislation not been enacted there would be young adults walking around today enjoying their lives, but of course it was passed and we have to live with the consequences.
Now I could forgive the well-intentioned who passed these laws if, when it became apparent that they were having a completely contrary effect, they had repealed them. But what I can’t forgive is the blind persistence in practices that clearly kill more than they save. What can we make of this depraved indifference, other than that people and organisations will sacrifice almost anything and anyone rather than admit they’re wrong?
Well, I can’t believe I’m saying this, but those happy clappers of the software development world, the proponents of Agile, Scrum and the like, might (grits teeth) actually have a point. At least when it comes to the development of novel software systems in circumstances of uncertainty, and possibly even for high assurance systems. Are you insane, I hear you ask!? Well quite possibly, my mother never had me tested, but leaving that aside we should always take the opportunity to look carefully at the unquestioned certitudes of the discipline when a fresh perspective arises.
And in this case the Agile crowd take aim at, and attempt to skewer, one of the central tenets of software engineering: that plans (and documents) are inherently good. So let’s stop and ask ourselves why we think this is so. Certainly you might need some sort of plan for the delivery of 10+ million lines of code on the JSF, but do we really think that a vision of the final product was visited upon the management team on day zero, and the rest was filling in the gaps?
If the honest answer is ‘no’ to that question on day one, followed by ‘we really don’t know’ on day two, then perhaps the wisest thing is not to sit down and write a 200-page software plan based on the lessons of the past, but instead to focus upon what Karl Weick called the process of organisational sense-making, and mindfulness. In this, Agile, which espouses collaboration, communication and responsiveness to change within a framework of small builds, may have a significant advantage over traditional grand strategies in situations of FUD (Fear, Uncertainty and Doubt). When you don’t know what to do, don’t sit down and plan what you don’t know; get people moving, talking, collaborating and making stuff. Then out of that activity will emerge the information that allows you to make decisions.
As Tom Peters points out we need to understand whether our methodologies have an inherent bias for action or a bias for planning, and then whether the situation is complex (but understood and stable) where planning will pay off or uncertain (with high novelty and volatility) where talking, thinking and looking at the small grain issues to build a picture of where we are is what we ought to be doing.
So for that insight at least, the Agile community gets two hallelujah testify moments out of five…
Mindfulness and paying attention to the wrong things
As I talked about in a previous post on the Deepwater Horizon disaster, I believe one of the underlying reasons, perhaps the reason, for Deepwater’s problems escalating into a catastrophe was the attentional blindness of management to the indicators of problems on the rig, and that this blindness was due in large part to a corporate focus on individual worker injury rates at the expense of thinking about those rare but catastrophic risks that James Reason calls organisational accidents. And, in a coincidence to end all coincidences, there was actually a high-level management team visiting just prior to the disaster to congratulate the crew on their seven years of injury-free operations.
So it was kind of interesting to read in James Reason’s latest work ‘A Life in Error’ his conclusion that the road to epic organisational accidents is paved with declining or low Lost Time Injury Frequency Rates (LTIFR). He goes on to give the following examples in support:
- Westray mining disaster (1992), Canada. 26 miners died, but the company had received an award for reducing its LTIFR.
- Moura mining disaster (1994), Queensland. 11 miners died. The company had halved its LTIFR in the four years preceding the accident.
- Longford gas plant explosion (1998), Victoria. Two died, eight injured. Safety effort was directed at reducing LTIFR rather than identifying and fixing the major hazards of un-repaired equipment.
- Texas City explosion (2005), Texas. The Independent Safety Review Panel identified that BP relied on injury rates to evaluate safety performance.
As Reason concludes, the causes of accidents that result in a direct (and individual) injury are very different to those that result in what he calls an organisational accident, that is, one that is both rare and truly catastrophic. Therefore data gathered on LTIFR tells you nothing about the likelihood of such a catastrophic event, and as it turns out can be quite misleading. My belief is that not only is such data misleading, its salience actively channelises management attention, thereby ensuring the organisation is effectively unable to see the indications of impending disaster.
So if you see an organisation whose operations can go catastrophically wrong, but all you hear from management is proud pronouncements as to how they’re reducing their loss time injury rate then you might want to consider maintaining a safe, perhaps very safe, distance.
Reason’s A Life in Error is an excellent read by the way; I give it four omitted critical procedural steps out of five. 🙂
How do we give meaning to experience in the midst of crisis?
Instead people strive to create a view of it by establishing a common framework into which events can be fitted to make sense of the world, what Weick (1993) calls a process of sensemaking. And what is true for individuals is also true for the organisations they make up. In turn, people also use an organisation to make sense of what’s going on, especially in situations of uncertainty, ambiguity or contradiction.
Gregory (Scotland Yard detective): “Is there any other point to which you would wish to draw my attention?”
Holmes: “To the curious incident of the dog in the night-time.”
Gregory: “The dog did nothing in the night-time.”
Holmes: “That was the curious incident.”
What you pay attention to dictates what you’ll miss
The point the great detective was making was that the absence of something was the evidence which the Scotland Yard detective had overlooked. Holmes, of course, using imagination and intuition, did identify that this was in fact the vital clue. Such a plot device works marvellously well because almost all of us, like Detective Gregory, fail to recognise that such an absence is actually ‘there’ in a sense, let alone that it’s important.
John Adams has an interesting take on the bureaucratic approach to risk management in his post reducing zero risk.
The problem is that each decision to further reduce an already acceptably low risk is always defended as being ‘cheap’, but when you add up the increments it’s the death of a thousand cuts, because no one ever considers the aggregated opportunity cost.
This remorseless slide of our public and private institutions into a hysteria of risk aversion seems to me to be due to an inherent societal psychosis that nations sharing the English common law tradition are prone to. At best we end up with pointless safety theatre; at worst we end up bankrupting our culture.
In a slight segue, I was reading Bruce Schneier’s blog on security and came across this post on the psychology behind fraud. Bruce points to this post on why, yes I know, ‘good people do bad things’. The explanation that researchers such as Ann Tenbrunsel of Notre Dame offer is that in the same way that we are boundedly rational in other aspects of decision making, so too are we in our ethical decisions.
In particular, the way in which decision problems were framed seems to have a great impact upon how we make decisions. Basically if a problem was framed without an ethical dimension then decision makers were much less likely to consider that aspect.
In addition to framing effects, researchers studying collusion in fraud cases found that most people seem to act from an honest desire simply to help others, regardless of any attendant ethical issues.
What fascinates me is how closely such research parallels the work on system safety and human error. Clearly, if management works within a frame based upon performance and efficiency they are simply going to overlook the downside completely, and in a desire to be helpful everyone else ‘goes along for the ride’.
There is, as I see it, a concrete recommendation that comes out of this research that we can apply to safety: fundamentally, safety management systems need to be designed to take account of our weaknesses as boundedly rational actors.
One of the perennial issues in regulating the safety of technological systems is how prescriptively one should write the regulations. At one end of the spectrum is a rule-based approach, where very specific norms are imposed and, at least in theory, there is little ambiguity in either their interpretation or application. At the other end you have performance standards, which are much more open-ended, allowing a regulator to make circumstance-specific determinations as to whether the standard has been met.
Taboo transactions and the safety dilemma
Again my thanks go to Ross Anderson over on the Light Blue Touchpaper blog for the reference, this time to a paper by Alan Fiske, an anthropologist, and Philip Tetlock, a social psychologist, on what they term taboo transactions. What they point out is that there are domains of sharing in society which each work on different rules; communal versus reciprocal obligations for example, or authority versus market. And within each domain we socially ‘transact’ trade-offs between equivalent social goods.
I was reading a post by Ross Anderson on his dismal experiences at John Lewis and ran across the term security theatre. I’d actually heard the term before, it was originally coined by Bruce Schneier, but this time it got me thinking about how much activity in the safety field is really nothing more than theatrical devices that give the appearance of achieving safety, but not the reality. From zero harm initiatives to hi-vis vests, from the stylised playbook of public consultation to the use of safety integrity levels that purport to show a system is safe, how much of this adds any real value? Worse yet, and as with security theatre, an entire industry has grown up around this culture of risk, which in reality amounts to a culture of risk aversion in western society. As I see it, risk as a cultural concept is like fire: a dangerous tool and an even more terrible master.
From Les Hatton, here’s how, in four easy steps:
- Insist on using R = F x C in your assessment. This will panic HR (People go into HR to avoid nasty things like multiplication.)
- Put “end of universe” as risk number 1 (Rationale: R = F x C. Since the end of the universe has an infinite consequence C, then no matter how small the frequency F, the Risk is also infinite)
- Ignore all other risks as insignificant
- Wait for call from HR…
A humorous note, amongst many, in an excellent presentation on the fell effect that bureaucracies can have upon the development of safety critical systems. I would add my own small corollary: when you see warning notes on microwaves and hot water services, the risk assessment lunatics have taken over the asylum…
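Step 2 of the recipe works because of how R = F × C behaves: any event assigned an unbounded consequence dominates the ranking no matter how small its frequency. A minimal sketch, with all the frequencies and consequences invented for illustration:

```python
# Risk register entries: (name, frequency per year, consequence).
risks = [
    ("end of universe", 1e-300, float("inf")),
    ("plant fire",      1e-3,   1e7),
    ("paper cut",       1e1,    1e0),
]

# R = F x C: rank by expected loss, highest first.
ranked = sorted(risks, key=lambda r: r[1] * r[2], reverse=True)
for name, f, c in ranked:
    print(f"{name}: R = {f * c}")
```

Any finite frequency multiplied by an infinite consequence is still infinite, so "end of universe" tops the register forever, which is the joke, and also a genuine weakness of naive expected-value risk ranking for extreme events.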
With apologies to the philosopher George Santayana, I’ll make the point that BMW’s Head-Up Display technology is in fact not the unalloyed blessing promised by BMW in their marketing material.
On the subject of near misses…
Presumably the use of the crew cab as an escape pod was not actually high on the list of design goals for the 4000 and 4100 class locomotives, and thankfully the locomotives involved in the recent derailment at Ambrose were unmanned.
Occasional readers of this blog might have noticed my preoccupation with unreliable airspeed and the human factors and system design issues that attend it. So it was with some interest that I read the recent paper by Sathy Silva of MIT and Roger Nicholson of Boeing on aviation accidents involving unreliable airspeed.
But, we tested it? Didn’t we?
Earlier reports on the initial development of the Boeing 787 lithium battery indicated that Boeing engineers had conducted tests to confirm that a single cell failure would not lead to a cascading thermal runaway amongst the remaining cells. According to these reports the tests were successful, so what went wrong?
Well it sounded reasonable…
One of the things that’s concerned me for a while is the potentially malign narrative power of a published safety case. For those unfamiliar with the term, a safety case can be defined as a structured argument supported by a body of evidence that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment. And I have not yet read a safety case that didn’t purport to be exactly that.
Why sometimes simpler is better in safety engineering.
I’ve just finished up the working week with a day long Safety Conversations and Observations course conducted by Dr Robert Long of Human Dymensions. A good, actually very good, course with an excellent balance between the theory of risk psychology and the practicalities of successfully carrying out safety conversations. I’d recommend it to any organisation that’s seeking to take their safety culture beyond systems and paperwork. Although he’s not a great fan of engineers. 🙂
One of the recurring problems in running hazard identification workshops is being faced by a group whose members are passively refusing to engage in the process.
A technique that I’ve found quite valuable in breaking participants out of that mindset is TRIZ, the Theory of Inventive Problem Solving (teoriya resheniya izobretatelskikh zadach).
The following is an extract from Kevin Driscoll’s Murphy Was an Optimist presentation at SAFECOMP 2010. Here Kevin does the maths to show how a lack of exposure to failures over a small sample of operating hours leads to a normalcy bias amongst designers and a rejection of proposed failure modes as ‘not credible’. The reason I find it of especial interest is that it gives, at least in part, an empirical argument for why designers find it difficult to anticipate the system accidents of Charles Perrow’s Normal Accident Theory. Kevin’s argument also supports John Downer’s (2010) concept of epistemic accidents. John defines epistemic accidents as those that occur because of an erroneous technological assumption, even though there were good reasons to hold that assumption before the accident. Kevin’s argument illustrates that engineers as technological actors must make decisions in which their knowledge is inherently limited and so their design choices will exhibit bounded rationality.
In effect, the higher the dependability of a system, the greater the mismatch between designer experience and system operational hours, and therefore the tighter the bounds on the rationality of design choices and their underpinning assumptions. The tighter the bounds, the greater the effect cognitive biases will have, such as falling prey to the normalcy bias. Of course there are other reasons for such bounded rationality; see Logic, Mathematics and Science are Not Enough for a discussion of these.
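The exposure mismatch at the heart of Kevin's argument can be sketched numerically. Assuming failures arrive as a Poisson process (my numbers below are illustrative, not Driscoll's), compare the chance that a designer ever witnesses a rare failure mode against the chance it appears somewhere in the fleet:

```python
import math

def p_at_least_one_failure(rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of observing one or more failures, assuming a
    Poisson process with the given constant failure rate."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

rate = 1e-7  # a rare failure mode: roughly once per ten million hours

# A designer might accumulate on the order of 10,000 hours of test and
# lab exposure over a career; a large fleet accumulates 100 million
# operating hours or more in service.
designer = p_at_least_one_failure(rate, 1e4)  # ~0.001, effectively never seen
fleet = p_at_least_one_failure(rate, 1e8)     # ~0.99995, near certain in service

print(f"designer: {designer:.4f}")
print(f"fleet:    {fleet:.4f}")
```

On these assumptions the designer has about one chance in a thousand of ever seeing the failure, while the fleet is almost certain to, which is exactly the gap that makes ‘not credible’ feel like sound engineering judgement right up until the accident.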
Just finished reading the excellent paper A Conundrum: Logic, Mathematics and Science Are Not Enough by John Holloway on the swirling currents of politics, economics and emotion that can surround and affect any discussion of safety. The paper neatly illustrates why the canonical rational-philosophical model of expert knowledge is inherently flawed.
What I find interesting as a practising engineer is that although everyday debates and discussions with our peers emphasise the subjectivity of engineering ‘knowledge’, as engineers we all still like to pretend and behave as if it were not.
The ‘Oh #%*!’ moment captured above definitely qualifies for the vigorous application of the rule that when the fire’s too hot, the water’s too deep or the smoke’s too thick, leave. 🙂
But in fact in this incident the pilot actually had to convince the navigator that he needed to leave ‘right now!’. The navigator, it turned out, was so fixated on shutting down the aircraft’s avionics systems that he didn’t realise how bad things were, nor recognise that immediate evacuation was the correct response.
In a recent NRCOHSR white paper on the Deepwater Horizon explosion, Professor Andrew Hopkins of the Australian National University argued that the Transocean and BP management teams visiting the rig on the day of the accident failed to detect the unsafe well condition because of biases in their audit practices.