Archives For Safety

The practice of safety engineering in various high consequence industries.

That much beloved safety engineering handbook of the UK rail industry, the Yellow Book, is back. The handbook has been re-released as the international Engineering Safety Management handbook (iESM); see yellowbook-rail.org.uk.

Re-development is being carried out by Technical Program Delivery Ltd and the original authoring team of Dr Rob Davis, Paul Cheeseman and Bruce Elliot.

As with the original, this incarnation is intended to be advisory rather than mandatory, and it does not tie itself to a particular legislative regime.

Volume one of the iESM, covering the key processes in 36 pages, is now available free of charge from the iESM’s website. Enjoy.

Occasional readers of this blog might have noticed my preoccupation with unreliable airspeed and the human factors and system design issues that attend it. So it was with some interest that I read the recent paper by Sathy Silva of MIT and Roger Nicholson of Boeing on aviation accidents involving unreliable airspeed.

Continue Reading…

Unusual attitude

28/02/2013

No, not the alternative name for this blog. 🙂

I’ve just given the post Pitch ladders and unusual attitude a solid rewrite, adding some new material and looking a little more deeply at some of the underlying safety myths.

787 Lithium Battery (Image Source: JTSB)

But, we tested it? Didn’t we?

Earlier reports on the initial development of the Boeing 787’s lithium battery indicated that Boeing engineers had conducted tests to confirm that a single cell failure would not lead to a cascading thermal runaway amongst the remaining cells. According to these reports their tests were successful, so what went wrong?

Continue Reading…

Well it sounded reasonable…

One of the things that’s concerned me for a while is the potentially malign narrative power of a published safety case. For those unfamiliar with the term, a safety case can be defined as a structured argument supported by a body of evidence that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given environment. And I have not yet read a safety case that didn’t purport to be exactly that.

Continue Reading…


As my parents-in-law live in Chelyabinsk I have to admit a personal interest in the recent Russian meteor impact. Continue Reading…

X-Ray of JAL Battery (Image Source: NTSB)

A bit more on Boeing’s battery woes…

The NTSB has released more pictures of the JAL battery, and there are some interesting conclusions that can be drawn from the evidence to date.

Continue Reading…


JAL JA829J Fire (Image Source: Stephan Savoia AP Photo)

Boeing’s Dreamliner program runs into trouble with lithium ion batteries

The performance of lithium batteries in providing lightweight, low volume power storage has made them a ubiquitous part of modern consumer life. Their high power density also makes them attractive in applications, such as aerospace, where weight and space are at a premium. Unfortunately lithium batteries are also very unforgiving if operated outside their safe operating envelope, and can fail in a spectacularly energetic fashion called a thermal runaway (1), as occurred in the recent JAL and ANA 787 incidents.
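To see why a runaway is so unforgiving, here’s a minimal sketch of the classic heat balance argument. Every parameter below is invented purely for illustration (none are 787 battery figures): exothermic self-heating grows exponentially with temperature while cooling grows only linearly, so once cooling degrades there may be no stable balance point at all.

```python
import math

# Illustrative Semenov-style heat balance for a single cell. Every
# parameter here is invented for illustration; none are 787 battery data.
A = 1.0e9      # pre-exponential self-heating factor [W]
EA = 5.0e4     # activation energy of the exothermic reactions [J/mol]
R = 8.314      # gas constant [J/(mol.K)]
T_AMB = 300.0  # ambient temperature [K]

def q_gen(temp):
    """Exothermic self-heating: grows exponentially with temperature."""
    return A * math.exp(-EA / (R * temp))

def q_loss(temp, h):
    """Heat rejected to the surroundings: grows only linearly."""
    return h * (temp - T_AMB)

def settles_at(h):
    """First temperature where cooling balances heating, else None."""
    temp = T_AMB + 0.5
    while temp < 500.0:
        if q_loss(temp, h) >= q_gen(temp):
            return temp
        temp += 0.5
    return None  # no stable balance below 500 K: thermal runaway

for h in (1.0, 0.3):  # healthy vs degraded cooling [W/K]
    result = settles_at(h)
    print(f"h = {h} W/K:",
          f"settles near {result:.0f} K" if result else "no balance - runs away")
```

With good cooling the cell settles a few degrees above ambient; degrade the cooling and the same cell never finds a balance point, and that divergence is the runaway.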

Continue Reading…

QR Train crash (Image Source: Bayside Bulletin)

It is a fact universally acknowledged that a station platform is invariably in need of a good buffer-stop….

On the 31st of January 2013 a QR commuter train slammed into the end-of-platform barrier at Cleveland station, overrode it and ran into the station structure before coming to rest.

While the media and QR have focused their attention on the reasons for the overrun, the failure of the station’s passive defenses against end-of-track overrun is a more critical concern. Or to put it another way, why did an event as predictable as this result in the train overriding the platform with potentially fatal consequences?

Continue Reading…

The Bielefeld system safety list archive is now active. Thanks go to Peter Ladkin, the gang at Bielefeld and Causalis.

Although you would expect a discipline like safety engineering to have a very well defined and agreed set of foundational concepts, strangely the definition of what a hazard is (one such concept) remains elusive, with a range of different standards introducing differing definitions.

Continue Reading…

Control checks

15/11/2012

Reading Capt. Richard de Crespigny’s account of the QF32 emergency I noted with interest his surprise when, on final approach, the aircraft stall warnings sounded, although the same alarms had been silent when the landing had been ‘dry run’ at 4,000 feet (p. 261 of QF32).  Continue Reading…

QF 32 update

15/11/2012

Just finished updating my post on Lessons from QF 32 with more information from Capt. Richard de Crespigny’s account of the event (which I recommend). His account of the failures experienced provides a system-level perspective on the loss of aircraft functions that augments the preceding component and ECAM data.

This post is part of the Airbus aircraft family and system safety thread.

Resilience and common cause considered in the wake of hurricane Sandy

One of the fairly obvious lessons from Hurricane Sandy is the vulnerability of underground infrastructure such as subways, road tunnels and below grade service equipment to flooding events.

“The New York City subway system is 108 years old, but it has never faced a disaster as devastating as what we experienced last night”

MTA Chairman Joseph Lhota

Yet despite the obviousness of the risk we still insist on placing such services and infrastructure below grade level. Considering actual rises in mean sea level, e.g. a one foot increase at Battery Park NYC since 1900, and those projected to occur this century, perhaps now is the time to recompute the likelihood and risk of storm surges overtopping defensive barriers.
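By way of a back-of-the-envelope illustration of what such a recomputation looks like, here’s a sketch in which the annual maximum surge is modelled as Gumbel distributed. The location, scale and barrier height below are assumed purely for illustration, not fitted Battery Park data.

```python
import math

# Annual maximum storm surge modelled as Gumbel distributed. The location,
# scale and barrier height are assumed for illustration, not fitted data.
MU, BETA = 1.5, 0.3  # Gumbel location and scale for annual max surge [m]
BARRIER = 3.0        # crest height of the defensive barrier [m]
RISE = 0.3           # roughly a one foot rise in mean sea level [m]

def return_period(height):
    """Mean years between surges overtopping the given height."""
    p_exceed = 1.0 - math.exp(-math.exp(-(height - MU) / BETA))
    return 1.0 / p_exceed

print(f"today:      1 in {return_period(BARRIER):.0f} years")
# A rise in mean sea level effectively lowers the barrier by the same amount.
print(f"after rise: 1 in {return_period(BARRIER - RISE):.0f} years")
```

On these invented numbers a one foot rise cuts the overtopping return period by roughly a factor of three, with no change in storm climatology at all.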

Continue Reading…

How do we assure safety when we modify a system?

While the safety community has developed a comprehensive suite of analysis and management techniques for system development, the techniques available to assure the safe modification of systems are somewhat less prolific.

Which is odd when one considers that most systems spend the majority of their life in operation rather than development…

Continue Reading…

One of the recurring problems in running hazard identification workshops is facing a group whose members are passively refusing to engage in the process.

A technique that I’ve found quite valuable in breaking participants out of that mindset is TRIZ, the Theory of Inventive Problem Solving (teoriya resheniya izobretatelskikh zadach).

Continue Reading…

The following is an extract from Kevin Driscoll’s Murphy Was an Optimist presentation at SAFECOMP 2010. Here Kevin does the maths to show how a lack of exposure to failures, over a small sample of operating hours, leads to a normalcy bias amongst designers and a rejection of proposed failure modes as ‘not credible’. The reason I find it of especial interest is that it gives, at least in part, an empirical argument for why designers find it difficult to anticipate the system accidents of Charles Perrow’s Normal Accident Theory. Kevin’s argument also supports John Downer’s (2010) concept of epistemic accidents, which John defines as those that occur because of an erroneous technological assumption, even though there were good reasons to hold that assumption before the accident. Kevin’s argument illustrates that engineers, as technological actors, must make decisions in which their knowledge is inherently limited and so their design choices will exhibit bounded rationality.

In effect, the higher the dependability of a system the greater the mismatch between designer experience and system operational hours, and therefore the tighter the bounds on the rationality of design choices and their underpinning assumptions. The tighter the bounds, the greater the effect cognitive biases will have, e.g. falling prey to the normalcy bias. Of course there are other reasons for such bounded rationality; see Logic, Mathematics and Science are Not Enough for a discussion of these.
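To give a feel for the scale of that mismatch, here’s my own rough sketch of Kevin’s style of arithmetic, with assumed figures:

```python
import math

def p_observe(rate_per_hour, exposure_hours):
    """Probability of seeing at least one such failure (Poisson model)."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

# Assumed, illustrative numbers: a rare fault mode at 1e-7 per hour,
# a career's worth of hands-on test exposure, and a deployed fleet.
rate = 1e-7            # failures per operating hour
career = 40_000        # roughly 20 years of lab and flight test time [hours]
fleet = 1_000_000_000  # hours a large fleet accumulates in service

print(f"designer's career: {p_observe(rate, career):.2%} chance of ever seeing it")
print(f"deployed fleet:    {p_observe(rate, fleet):.2%} chance of seeing it")
```

On these numbers a designer has well under a one percent chance of ever witnessing a failure mode that the deployed fleet is all but certain to encounter; fertile ground for rejecting it as ‘not credible’.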

Continue Reading…

Riding the rocket

22/10/2012


Give me warp speed Scotty! We’re blowing this disco!!

Continue Reading…

Just finished giving my post on Lessons from Nuclear Weapons Safety a rewrite.

The original post is, as the title implies, about what we can learn from the principle-based approach to safety adopted by the US DOE nuclear weapons safety community. Hopefully the rewrite will make it a little clearer; I can be opaque as a writer sometimes. 🙂

P.S. I should probably look at integrating the 3I principles introduced here into this post on the philosophy of safety critical systems.

Warsaw A320 Accident (Image Source: Unknown)

One of the questions that we should ask whenever an accident occurs is whether we could have identified the causes during design? And if we didn’t, is there a flaw in our safety process?

Continue Reading…

This post is part of the Airbus aircraft family and system safety thread.

I’m currently reading Richard de Crespigny’s book on flight QF 32. In it he writes that at one point he felt he was being overwhelmed by the number and complexity of ECAM messages. At that moment he recalled a quote from Gene Kranz, NASA’s flight director of Apollo 13 fame: “Hold it Gentlemen, Hold it! I don’t care about what went wrong. I need to know what is still working on that spacecraft.”

The crew of QF32 are not alone in experiencing the overwhelming flood of data that a modern control system can produce in a crisis. Their experience is similar to that of the operators of the Three Mile Island nuclear plant, who faced a daunting 100+ near-simultaneous alarms, or more recently the crew of QF 72.

The take-home point for designers is that, even if you’ve carefully constructed a fault monitoring and management system, you also need to consider the situation where the damage to the system is so severe that the needs of the operator invert: they need to know ‘what they’ve still got’, rather than what they don’t have.

The term ‘never give up design strategy’ is bandied around in the fault tolerance community; the above lesson should form at least a part of any such strategy.
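As a toy sketch of that inversion (emphatically not actual ECAM logic; the systems and threshold below are invented for illustration), a monitoring design might flip from reporting faults to reporting remaining capability once the damage passes some threshold:

```python
# Toy sketch of a 'never give up' reporting inversion. Not actual ECAM
# logic: the systems and the threshold below are invented for illustration.
SYSTEMS = {
    "hyd_green": False, "hyd_yellow": True, "eng_1": False, "eng_2": True,
    "gen_1": False, "gen_2": True, "alt_brakes": True,
}

def report(status, invert_at=0.4):
    failed = [name for name, ok in status.items() if not ok]
    if len(failed) / len(status) < invert_at:
        # Normal case: the operator wants to know what has failed.
        return "FAULTS: " + ", ".join(failed)
    # Severe damage: invert, and tell the crew what they've still got.
    working = [name for name, ok in status.items() if ok]
    return "STILL AVAILABLE: " + ", ".join(working)

print(report(SYSTEMS))  # 3 of 7 failed -> inverts to a capability summary
```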

The development of safety cases for complex safety critical systems

So what is a safety case? The term has achieved an almost quasi-religious status amongst safety practitioners, with its fair share of true believers and heretics. But if you’ve been given the job of preparing or reviewing a safety case, what’s the next step?

Continue Reading…

In June of 2011 the Australian Safety Critical Systems Association (ASCSA) published a short discussion paper on what they believed to be the philosophical principles necessary to successfully guide the development of a safety critical system. The paper identified eight management and eight technical principles, but do these principles do justice to the purported purpose of the paper?

Continue Reading…

Recently there’s been some robust discussion over on the Safety Critical Mail List at York regarding the utility of safety cases and performance based safety standards (as exemplified by the UK safety case regime) versus more prescriptive design standards (as exemplified by the aerospace industry FAR regulations). To provide one UK regulator’s perspective, here’s a presentation by Taf Powell, Director of the Offshore Division of the Health and Safety Executive’s Hazardous Industries Directorate, UK, on the state of safety cases in the UK offshore industry circa 2005. Of course his talk was well before the 2010 Deepwater Horizon disaster.

Continue Reading…

For those of you interested in such things, here’s a link to a draft copy of Professor Nancy Leveson’s latest book on system safety, Engineering a Safer World, and her STAMP methodology.

Like Safeware it looks to become another classic of the system safety canon.

The MIL-STD-882 lexicon of hazard analyses includes the System Hazard Analysis (SHA), which according to the standard:

“…examines the interfaces between subsystems. In so doing, it must integrate the outputs of the SSHA. It should identify safety problem areas of the total system design including safety critical human errors, and assess total system risk. Emphasis is placed on examining the interactions of the subsystems.”

MIL-STD-882C

This sounds reasonable in theory, and I’ve certainly seen a number of toy examples touted in various textbooks on what it should look like. But, to be honest, I’ve never really been convinced by such examples, hence this post.
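To be concrete about what the mechanical core of such an example looks like, here’s a hypothetical sketch, with invented subsystems and hazards, of cross-checking SSHA outputs across declared interfaces to enumerate the interactions an SHA would then have to assess. The hard part, of course, is everything this loop doesn’t capture.

```python
# Hypothetical sketch of the SHA integration step; the subsystems,
# interfaces and hazards below are invented for illustration.
sshas = {
    "battery":  ["cell thermal runaway", "overcharge"],
    "charger":  ["overvoltage output"],
    "avionics": ["loss of bus power"],
}
interfaces = [("charger", "battery"), ("battery", "avionics")]

# Pair every SSHA hazard with the subsystem on the far side of each
# declared interface: each pairing is an interaction the SHA must assess.
for a, b in interfaces:
    for near, far in ((a, b), (b, a)):
        for hazard in sshas[near]:
            print(f"assess: can '{hazard}' in {near} propagate across "
                  f"the {a}-{b} interface to affect {far}?")
```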

Continue Reading…


One of the canonical design principles of the nuclear weapons safety community is to base the behaviour of safety devices upon fundamental physical principles. For example a nuclear weapon firing circuit might include capacitors that, in the event of a fire, will always fail open circuit, thereby safing the weapon. The safety of the weapon in this instance is assured by devices whose performance is based on well understood and predictable material properties.
Continue Reading…

In an article published in the online magazine IEEE Spectrum, Eliza Strickland has charted the first 24 hours at Fukushima. A sobering description of the difficulty of the task facing the operators in the wake of the tsunami.

Her article identified a number of specific lessons about nuclear plant design, so in this post I thought I’d look at whether more general lessons for high consequence system design could be inferred in turn from her list.

Continue Reading…

A330 Right hand (1 & 3) AoA probes (Image source: ATSB)

In an earlier post I commented that in the QF72 incident the use of a geometric mean (1) instead of the arithmetic mean when calculating the aircraft’s angle of attack would have reduced the severity of the subsequent pitch over. Which leads into the more general subject of what to do when the real world departs from our assumptions about the statistical ‘well-formedness’ of data. The problem, in the case of measuring angle of attack on commercial aircraft, is that the left and right alpha sensors are not truly independent measures of the same parameter (2). With sideslip we cannot directly obtain a true angle of attack (AoA) from any single sensor (3), so we need to take the average (mean) of the measured AoA on either side of the fuselage (Gracey 1958) to determine the true AoA. Because of this variance between left and right we cannot use a median voting approach, as we can expect the two sensors on the right side to differ from the one sensor on the left. As a result we end up having to use the mean of two sensor values (one from each side) as an estimate of the resultant central tendency.
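To put a number on the claim in (1), here’s a quick sketch of how the two means respond when one vane spikes; the values are illustrative only, not QF72 flight data.

```python
import math

# One AoA vane spikes to a spurious high value while the other reads true.
# Illustrative values only, not QF72 flight data.
aoa_true, aoa_spike = 2.0, 50.0  # degrees

arith = (aoa_true + aoa_spike) / 2.0
geom = math.sqrt(aoa_true * aoa_spike)

print(f"arithmetic mean: {arith:.1f} deg")  # 26.0 - tracks the spike linearly
print(f"geometric mean:  {geom:.1f} deg")   # 10.0 - grows only as the square root
# Caveat: a geometric mean needs same-signed, non-zero inputs, which is one
# reason it isn't a simple drop-in replacement for the arithmetic mean.
```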

Continue Reading…

I’ve recently been reading John Downer on what he terms the Myth of Mechanical Objectivity. To summarise John’s argument, he points out that once the risk of an extreme event has been ‘formally’ assessed as being so low as to be acceptable, it becomes very hard for society and its institutions to justify preparing for it (Downer 2011).

Continue Reading…

Airbus’s side stick improves crew comfort and control, but is there a hidden cost?

This post is part of the Airbus aircraft family and system safety thread.

The Airbus FBW side stick flight control has vastly improved the comfort of aircrew flying the Airbus fleet, much as the original Airbus designers predicted (Corps 1988). But the implementation also expresses the Airbus approach to flight control laws, and that company’s implicit assumptions about the way in which humans interact with automation and each other. Here the record is more problematic.

Continue Reading…

Out of the Loop

14/08/2011

Out of the loop, aircrew and unreliable airspeed at high altitude

The BEA’s third interim report on AF 447 highlights the vulnerability of aircrew when their usually reliable automation fails in the challenging operational environment of high altitude flight.

This post is part of the Airbus aircraft family and system safety thread.

Continue Reading…

How the marking of a traffic speed hump provides a classic example of a false affordance and an unintentional hazard.

Continue Reading...

Thinking about the unintentional and contra-indicating stall warning signal of AF 447, I was struck by the common themes between AF 447 and the Titanic. In both cases the design teams produced a vehicle compliant with the regulations of the day, but in both an implicit design assumption as to how the system would be operated was invalidated.

Continue Reading...

The BEA third interim report on the AF 447 accident raises questions

So I’ve read the BEA report from one end to the other and overall it’s a solid and creditable effort. The report will probably disappoint those who are looking for a smoking gun; once again we see a system accident in which the outcome derives from a complex interaction of system, environment, circumstance and human behavior.

However I do consider that the conclusions, and therefore recommendations, are hasty and incomplete.

This post is part of the Airbus aircraft family and system safety thread.

Continue Reading…

Why something as simple as control stick design can break an aircrew’s situational awareness

One of the less often considered aspects of situational awareness in the cockpit is the element of knowing what the ‘guy in the other seat is doing’. This is a particularly important part of cockpit error management because without a shared understanding of what someone is doing it’s kind of difficult to detect errors.

Continue Reading…

Air France Tail plane (Image Source: Agencia Brasil CCLA 2.5)

Requirements completeness and the AF447 stall warning

Reading through the BEA’s precis of the data contained on AF447’s flight data recorder you find that during the final minutes of AF447 the aircraft’s stall warning ceased, even though the aircraft was still stalled, thereby removing a significant cue to the aircrew that they had flown the aircraft into a deep stall.
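Reduced to a sketch, the requirements gap looks something like the following. This is hypothetical illustrative code, not Airbus’s implementation, with low airspeed validity gating along the lines described in the BEA’s account:

```python
# Hypothetical sketch of the requirements gap, not actual Airbus code.
AOA_STALL_THRESHOLD = 10.0  # illustrative stall AoA [deg]
MIN_VALID_AIRSPEED = 60.0   # below this the AoA inputs are flagged invalid [kt]

def stall_warning(aoa_deg, airspeed_kt):
    if airspeed_kt < MIN_VALID_AIRSPEED:
        return False  # inputs 'invalid', so no warning - even in a deep stall
    return aoa_deg > AOA_STALL_THRESHOLD

print(stall_warning(aoa_deg=40.0, airspeed_kt=55.0))   # False: deeply stalled, warning silent
print(stall_warning(aoa_deg=40.0, airspeed_kt=100.0))  # True: push the nose down and it returns
```

The requirement ‘warn when stalled’ was complete only inside the envelope where the inputs were deemed valid; outside it, the warning’s silence actively contra-indicated.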

Continue Reading…

A small question for the ATSB

According to the preliminary ATSB report the crew of QF32 took approximately 50 minutes to process all the Electronic Centralised Aircraft Monitor (ECAM) messages. This was despite the normal crew of three being augmented by a check captain in training and a senior check captain.

Continue Reading…

Why NASA is like British Rail

Well, more precisely, the structural changes that the American space program is undergoing are akin to those that the British rail industry underwent during the 1980s.

Space transportation in the US has been fundamentally defined by NASA, a large, government owned, monolithic, monopolistic, vertically integrated organisation. Sound familiar? It ought to; the same description could be applied to the United Kingdom’s British Rail of the 1980s.

Continue Reading…

Soviet Shuttle was safer by design

According to veteran Russian cosmonaut Oleg Kotov, quoted in a New Scientist article, the Soviet Buran shuttle (1) was much safer than the American shuttle due to fundamental design decisions. Kotov’s comments once again underline the importance to safety of architectural decisions made in the early phases of a design.

Continue Reading…

Because they typically have unity ratio (1:1) pitch scales, aircraft primary flight displays provide a pitch display that is limited by the vertical field of view. Such a display can move very rapidly and be difficult to use in unusual attitude recoveries, becoming another adverse performance shaping factor for aircrew in such a scenario. Trials by the USAF have conclusively demonstrated that an articulated style of pitch ladder can reduce the disorientation of aircrew in such situations.

Continue Reading...

Planes and Trains

05/07/2011

I attended the annual Rail Safety conference for 2011 earlier in the year, and one of the speakers was Group Captain Alan Clements, the Director of Defence Aviation and Air Force Safety. His presentation was interesting both in where the ADO is going with their aviation safety management system and in providing some historical perspective, and statistics.

Continue Reading...

One of my somewhat perennial concerns when reviewing a functional hazard analysis (FHA) is what’s termed the completeness question: whether all the potentially hazardous functional failure modes have been considered, and to what degree? Continue Reading…
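One partial answer to the completeness question above is to make the enumeration mechanical, applying a fixed set of guide words to every function so that at least no class of functional failure mode is silently skipped. A sketch, with invented functions and the usual loss/unintended/incorrect guide words:

```python
# Sketch of guide word driven FHA prompts; the functions are invented
# examples, and the guide words follow the usual loss/unintended/incorrect lines.
GUIDE_WORDS = ["loss of function", "unintended function",
               "incorrect function (too much, too little, too late)"]
FUNCTIONS = ["provide wheel braking", "indicate airspeed", "warn of stall"]

for fn in FUNCTIONS:
    for gw in GUIDE_WORDS:
        print(f"consider: {gw} of '{fn}' - hazardous? in which flight phases?")
```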

James Reason would classify this as a violation rather than an error

Continue Reading…

Fighter Cockpit Rear View Mirror

What the economic theory of sunk costs tells us about plan continuation bias

Plan continuation bias is a recognised and subtle cognitive bias that tends to force the continuation of an existing plan or course of action even in the face of changing conditions. In the field of aerospace it has been recognised as a significant causal factor in accidents, with a 2004 NASA study finding that aircrew exhibited this behavioural bias in 9 of the 19 accidents studied. One explanation of this behaviour may be a version of the well known ‘sunk cost’ economic heuristic.
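A toy expected-cost comparison, with invented numbers, shows why the heuristic is irrational: whatever has already been spent appears on both sides of the ledger and so can never change the ranking, yet it is exactly what the bias anchors on.

```python
# Toy expected-cost comparison; all the numbers are invented.
sunk = 90                   # fuel and time already spent on the approach
continue_extra = 10         # expected further cost of pressing on...
continue_risk = 0.2 * 500   # ...plus a 20% chance of a 500-unit mishap cost
divert_extra = 60           # certain extra cost of breaking off and diverting

print("continue:", sunk + continue_extra + continue_risk)  # 200
print("divert:  ", sunk + divert_extra)                    # 150
# The sunk 90 appears in both totals, so it cannot change the ranking -
# yet it is precisely what the sunk cost heuristic anchors on.
```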

Continue Reading…

What the Cry Wolf effect tells us about pilots’ problems with unreliable air data

In a recurring series of incidents air crew have consistently demonstrated difficulty in first identifying and then dealing with unreliable air data and warnings. To me, figuring out why this difficulty occurs is essential to addressing what has become a significant issue in air safety.
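Part of the answer may be that the ‘cry wolf’ effect has a simple Bayesian core: when the hazard is rare and the alarm is false-alarm prone, most alarms a crew will ever hear are false, so discounting them is locally quite rational. A sketch with assumed rates:

```python
# Bayes: how believable is an alarm? All rates are assumed for illustration.
p_hazard = 1e-5              # prior probability the hazard is actually present
p_alarm_if_hazard = 0.99     # the alarm's sensitivity
p_alarm_if_no_hazard = 1e-3  # its false alarm rate

p_alarm = (p_alarm_if_hazard * p_hazard
           + p_alarm_if_no_hazard * (1 - p_hazard))
p_hazard_given_alarm = p_alarm_if_hazard * p_hazard / p_alarm

print(f"P(hazard | alarm) = {p_hazard_given_alarm:.1%}")  # ~1.0%
# On these rates roughly 99 in every 100 alarms are false: crews that
# discount the alarm are responding rationally to its track record.
```

Continue Reading…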

Hindsight and AF447

23/06/2011

AF A330-200 F-GZCP (Image Source: P. Kierzkowski)

Knowing the outcome of an accident flight does not ‘explain’ the accident

Hindsight bias, and its mutually reinforcing cognitive cousin the just world hypothesis, are traditional parts of public comment on a major air accident investigation when pilot error is revealed as a causal factor. The public comment in various forums after the release of the BEA’s precis on AF447 is no exception.

This post is part of the Airbus aircraft family and system safety thread.

Continue Reading…

The BEA has released a precis of the data contained on AF447’s Flight Data Recorder, and we can now look into the cockpit of AF447 during those last terrifying minutes.

Continue Reading...

Over the years a recurring question raised about the design of FBW aircraft has been whether pilots, constrained by software embedded protection laws, really have the authority to do what is necessary to avoid an accident. But this question falls into the trap of characterising the software as an entity in and of itself. The real question is: should the engineers who developed the software be the final authority?

Continue Reading...

Earthquake and tsunami damage to Fukushima Daiichi Unit 1 (Image Source: Digital Globe)

Bernard Sieker of Bielefeld University has put together graphs of the just-released TEPCO plant instrumentation data for the Fukushima Daiichi plant (1).

The operator, TEPCO, had apparently heavily instrumented the plant prior to the tsunami.

My thanks to his colleague Peter Ladkin for publishing this link on the York Safety Critical Mailing List. As Peter points out, the data is instructive.

So, class homework: put together an event timeline against the instrument data, extra points for pictures. 🙂 Continue Reading…