Archives For Safety

The practice of safety engineering in various high consequence industries.

Waaay back in 2002 Chris Holloway wrote a paper that used a fictional civil court case involving the hazardous failure of software to show that much of the expertise and received wisdom of software engineering was, by the standards of the US federal judiciary, junk science and at best opinion based.

Rereading the transcripts of Philip Koopman and Michael Barr in the 2013 Toyota spaghetti monster case, I am struck both by how little things have changed and by how far the actual state of the industry can lag the state of the practice, let alone the state of the art. Life recapitulates art I guess, though not in a good way.

Tweedle Dum and Dee (Image source: Wikimedia Commons)

Revisiting the Knight, Leveson experiments

In the through the looking glass world of high integrity systems, the use of N-version programming is often touted as a means to achieve extremely low failure rates without extensive V&V, on the strength of the postulated independence of failures in independently developed software. Unfortunately this is hokum, as Knight and Leveson amply demonstrated with their N-version experiments, but there may actually be advantages to N-versioning, although not quite those its proponents originally expected.
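Knight and Leveson's point can be sketched with a small simulation: if independently developed versions still tend to fail on the same 'difficult' inputs, a 2-of-3 voted system falls far short of the failure rate the independence assumption predicts. The model and every parameter below are purely illustrative, not figures from their experiments.

```python
import random

def triplex_failure_rate(n_trials=200_000, q=0.01, c=0.5, e=1e-4, seed=1):
    """Estimate the failure rate of a 2-of-3 majority-voted system where
    versions share 'difficult' inputs: a fraction q of inputs is hard, and
    on those each version fails with probability c; on easy inputs each
    version fails independently with a small probability e."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_trials):
        p = c if rng.random() < q else e  # per-version failure prob for this input
        if sum(rng.random() < p for _ in range(3)) >= 2:  # majority outvoted
            failures += 1
    return failures / n_trials

# What the independence assumption would predict from the same
# average per-version failure probability.
q, c, e = 0.01, 0.5, 1e-4
p = q * c + (1 - q) * e
predicted = 3 * p**2 * (1 - p) + p**3

observed = triplex_failure_rate()
print(f"independence predicts {predicted:.2e}, simulation gives {observed:.2e}")
```

With these toy numbers the simulated rate comes out well over an order of magnitude worse than the independence calculation, which is the qualitative shape of the experimental result.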

Continue Reading…

Hazard checklists

06/07/2014 — 1 Comment

As I had to throw together an example checklist for a course I’m running, here it is. I’ve also given a little commentary on the use, advantages and disadvantages of checklists. Enjoy. :)


The DEF STAN 00-55 monster is back!!

That’s right, moves are afoot to reboot the cancelled UK MOD standard for software safety, DEF STAN 00-55. See the UK SCSC’s Event Diary for an opportunity to meet and greet the writers. They’ll have the standard up for an initial look on-line sometime in July as well, so stay posted.

Continue Reading…

Cleveland street train overrun (Image source: ATSB)

The final ATSB report, sloppy and flawed

The ATSB has released its final report into the Cleveland street overrun and it’s disappointing, at least when it comes to how and why a buffer stop that actually materially contributed to an overrun came to be installed at Cleveland street station. I wasn’t greatly impressed by their preliminary report and asked some questions of the ATSB at the time (their response was polite but not terribly forthcoming), so I decided to see what the final report was like before sitting in judgement.

Continue Reading…

NASA safety handbook cover

Way, way back in 2011 NASA published the first volume of their planned two volume epic on system safety titled, strangely enough, “NASA System Safety Handbook Volume 1, System Safety Framework and Concepts for Implementation”, catchy eh?

Continue Reading…

Current practice in formal safety argument notations such as Goal Structuring Notation (GSN) or Claims-Argument-Evidence (CAE) relies on the practical argument model developed by the philosopher Toulmin (1958). Toulmin focused on the justificatory aspects of arguments rather than the inferential, and developed a model of these ‘real world’ arguments based on facts, conclusions, warrants, backing and qualifier entities.

Using Toulmin’s model, one can draw a conclusion from evidence, as long as the inference is warranted. Said warrant may be supported by additional backing, and may be contingent upon some qualifying statement. Importantly, one of the qualifier elements in practical arguments is what Toulmin called a ‘rebuttal’, that is some form of legitimate constraint that may be placed on the conclusion drawn; we’ll get to why that’s important in a second.

Toulmin Argumentation Example

You see, Toulmin developed his model so that one could actually analyse an argument, that is, argument in the verb sense of ‘we are having a safety argument’. Formal safety arguments in safety cases, however, are inherently advocacy positions, and the rebuttal part of Toulmin’s model finds no place in them. In the noun world of safety cases, argument is used in the sense of ‘there is the 12 volume safety argument on the shelf’, and if the object is to produce something rather than to discuss, then there’s no need for a claim and rebuttal pairing, is there?

In fact you won’t find an explicit rebuttal form in either GSN or CAE as far as I am aware; it seems that the very ‘idea’ of rebuttal has been pruned from the language of both. Of course it’s hard to express a concept if you don’t have a word for it, a nice little example of how the form of a language can control the conversation. Language is power, so they say.
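To make the missing element concrete, here’s a minimal sketch of Toulmin’s model as a data structure, using his classic Harry-in-Bermuda example. The class and field names are my own invention for illustration, not part of GSN or CAE.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToulminArgument:
    """One practical argument: evidence supports a claim via a warrant,
    itself supported by backing, qualified, and open to rebuttal."""
    evidence: List[str]
    claim: str
    warrant: str
    backing: Optional[str] = None
    qualifier: Optional[str] = None
    rebuttals: List[str] = field(default_factory=list)

    def is_contestable(self) -> bool:
        # The element the advocacy-only notations leave out.
        return bool(self.rebuttals)

harry = ToulminArgument(
    evidence=["Harry was born in Bermuda"],
    claim="Harry is a British subject",
    warrant="A man born in Bermuda will generally be a British subject",
    backing="The relevant statutes on British nationality",
    qualifier="presumably",
    rebuttals=["unless both his parents were aliens",
               "unless he has become a naturalised American"],
)
```

In this rendering a GSN or CAE argument is simply one with `rebuttals` forced to be empty, which is the pruning of the language the paragraph above describes.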

 

Well I can’t believe I’m saying this, but those happy clappers of the software development world, the proponents of Agile, Scrum and the like, might (grits teeth) actually have a point. At least when it comes to the development of novel software systems in circumstances of uncertainty, and possibly even for high assurance systems.

Continue Reading…

For those interested, here’s a draft of the ‘Fundamentals of system safety‘ module from a course that I teach on system safety. Of course if you want the full effect, you’ll just have to come along. :)

MH370 Satellite Image (Image source: AMSA)

MH370 and privileging hypotheses

The further away we move from whatever event initiated the disappearance of MH370, the less entanglement there is between circumstances and the event, and thus the more difficult it is to make legitimate inferences about what happened. In essence the signal-to-noise ratio decreases exponentially as the causal distance from the event increases; thus the best evidence is that which is intimately entwined with what was going on onboard MH370, and evidence obtained at greater distances in time or space is of lesser importance.

Continue Reading…

Triggered transmission of flight data

Continuing airsearch (Image source: Shen Ling REX)

“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”

If anything teaches us that the modern media is for the most part bat-shit crazy, the continuing whirlwind of speculation does so. Even the usually staid Wall Street Journal has got into the act with speculative reports that MH370 may have flown on for hours under the control of a person or persons unknown… sigh.

Continue Reading…

After the disappearance of MH370 without trace, I’d point out, again, that just as in the case of the AF447 disaster, had either floating black boxes or even just a cheap and cheerful locator buoy been fitted we would at least have something to work with (1). But apparently this is simply not a priority with the FAA or JAA. I’d note that ships have traditionally been fitted with hydrostatically released beacon transmitters, thereby ensuring their release from a sinking ship.

Undoubtedly we’ll go through the same regulatory minuet of looking at design concepts provided by one or more of the major equipment suppliers whose designs will, no surprise, also be complex, expensive and painful to retrofit thereby giving the regulator the perfect out to shelve the issue. At least until the next aircraft disappears. Let’s chalk it up as another great example of regulatory blindness, which I’m afraid is cold comfort to the relatives of those onboard MH370.

Notes

1. Depending on the jurisdiction, modern airliners do carry different types and numbers of Emergency Locator Transmitter (ELT) beacons. These are either fixed to the airframe or need to be deployed by the crew, meaning that in anything other than a perfect crash landing at sea they end up on the bottom with the aircraft. Sonar pingers attached to the ‘black box’ flight data and cockpit voice recorders can provide an underwater signal, but their range is limited, to about a thousand metres slant range or so.

Mars code: JPL and risk based design

Monument to the conquerors of space Moscow (Copyright)

Engineers as the agents of evolution

Continue Reading…

Reflecting on learning in the aftermath of disaster

There’s been a lot of ink expended on examinations of the causes of the Challenger disaster, whose anniversary passed quietly by yesterday, but are we really the wiser for it?

Continue Reading…

Silver Blaze (Image source: Strand magazine)

Gregory (Scotland Yard detective): “Is there any other point to which you would wish to draw my attention?”
Holmes: “To the curious incident of the dog in the night-time.”
Gregory: “The dog did nothing in the night-time.”
Holmes: “That was the curious incident.”

What you pay attention to dictates what you’ll miss

The point that the great detective was making was that the absence of something was the evidence which the Scotland Yard detective had overlooked. Holmes, of course, using imagination and intuition, did identify that this absence was in fact the vital clue. Such a plot device works marvellously well because almost all of us, like detective Gregory, fail to recognise that such an absence is actually ‘there’ in a sense, let alone that it’s important.

Continue Reading…

I guess we’re all aware of the wave of texting while driving legislation, as well as recent moves in a number of jurisdictions to make the penalties more draconian. And it seems like a reasonable supposition that such legislation would reduce the incidence of accidents doesn’t it?

Continue Reading…

Over on Emergent Chaos, there’s a post on the unintended consequences of doling out driving privileges to young drivers in stages.

Interestingly the study is circa 2011, but I’ve seen no reflection in Australia on the uncomfortable fact that the study found: that all we are doing with such schemes is shifting the death rate to an older cohort. Of course all the adults can sit back and congratulate themselves on a job well done, except that it simply doesn’t work, and worse yet it sucks resources and attention away from the search for more effective remedies.

In essence we’ve done nothing as a society to address teenage driving related deaths, safety theatre of the worst sort…

Toyota ECM (Image source: Barr testimony presentation)

Economy of mechanism and fail safe defaults

I’ve just finished reading the testimony of Phil Koopman and Michael Barr given for the Toyota un-commanded acceleration lawsuit. Toyota settled after they were found guilty of acting with reckless disregard, but before the jury came back with their decision on punitive damages, and I’m not surprised.

Continue Reading…

A slightly disturbing story of a car manufacturer, its software, and what happened when that software failed… or at least was presumed to have failed.


Why risk communication is tricky…

An interesting post by Ross Anderson on the problems of risk communication, in the wake of the savage storm that the UK has just experienced. Doubly interesting to compare the UK’s disaster communication during this storm with that of the NSW government during our recent bushfires.

Continue Reading…

One of the perennial issues in regulating the safety of technological systems is how prescriptively one should write the regulations. At one end of the spectrum is a rule based approach, where very specific norms are imposed and at least in theory there is little ambiguity in either their interpretation or application. At the other end you have performance standards, which are much more open-ended, allowing a regulator to make circumstance specific determinations as to whether the standard has been met. Continue Reading…

Fire day

17/10/2013 — Leave a comment


Woke to thick smoke in the city today, and things have not improved on what is turning into a bad fire day for us. Our airport at Newcastle is evacuated due to the Hank street fire that’s breached its containment lines and is now burning south east. There’s a fire at Killingworth to the west of the city and grass and scrub fires down the west side of the lake. More than eighty fires state wide, four with emergency warnings and three with watch and act alerts issued. And the wind is rising…

Titan launch (Image source: USAF)

The human face of nuclear weapons safety

Continue Reading…

Provided as part of the QR show bag for the CORE 2012 conference. The irony of a detachable cab being completely unintentional…

Battery post fire (Image source: NTSB)

The NTSB has released its interim report on the Boeing 787 JAL battery fire, and it appears that Boeing’s initial safety assessment had concluded that the only way in which a battery fire would eventuate was through overcharging. Continue Reading…

Cleveland street train overrun (Image source: ATSB)

The ATSB has released its preliminary report of its investigation into the Cleveland street overrun accident, which I covered in an earlier post, and it makes interesting reading.

Continue Reading…

4100 class crew escape pod #0

On the subject of near misses…

Presumably the use of the crew cab as an escape pod was not actually high on the list of design goals for the 4000 and 4100 class locomotives, and thankfully the locomotives involved in the recent derailment at Ambrose were unmanned.

Continue Reading…

That much beloved safety engineering handbook of the UK rail industry, the Yellow Book, is back. The handbook has been re-released as the international Engineering Safety Management handbook (iESM), found at yellowbook-rail.org.uk.

Re-development is being carried out by Technical Program Delivery Ltd and the original authoring team of Dr Rob Davis, Paul Cheeseman and Bruce Elliot.

As with the original this incarnation is intended to be advisory rather than mandatory, nor does it tie itself to a particular legislative regime.

Volume one of the iESM, containing the key processes in 36 pages, is now available free of charge from the iESM’s website. Enjoy.

Occasional readers of this blog might have noticed my preoccupation with unreliable airspeed and the human factors and system design issues that attend it. So it was with some interest that I read the recent paper by Sathy Silva of MIT and Roger Nicholson of Boeing on aviation accidents involving unreliable airspeed.

Continue Reading…

No, not the alternative name for this blog. :)

I’ve just given the post Pitch ladders and unusual attitude a solid rewrite adding some new material and looking a little more deeply at some of the underlying safety myths.

787 Lithium Battery (Image Source: JTSB)

But, we tested it? Didn’t we?

Earlier reports of the Boeing 787 lithium battery’s initial development indicated that Boeing engineers had conducted tests to confirm that a single cell failure would not lead to a cascading thermal runaway amongst the remaining cells. According to these reports their tests were successful, so what went wrong?

Continue Reading…

Well it sounded reasonable…

One of the things that’s concerned me for a while is the potentially malign narrative power of a published safety case.

Continue Reading…


As my parents in law live in Chelyabinsk I have to admit a personal interest in the recent Russian meteor impact. Continue Reading…

X-Ray of JAL Battery (Image Source: NTSB)

A bit more on Boeing’s battery woes…

The NTSB has released more pictures of the JAL battery, and there are some interesting conclusions that can be drawn from the evidence to date.

Continue Reading…


JAL JA829J Fire (Image Source: Stephan Savoia AP Photo)

Boeing’s Dreamliner program runs into trouble with lithium ion batteries

Lithium batteries’ performance in providing lightweight, low volume power storage has made them a ubiquitous part of modern consumer life. And high power density also makes them attractive in applications, such as aerospace, where weight and space are at a premium. Unfortunately lithium batteries are also very unforgiving if operated outside their safe operating envelope, and can fail in a spectacularly energetic fashion called a thermal runaway (1), as occurred in the recent JAL and ANA 787 incidents.

Continue Reading…

QR Train crash (Image Source: Bayside Bulletin )

It is a fact universally acknowledged that a station platform is invariably in need of a good buffer-stop….

On the 31st of January 2013 a QR commuter train slammed into the end of platform barrier at the Cleveland street station, overrode it and ran into the station structure before coming to rest.

While the media and QR have focused their attention on the reasons for the overrun, the failure of the station’s passive defences against end of track overrun is a more critical concern. Or to put it another way, why did an event as predictable as this result in the train overriding the platform, with potentially fatal consequences?

Continue Reading…

The Bielefeld system safety list archive is now active. Thanks go to Peter Ladkin, the gang at Bielefeld and Causalis.

Although you would expect a discipline like safety engineering to have a very well defined and agreed set of foundational concepts, strangely the definition of what constitutes a hazard (one such concept) remains elusive, with a range of standards introducing differing definitions.

Continue Reading…

Control checks

15/11/2012 — 1 Comment

Reading Capt. Richard De Crespigny’s account of the QF32 emergency, I noted with interest his surprise when on final approach the aircraft stall warnings sounded, although the same alarms had been silent when the landing had been ‘dry run’ at 4000 feet (p261 of QF32). Continue Reading…

QF 32 update

15/11/2012 — Leave a comment

Just finished updating my post on Lessons from QF 32 with more information from Capt. Richard De Crespigny’s account of the event (which I recommend). His account of the failures experienced provides a system level perspective of the loss of aircraft functions, that augments the preceding component and ECAM data.

This post is part of the Airbus aircraft family and system safety thread.

Resilience and common cause considered in the wake of hurricane Sandy

One of the fairly obvious lessons from Hurricane Sandy is the vulnerability of underground infrastructure such as subways, road tunnels and below grade service equipment to flooding events.

“The New York City subway system is 108 years old, but it has never faced a disaster as devastating as what we experienced last night”

NYC transport director Joseph Lhota

Yet despite the obviousness of the risk we still insist on placing such services and infrastructure below grade level. Considering actual rises in mean sea level, e.g. a 1 foot increase at Battery Park NYC since 1900, and those projected to occur this century, perhaps now is the time to recompute the likelihood and risk of storm surges overtopping defensive barriers.

Continue Reading…
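That recomputation can be sketched in a few lines. Treating the annual maximum surge as Gumbel-distributed, a rise in mean sea level simply eats into the barrier’s freeboard and multiplies the annual overtopping probability. The location and scale parameters below are illustrative stand-ins, not fitted Battery Park values.

```python
import math

def p_overtop(barrier_height_m, mean_sea_level_m, loc=1.0, scale=0.25):
    """Annual probability that a Gumbel-distributed annual-maximum surge
    exceeds the barrier. loc/scale would come from an extreme value
    analysis of tide-gauge records for the site in question."""
    freeboard = barrier_height_m - mean_sea_level_m
    x = (freeboard - loc) / scale
    return 1.0 - math.exp(-math.exp(-x))  # Gumbel exceedance probability

before = p_overtop(3.0, 0.0)
after = p_overtop(3.0, 0.3)   # ~1 ft rise in mean sea level

print(f"annual overtopping probability: {before:.2e} -> {after:.2e}")
```

With these toy numbers a 0.3 m rise roughly triples the annual overtopping probability, which is the point: a barrier sized against the 1900 baseline is quietly no longer the barrier you think it is.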

How do we assure safety when we modify a system?

While the safety community has developed a comprehensive suite of analysis and management techniques for system development, the techniques available to ensure the safe modification of systems are somewhat less prolific.

Which is odd when one considers that most systems spend the majority of their life in operation rather than development…

Continue Reading…

One of the recurring problems in running hazard identification workshops is being faced by a group whose members are passively refusing to engage in the process.

A technique that I’ve found quite valuable in breaking participants out of that mindset is TRIZ, the Theory of Inventive Problem Solving (teoriya resheniya izobretatelskikh zadach).

Continue Reading…

The following is an extract from Kevin Driscoll’s Murphy Was an Optimist presentation at SAFECOMP 2010. Here Kevin does the maths to show how a lack of exposure to failures over a small sample size of operating hours leads to a normalcy bias amongst designers and a rejection of proposed failure modes as ‘not credible’.

The reason I find it of especial interest is that it gives, at least in part, an empirical argument to why designers find it difficult to anticipate the system accidents of Charles Perrow’s Normal Accident Theory.

Kevin’s argument also supports John Downer’s (2010) concept of Epistemic accidents. John defines epistemic accidents as those that occur because of an erroneous technological assumption, even though there were good reasons to hold that assumption before the accident.

Kevin’s argument illustrates that engineers as technological actors must make decisions in which their knowledge is inherently limited and so their design choices will exhibit bounded rationality.

In effect, the higher the dependability of a system the greater the mismatch between designer experience and system operational hours, and therefore the tighter the bounds on the rationality of design choices and their underpinning assumptions. The tighter the bounds, the greater the effect cognitive biases will have, e.g. falling prey to the normalcy bias.
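Kevin’s basic sum is easy to reproduce. Assuming failures arrive as a Poisson process, the probability of witnessing a given failure mode at least once over an exposure T is 1 − e^(−λT); the rate and the exposure hours below are illustrative numbers only, not figures from his presentation.

```python
import math

def p_seen(rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of observing at least one occurrence of a failure
    mode, assuming a Poisson arrival process with the given rate."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

RATE = 1e-7  # a failure mode arising once per ten million operating hours

# A designer might accumulate a few tens of thousands of device-hours
# of lab and test exposure in a career; a large fleet racks up billions.
designer = p_seen(RATE, 2e4)
fleet = p_seen(RATE, 1e9)

print(f"designer ever sees it: {designer:.4f}")  # tiny, hence 'not credible'
print(f"fleet ever sees it:    {fleet:.4f}")     # near certain
```

The asymmetry is the whole point: a failure mode the designer has essentially no chance of ever observing is all but guaranteed to turn up somewhere in the fleet, so designer intuition about what is ‘credible’ is systematically miscalibrated.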

Of course there are other reasons for such bounded rationality, see Logic, Mathematics and Science are Not Enough for a discussion of these.

Continue Reading…


Give me warp speed Scotty! We’re blowing this disco!!

Continue Reading…

Just finished giving my post on Lessons from Nuclear Weapons Safety a rewrite.

The original post is, as the title implies, about what we can learn from the principle-based approach to safety adopted by the US DOE nuclear weapons safety community. Hopefully the rewrite will make it a little clearer; I can be opaque as a writer sometimes. :-)

P.S. I probably should look at integrating the 3I principles introduced into this post on the philosophy of safety critical systems.

Warsaw A320 Accident (Image Source: Unknown)

One of the questions that we should ask whenever an accident occurs is whether we could have identified the causes during design. And if we didn’t, is there a flaw in our safety process?

Continue Reading…

I’m currently reading Richard de Crespigny’s book on flight QF 32. In it he writes that he felt at one point that he was being overwhelmed by the number and complexity of ECAM messages. At that moment he recalled a quote from Gene Kranz, NASA’s flight director of Apollo 13 fame: “Hold it Gentlemen, Hold it! I don’t care about what went wrong. I need to know what is still working on that spacecraft.”

The crew of QF32 are not alone in experiencing the overwhelming flood of data that a modern control system can produce in a crisis. Their experience is similar to that of the operators of the Three Mile Island nuclear plant, who faced a daunting 100+ near simultaneous alarms, or more recently the crew of QF 72.

The take home point for designers is that, having carefully constructed a fault monitoring and management system, you also need to consider the situation where the damage to the system is so severe that the needs of the operator invert, and they need to know ‘what they’ve still got’, rather than what they don’t have.
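A minimal sketch of that inversion follows; the crisis threshold, the system names and the whole scheme are invented for illustration, not drawn from any real avionics monitor. Below the threshold the monitor reports faults as usual; above it, it reports what still works.

```python
def status_report(components, failed, crisis_fraction=0.3):
    """Report the fault list normally, but invert to a 'still working'
    list once damage is widespread enough that enumerating failures
    stops helping the operator."""
    failed = set(failed) & set(components)
    working = sorted(set(components) - failed)
    if len(failed) >= crisis_fraction * len(components):
        return ("STILL WORKING", working)
    return ("FAILED", sorted(failed))

systems = ["eng1", "eng2", "hyd_g", "hyd_y", "elec_ac1",
           "elec_ac2", "fuel_xfer", "slats", "brakes", "gear"]

# A single failure: the fault list is the useful view.
print(status_report(systems, ["hyd_g"]))

# Widespread damage: invert, and tell the crew what they've still got.
print(status_report(systems, ["hyd_g", "elec_ac1", "fuel_xfer", "slats"]))
```

The interesting design question is of course where to put the threshold, and whether the operator should be able to demand the inverted view on request rather than waiting for the monitor to decide for them.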

The term ‘never give up’ design strategy is bandied around in the fault tolerance community; the above lesson should form at least a part of any such strategy.

This post is part of the Airbus aircraft family and system safety thread.