Archives For Safety

The practice of safety engineering in various high consequence industries.

Black Saturday fires (Image source: ABC)

With the NSW Rural Fire Service fighting more than 50 fires across the state, and the unprecedented hellish conditions set to deteriorate even further with the arrival of strong winds, the question of the day is: exactly how bad could this get? The answer is, unfortunately, a whole lot worse. That’s because we human beings have difficulty thinking about and dealing with extreme events… To quote from a post I wrote in the aftermath of the 2009 Victorian Black Saturday fires:

So how unthinkable could it get? The likelihood of a fire versus its severity can be credibly modelled as a power law, a particular type of heavy-tailed distribution (Clauset et al. 2007). This means that extreme events in the tail of the distribution are far more likely than predicted by a Gaussian (the classic bell curve) distribution. So while a mega fire ten times the size of the Black Saturday fires is far less likely, it is not completely improbable, as our intuitive availability heuristic would have us believe. In fact it’s much worse than we might think: in heavy-tailed distributions you need to apply what’s called the mean excess heuristic, which really translates to: the next worst event is almost always going to be much worse…
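To put some rough numbers on that, here is a minimal Python sketch comparing a Pareto (power law) tail with a Gaussian one, and computing the mean excess for the Pareto case. The function names and parameter values are my own illustrative assumptions; nothing here is fitted to real fire data.

```python
import math

def pareto_survival(x, x_min, alpha):
    """P(X > x) for a Pareto (power-law) tail with tail index alpha, x >= x_min."""
    return (x_min / x) ** alpha

def gaussian_survival(x, mean, sd):
    """P(X > x) for a Gaussian with the given mean and standard deviation."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def pareto_mean_excess(u, alpha):
    """E[X - u | X > u] for a Pareto tail; only finite when alpha > 1."""
    if alpha <= 1:
        raise ValueError("mean excess is infinite for alpha <= 1")
    return u / (alpha - 1.0)

# Illustrative (assumed) parameters: an event ten units out in the tail.
print(pareto_survival(10.0, x_min=1.0, alpha=2.0))   # small, but far from negligible
print(gaussian_survival(10.0, mean=0.0, sd=1.0))     # vanishingly small
print(pareto_mean_excess(10.0, alpha=1.5))           # expected overshoot beyond 10
print(pareto_mean_excess(100.0, alpha=1.5))          # grows with the threshold
```

The point of the last two lines is that, for a power law, the expected overshoot grows in proportion to the threshold you have already exceeded, which is the mean excess heuristic in a nutshell.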

So how did we get to this? Well, simply put, the extreme weather we’ve been experiencing is a tangible, current day effect of climate change. Climate change is not something we can leave to our children to really worry about, it’s happening now. That half a degree rise in global temperature? Well, it turns out it supercharges the heavy tail of bushfire severity. Putting it even more simply, it looks like we’ve been twisting the dragon’s tail and now it’s woken up…

Screwtape (Image source: end time info)

More infernal statistics

Well, here we are again. Given recent developments in the infernal region it seems like a good time for another post. Have you ever, dear reader, been faced with the problem of how to achieve an unachievable safety target? Well worry no longer! Herewith is Screwtape’s patented man based mitigation medicine.

The first thing we do is introduce the concept of ‘mitigation’, ah what a beautiful word that is. You see, it’s saying that it’s OK that your system doesn’t meet its safety target, because you can claim credit for the action of an external mitigator in the environment. Probability-wise, if the probability of an accident is P_a, then P_a equals the product of your system’s failure probability P_s and the probability that some external mitigation also fails, P_m, or P_a = P_s × P_m.

So let’s use operator intervention as our mitigator, lovely and vague. But how to come up with a low enough P_m? Easy, we just look at the accident rate for this or a like system and assume that those accidents were due to operator mitigation being unsuccessful. Voilà, we get our really small numbers.

Now, an alert reader might point out that this is totally bogus and that P_m is actually the likelihood of operator failure when the system fails. Operators failing, as those pestilential authors of the WASH-1400 study have pointed out, is actually quite likely. But I say, if your customer is so observant and on the ball then clearly you are not doing your job right. Try harder or I may eat your soul, yum yum.
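For the alert reader, a minimal sketch of why the trick is bogus. The numbers below are entirely made up for illustration: the observed accident rate already includes the system failing, so reusing it as P_m counts P_s twice.

```python
def accident_probability(p_system, p_mitigation):
    """P_a = P_s x P_m, with P_m conditional on the system having failed."""
    return p_system * p_mitigation

# Hypothetical values, chosen only to make the arithmetic visible.
P_S = 1e-3        # probability the system fails
P_M_TRUE = 0.1    # probability the operator fails to mitigate, given a system failure

# What the historical record would actually show.
observed_accident_rate = accident_probability(P_S, P_M_TRUE)

# Screwtape's move: treat the observed accident rate as if it were P_m.
screwtape_claim = accident_probability(P_S, observed_accident_rate)

print(f"observed accident probability: {observed_accident_rate:.1e}")
print(f"claimed accident probability:  {screwtape_claim:.1e}")
```

With these assumed inputs the claimed figure comes out a factor of 1/P_S better than anything the record supports, which is exactly the kind of really small number the trick is designed to produce.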

Yours hungrily, 


About time I hear you say! 🙂

Yes, I’ve just rewritten a post on functional failure taxonomies to include how to use them to gauge the completeness of your analysis. This came out of a question I was asked in a workshop that went something like, ‘OK Mr Big-Shot Consultant, tell us, exactly how do we validate that our analysis is complete?’. That’s actually a fair question: standards like EUROCONTROL’s SAM Handbook and ARP 4761 tell you that you ought to, but are not that helpful in the how-to-do-it department. Hence this post.

Using a taxonomy to determine the coverage of the analysis is one approach to determining completeness. The other is to perform at least two analyses using different techniques and then compare the overlap of hazards using a capture/recapture technique. If there’s a high degree of overlap you can be confident there’s only a small hidden population of hazards as yet unidentified. If there’s a very low overlap, you may have a problem.
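As a rough illustration of the capture/recapture idea, here is a minimal sketch using the Lincoln-Petersen estimator, which is my choice of estimator rather than anything the standards prescribe. It assumes the two analyses are independent and that every hazard is equally likely to be found by each; the hazard identifiers and the two analysis names are hypothetical.

```python
def lincoln_petersen(found_a, found_b):
    """Estimate the total hazard population from two independent analyses.

    Returns (estimated_total, estimated_hidden), where hidden is the
    estimated number of hazards that neither analysis has found.
    """
    a, b = set(found_a), set(found_b)
    overlap = len(a & b)
    if overlap == 0:
        # With no common hazards the estimate is unbounded.
        raise ValueError("no overlap between analyses; cannot estimate")
    estimated_total = len(a) * len(b) / overlap
    estimated_hidden = estimated_total - len(a | b)
    return estimated_total, estimated_hidden

# Hypothetical hazard logs from two different techniques.
fha_hazards = {f"H{i}" for i in range(1, 11)}     # H1..H10
hazop_hazards = {f"H{i}" for i in range(6, 16)}   # H6..H15

total, hidden = lincoln_petersen(fha_hazards, hazop_hazards)
print(f"estimated total hazards: {total:.0f}, still hidden: {hidden:.0f}")
```

The more the two hazard sets overlap, the smaller the estimated hidden population; with little overlap the estimate balloons, which is the warning sign mentioned above.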


The 15 commandments of the god of the machine

Herewith are the 15 commandments for thine safety critical software, as spoken by the machine god unto his prophet Kopetz.

  1. Thou shalt regard the system safety case as thy tabernacle of safety and derive thine critical software failure modes and requirements from it.
  2. Thou shalt adopt a fundamentally safe architecture and define thy fault tolerance hypothesis as part of this. Even unto the definition of fault containment regions, their modes of failure and likelihood.
  3. Thine fault tolerance shall include start-up, operating and shutdown states.
  4. Thine system shall be partitioned to ‘divide and conquer’ the design. Yea, such partitioning shall include the precise specification of component interfaces by time and value such that all manner of men shall comprehend them.
  5. Thine project team shall develop a consistent model of time and state for even unto the concept of states and fault recovery by voting is the definition of time important.
  6. Yea even though thou hast selected a safety architecture pleasing to the lord, yet it is but a house built upon the sand, if no ‘programming in the small’ error detection and fault recovery is provided.
  7. Thou shall ensure that errors are contained and do not propagate through the system, for an error idly propagated to a service interface is displeasing to the lord god of safety and invalidates your righteous claims of independence.
  8. Thou shall ensure independent channels and components do not have common mode failures, for it is said that homogeneous redundant channels protect only from random hardware failures, neither from the common external cause such as EMI or power loss, nor from the common software design fault.
  9. Thine voting software shall follow the self-confidence principle for it is said that if the self-confidence principle is observed then a correct FCR will always make the correct decision under the assumption of a single faulty FCR, and only a faulty FCR will make false decisions.
  10. Thou shall hide and separate thy fault-tolerance mechanisms so that they do not introduce fear, doubt and further design errors unto the developers of the application code.
  11. Thou shall design your system for diagnosis, for it is said that even a righteously designed fault tolerant system may hide such faults from view whereas thy system’s maintainers must replace the affected LRU.
  12. Thine interfaces shall be helpful and forgive the operator his errors, neither shall thine system dump the problem in the operator’s lap without prior warning of impending doom.
  13. Thine software shall record every single anomaly, for your lord god requires that every anomaly observed during operation must be investigated until a root cause is defined.
  14. Thou shall mitigate further hazards introduced by your design decisions, for better it is that you not program in C++, yet still is it righteous to prevent the dangling of thine pointers and memory leaks.
  15. Thou shall develop a consistent fault recovery strategy such that even in the face of violations of your fault hypothesis thine system shall restart and never give up.

MH370 underwater search area map (Image source- Australian Govt)

After millions of dollars and years of effort the ATSB has suspended its search for the wreck of MH370. There are some bureaucratic weasel words, but we are done, people. Of course, had the ATSB applied Bayesian search techniques, as the USN did in the successful search for its missing USS Scorpion, we might actually know where it is.
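For those unfamiliar with the idea, here is a minimal sketch of the core update at the heart of a Bayesian search. It is not a reconstruction of what the USN did: the three-cell grid, the priors and the detection probabilities are hypothetical values chosen purely for illustration.

```python
def update_after_failed_search(prior, cell, p_detect):
    """Bayes update of the location probabilities after searching `cell`
    and finding nothing. p_detect is P(find | wreck is in that cell)."""
    p_nothing_found = 1.0 - prior[cell] * p_detect
    posterior = [p / p_nothing_found for p in prior]
    posterior[cell] = prior[cell] * (1.0 - p_detect) / p_nothing_found
    return posterior

def next_cell(probabilities, p_detect_by_cell):
    """Choose the cell offering the highest chance of a find on the next pass."""
    return max(range(len(probabilities)),
               key=lambda i: probabilities[i] * p_detect_by_cell[i])

# Hypothetical search grid: prior belief per cell and sensor effectiveness.
prior = [0.5, 0.3, 0.2]
p_detect_by_cell = [0.8, 0.8, 0.8]

first = next_cell(prior, p_detect_by_cell)
posterior = update_after_failed_search(prior, first, p_detect_by_cell[first])
second = next_cell(posterior, p_detect_by_cell)
print(first, [round(p, 3) for p in posterior], second)
```

Each fruitless pass shifts belief away from the searched cell and towards the others, so the search plan keeps adapting to what has (not) been found rather than grinding through a fixed area.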

M1 Risk_Spectrum_redux

A short article on (you guessed it) risk, uncertainty and unpleasant surprises for the 25th Anniversary issue of the UK SCS Club’s Newsletter, in which I introduce a unified theory of risk management that brings together aleatory, epistemic and ontological risk management and formalises the Rumsfeld four quadrant risk model which I’ve used for a while as a teaching aid.

My thanks once again to Felix Redmill for the opportunity to contribute. 🙂

Joshua Brown screen grab

Keep your eyes on the road, and your hands upon the wheel…

The first fatality involving the use of Tesla’s autopilot* occurred last May. The Guardian reported that the autopilot sensors on the Model S failed to distinguish a white tractor-trailer crossing the highway against a bright sky and promptly tried to drive under the trailer, with decapitating results. What’s emerged is that the driver had a history of driving at speed and also of using the automation beyond the maker’s intent, e.g. operating the vehicle hands off rather than hands on, as the screen grab above indicates. Indeed, recent reports indicate that immediately prior to the accident he was travelling fast (maybe too fast) whilst watching a Harry Potter DVD. There also appears to be a community of like-minded spirits out there who are intent on seeing how far they can push the automation… sigh.