Archives For Risk

What is risk, how do we categorise it and how do we deal with it?

Black Saturday


So ten years on from the Black Saturday fires we’re taking a national moment to remember the unstinting heroism displayed in the face of the hell that was Black Saturday and all that was lost. Of course we’re not quite so diligent in remembering that we haven’t prevented people rebuilding in high risk areas, nor improved the fire resistance of people’s homes, nor yet managed down the risk through realistic burn off programs, nor for that matter have we noticed that the fuel burden in 2018 is back up to the same levels as it was ten years ago. So perhaps instead we should reflect on how we’ve squandered the opportunity for reform. And perhaps we should remember that a fire will come again, to burn our not so clever country.

…it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.

Hans Moravec

When you look at the safety performance of industries which have a consistent focus on safety as part of their social permit (nuclear and aviation are the canonical examples) you see that over time increases in safety tend to plateau out. This looks like some form of learning curve, but what’s the mechanism, or mechanisms, that actually drive this process? I believe there are two factors at play here: firstly the increasing marginal cost of improvement, and secondly the problem of learning from the very events that we are trying to prevent.

Increasing marginal cost is simply an economist’s way of stating that it will cost more to achieve that next increment in performance. For example, airbags are more expensive than seat belts by roughly an order of magnitude (based on replacement costs); however, airbags only deliver an 8% reduction in mortality when used in conjunction with seat belts, see Crandall et al. (2001). As a result the next increment in safety takes longer and costs more (1).

The learning factor is in some ways like an informational version of the marginal cost rule. As we reduce accident rates, accidents become rarer. Now one of the traditional ways in which safety improvements occur is through studying accidents when they occur and then eliminating or mitigating the identified causal factors. Obviously as the accident rate decreases the opportunity for improvement likewise decreases. When accidents do occur we have a further problem because (definitionally) the cause of the accident will comprise a highly unlikely combination of factors that are needed to defeat the existing safety measures. Corrective actions for such rare combinations of events are therefore highly specific to that event’s context and conversely will have far less universal applicability. For example the lessons of metal fatigue learned from the Comet airliner disasters have had universal applicability to all aircraft designs ever since. But the QF-72 automation upset off Learmonth? Well those lessons, relating to the specific fault tolerance architecture of the A330, are much harder to generalise and therefore have less epistemic strength.

In summary not only does it cost more with each increasing increment of safety but our opportunity to learn through accidents is steadily reduced as their arrival rate and individual epistemic value (2) reduce.


1. In some circumstances we may also introduce other risks, see for example the death and severe injury caused to small children from air bag deployments.

2. In a Popperian sense.


1. Crandall, C.S., Olson, L.M., Sklar, D.P., Mortality Reduction with Air Bag and Seat Belt Use in Head-on Passenger Car Collisions, American Journal of Epidemiology, Volume 153, Issue 3, 1 February 2001, Pages 219–224.

Update to the MH-370 hidden lesson post just published, in which I go into a little more detail on what I think could be done to prevent another such tragedy.

Piece of wing found on La Réunion Island, could it be a flap from #MH370? (Image credit: reunion 1ere)

The search for MH370 will end next Tuesday with the question of its fate no closer to resolution. There is perhaps one lesson that we can glean from this mystery, and that is that when we have a two man crew behind a terrorist proof door there is a real possibility that disaster is check-riding the flight. As Kenedi et al. note in a 2016 study, five of the six recorded murder-suicide events by pilots of commercial airliners occurred after they were left alone in the cockpit; in the cases of both Germanwings 9525 and LAM 470 this was enabled by one of the crew being able to lock the other out of the cockpit. So while we don’t know exactly what happened onboard MH370 we do know that the aircraft was flown deliberately to some point in the Indian Ocean, and on the balance of probabilities that was done by one of the crew with the other crew member unable to intervene, probably because they were dead.

As I’ve written before, the combination of small crew sizes to reduce costs and a secure cockpit to reduce hijacking risk increases the probability of one crew member being able to successfully disable the other and then do exactly whatever they like. Thus the increased anti-hijacking security measures act as a perverse incentive for pilot murder-suicides, and may over the long run turn out to kill more people than terrorism (1). Or to put it more brutally, murder and suicide are much more likely to be successful with small crew sizes, so these scenarios, however dark they may be, need to be guarded against in an effective fashion (2).

One way to guard against such common mode failures of the human is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to our affect driven processing, with a sufficient grasp of the theory of mind and the subtleties of human psychology and group dynamics to be able to make usefully accurate predictions of what the crew will do next. With that insight goes the requirement for autonomy in vetoing illogical and patently hazardous crew actions, e.g. “I’m sorry Captain, but I’m afraid I can’t let you reduce the cabin air pressure to hazardous levels”. The really difficult problem is of course building something sophisticated enough to understand ‘hinky’ behaviour and then intervene. There are however other scenarios where some form of lesser AI would be of use. The Helios Airways depressurisation is a good example of an incident where both flight crew were incapacitated, so a system that does the equivalent of “Dave! Dave! We’re depressurising, unless you intervene in 5 seconds I’m descending!” would be useful. Then there’s the good old scenario of both pilots falling asleep, as likely happened at Minneapolis, so something like “Hello Dave, I can’t help but notice that your breathing indicates that you and Frank are both asleep, so WAKE UP!” would be helpful here. Oh, and someone to punch out a quick “Mayday” while the pilots are otherwise engaged would also help tremendously, as aircraft going down without a single squawk recurs again and again and again.

I guess I’ve slowly come to the conclusion that two man crews, while optimised for cost, are distinctly sub-optimal when it comes to dealing with a number of human factors issues and likewise sub-optimal when it comes to dealing with major ‘left field’ emergencies that aren’t in the QRH. Fundamentally a dual redundant design pattern for people doesn’t really address the likelihood of what we might call common mode failures. While we probably can’t get another human crew member back in the cockpit, working to make the cockpit automation more collaborative and less ‘strong but silent’ would be a good start. And of course if the aviation industry wants to keep making improvements in aviation safety then these are the sort of issues they’re going to have to tackle. Where is a good AI, or even an uninterruptible autopilot, when you really need one?


1. Kenedi (2016) found from 1999 to 2015 that there had been 18 cases of homicide-suicide involving 732 deaths.

2. No go alone rules are unfortunately only partially effective.


Kenedi, C., Friedman, S.H., Watson, D., Preitner, C., Suicide and Murder-Suicide Involving Aircraft, Aerospace Medicine and Human Performance, Aerospace Medical Association, 2016.

One of the perennial problems we face in a system safety program is how to come up with a convincing proof for the proposition that a system is safe. Because it’s hard to prove a negative (in this case the absence of future accidents) the usual approach is to pursue a proof by contradiction, that is, develop the negative proposition that the system is unsafe, then prove that this is not true, normally by showing that the set of identified specific propositions of ‘un-safety’ have been eliminated or controlled to an acceptable level. Enter the term ‘hazard’, which in this context is simply shorthand for a specific proposition about the unsafeness of a system. Now interestingly when we parse the set of definitions of hazard we find the recurring use of terms like ‘condition’, ‘state’, ‘situation’ and ‘events’ that should they occur will inevitably lead to an ‘accident’ or ‘mishap’. So broadly speaking a hazard is an explanation based on a defined set of phenomena, which argues that if they are present, and given there exists some relevant domain source (1) of hazard, an accident will occur. All of which seems to indicate that hazards belong to a class of explanatory models called covering laws. As an explanatory class, covering law models were developed by the logical positivist philosophers Hempel and Popper because of what they saw as problems with an over reliance on inductive arguments as to causality.

As a covering law explanation of unsafeness a hazard posits phenomenological facts (system states, human errors, hardware/software failures and so on) that confer what’s called nomic expectability on the accident (the thing being explained). That is, the phenomenological facts combined with some covering law (natural and logical) require the accident to happen, and this is what we call a hazard. We can see an archetypal example in the Source-Mechanism-Outcome model of Swallom, i.e. if we have both a source and a set of mechanisms in that model then we may expect an accident (Ericson 2005). While logical positivism had the last nails driven into its coffin by Kuhn and others in the 1960s, and it’s true, as Kuhn and others pointed out, that covering model explanations have their fair share of problems, so too do other methods (2). The one advantage that covering models do possess over other explanatory models however is that they largely avoid the problems of causal arguments. Which may well be why they persist in engineering arguments about safety.


1. The source in this instance is the ‘covering law’.

2. Such as counterfactual, statistical relevance or causal explanations.


Ericson, C.A. Hazard Analysis Techniques for System Safety, page 93, John Wiley and Sons, Hoboken, New Jersey, 2005.

We are hectored on an almost daily basis about the imminent threat of Islamic extremism and how we must respond firmly to this real and present danger. Indeed we have proceeded far enough along the escalation of response ladder that this, presumably existential, threat is now being used to justify talk of internment without trial. So what is the probability that if you were murdered, the murderer would be an immigrant terrorist?

In NSW in 2014 there were 86 homicides, of these 1 was directly related to the act of a homegrown Islamist terrorist (1). So there’s a 1 in 86 chance that in that year if you were murdered it was at the hands of a mentally disturbed asylum seeker (2). Hmm, sounds risky, but is it? Well there were approximately 2.5 million people in NSW in 2014 so the likelihood of being murdered (in that year) is in the first instance 3.44e-5. To figure out the likelihood of being murdered and that murder being committed by a terrorist we just multiply this base rate by the probability that it was at the hands of a ‘terrorist’, ending up with 4e-7 or 4 chances in 10 million that year. If we consider subsequent and prior years where nothing happened that likelihood becomes even smaller.
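The arithmetic above can be checked with a few lines of Python. This is only a sketch that reproduces the calculation using the figures exactly as quoted in the text; the function name is made up for illustration, and the inputs are parameters so that other figures can be substituted.

```python
def murder_by_terrorist_probability(homicides, terror_homicides, population):
    """Return (P(murdered), P(terrorist | murdered), P(murdered by a terrorist)).

    Hypothetical helper for illustration only; it simply multiplies the
    annual base rate of homicide by the share of homicides attributed
    to terrorism.
    """
    p_murdered = homicides / population
    p_terrorist_given_murdered = terror_homicides / homicides
    p_joint = p_murdered * p_terrorist_given_murdered
    return p_murdered, p_terrorist_given_murdered, p_joint


# Figures as quoted in the text above for NSW in 2014.
p_m, p_t, p_joint = murder_by_terrorist_probability(86, 1, 2_500_000)
print(f"P(murdered in the year)    = {p_m:.2e}")
print(f"P(terrorist | murdered)    = {p_t:.4f}")
print(f"P(murdered by a terrorist) = {p_joint:.1e}")
```

Note that the homicide count cancels out of the final product, so the joint probability is simply the number of terrorism related homicides divided by the population.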

Based on this 4 in 10 million chance the NSW government intends to build a super-max 2 prison in NSW, and fill it with ‘terrorists’ while the Federal government enacts more anti-terrorism laws that take us down the road to the surveillance state, if we’re not already there yet. The glaring difference between the perception of risk and the actuality is one that politicians and commentators alike seem oblivious to (3).


1. One death during the Lindt chocolate siege that could be directly attributed to the ‘terrorist’.

2. Sought and granted in 2001 by the then Liberal National Party government.

3. An action that also ignores the role of prisons in converting inmates to Islam as a route to recruiting their criminal, anti-social and violent sub-populations in the service of Sunni extremists.

With the NSW Rural Fire Service fighting more than 50 fires across the state and the unprecedented hellish conditions set to deteriorate even further with the arrival of strong winds the question of the day is, exactly how bad could this get? The answer is unfortunately, a whole lot worse. That’s because we have difficulty as human beings in thinking about and dealing with extreme events… To quote from this post written in the aftermath of the 2009 Victorian Black Saturday fires.

So how unthinkable could it get? The likelihood of a fire versus its severity can be credibly modelled as a power law, a particular type of heavy tailed distribution (Clauset et al. 2007). This means that extreme events in the tail of the distribution are far more likely than predicted by a Gaussian (the classic bell curve) distribution. So while a mega fire ten times the size of the Black Saturday fires is far less likely, it is not as completely improbable as our intuitive availability heuristic would indicate. In fact it’s much worse than we might think: in heavy tail distributions you need to apply what’s called the mean excess heuristic, which really translates to ‘the next worst event is almost always going to be much worse’…
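To make the heavy tail point concrete, here is a minimal sketch comparing a Pareto (power law) tail against a Gaussian with the same mean and standard deviation. The exponent of 2.5, the unit minimum size and the ‘reference’ fire size of 10 are all illustrative assumptions, not values fitted to any fire data.

```python
import math


def pareto_survival(x, alpha, x_min=1.0):
    """P(X > x) for a Pareto (power law) distribution with minimum x_min."""
    if x <= x_min:
        return 1.0
    return (x_min / x) ** alpha


def gaussian_survival(x, mean, sd):
    """P(X > x) for a Gaussian distribution."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def pareto_mean_excess(u, alpha):
    """E[X - u | X > u] for a Pareto tail (u >= x_min, alpha > 1): grows linearly with u."""
    if alpha <= 1.0:
        raise ValueError("mean excess is infinite for alpha <= 1")
    return u / (alpha - 1.0)


ALPHA = 2.5  # assumed tail exponent, for illustration only
# Mean and standard deviation of a Pareto with x_min = 1 (finite because alpha > 2).
mean = ALPHA / (ALPHA - 1.0)
sd = math.sqrt(ALPHA / ((ALPHA - 1.0) ** 2 * (ALPHA - 2.0)))

reference = 10.0  # hypothetical size of a 'Black Saturday' scale event
# Given an event at least as large as the reference, how likely is one ten times larger?
pareto_ratio = pareto_survival(10 * reference, ALPHA) / pareto_survival(reference, ALPHA)
gauss_ratio = gaussian_survival(10 * reference, mean, sd) / gaussian_survival(reference, mean, sd)

print(f"Power law: P(>10x | >1x) = {pareto_ratio:.4f}")
print(f"Gaussian:  P(>10x | >1x) = {gauss_ratio:.3e}")
print(f"Mean excess beyond the reference size: {pareto_mean_excess(reference, ALPHA):.2f}")
print(f"Mean excess beyond ten times that:     {pareto_mean_excess(10 * reference, ALPHA):.2f}")
```

Under these assumptions the power law gives roughly a 0.3% chance that an event exceeding the reference size also exceeds ten times that size, while the matched Gaussian puts it at effectively zero. The mean excess grows in proportion to the threshold, which is the sense in which the next record event tends to be much worse than the last.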

So how did we get to this?  Simply put the extreme weather we’ve been experiencing is a tangible, current day effect of climate change. Climate change is not something we can leave to our children to really worry about, it’s happening now. That half a degree rise in global temperature? Well it turns out it supercharges the occurrence rate of extremely dry conditions and the heavy tail of bushfire severity. Yes we’ve been twisting the dragon’s tail and now it’s woken up…

2019 Postscript: Monday 11 November 2019 – NSW

And here we are in 2019 two years down the track from the fires of 2017 and tomorrow looks like being a beyond catastrophic fire day. Firestorms are predicted.

Matrix (Image source: The Matrix film)

How algorithms can kill…

So apparently the Australian Government has been buying its software from Cyberdyne Systems, or at least you’d be forgiven for thinking so given the brutal treatment Centrelink’s autonomous debt recovery software has been handing out to welfare recipients who ‘it’ believes have been rorting the system. Yep, you heard right, it’s a completely automated compliance operation (well at least the issuing part).

Donald Trump

Image source: AP/LM Otero

A Trump presidency in the wings, who’d have thought! And what a total shock it was to all those pollsters, commentators and apparatchiks who are now trying to explain why they got it so wrong. All of which is a textbook example of what students of risk theory call a Black Swan event.

Accidents with catastrophic potential pose a particular challenge to classical utilitarian theories of managing risk. A reader of this blog might be aware of how the possibility of irreversible catastrophic outcomes (i.e. non-ergodicity) undermines a key assumption on which classical risk assessment is based. But what to do about it? Well, one thing we can practically do is to ensure that when we assess risk we take into account the irreversible (non-ergodic) nature of such catastrophes. There are good reasons that we should do so, as the law does not look kindly on organisations (or people) who make decisions about risk of death purely on the basis of frequency gambling.

A while ago I put together a classical risk matrix (1) that treated risk in accordance with De Moivre’s formulation and I’ve modified this matrix to explicitly address non-ergodicity. The modification is to the extreme (catastrophic) severity column where I’ve shifted the boundary of unacceptable risk downwards to reflect that the (classical) iso-risk contour in that catastrophic case under-estimates the risk posed by catastrophic irreversible outcomes. The matrix now also imposes claim limits on risk where a SPOF may exist that could result in a catastrophic loss (2). We end up with something that looks a bit like the matrix below (3).


From a decision making perspective you’ll note that not only is the threshold for unacceptable risk reduced but that for catastrophic severity (one or more deaths) there is no longer an ‘acceptable’ threshold. This is an important consideration, reflecting as it does the law’s position that you cannot gamble away your duty of care, e.g. justify not taking an action purely on the basis of a risk threshold (4). The final outcome of this work, along with revised likelihood and severity definitions, can be found in hazard matrix V1.1 (5). I’m still thinking about how you might introduce more consideration of epistemic and ontological risks into the matrix, it’s a work in progress.


1. Mainly to provide a canonical example of what a well constructed matrix should look like as there are an awful lot of bad ones floating around.

2. You have to either eliminate the SPOF or reduce the severity. There’s an implied treatment of epistemic uncertainty in such a claim limit that I find appealing.

3. The star represents a calibration point that’s used when soliciting subjective assessments of likelihood from SMEs.

4. By the way, you’re not going to find this sort of consideration in ISO 31000.

5. Important note: like all risk matrices it needs to be calibrated to the actual circumstances and risk appetite of the organisation. No warranty given and YMMV.

Anna Johnson on boycotting the census


A short article on (you guessed it) risk, uncertainty and unpleasant surprises for the 25th Anniversary issue of the UK SCS Club’s Newsletter, in which I introduce a unified theory of risk management that brings together aleatory, epistemic and ontological risk management and formalises the Rumsfeld four quadrant risk model which I’ve used for a while as a teaching aid.

My thanks once again to Felix Redmill for the opportunity to contribute.  🙂


Safety cases and that room full of monkeys

Back in 1943, the French mathematician Émile Borel published a book titled Les probabilités et la vie, in which he stated what has come to be called Borel’s law which can be paraphrased as, “Events with a sufficiently small probability never occur.”

MH 370 search vessel (Image source: ATSB)

Once more with feeling

Sonar vessels searching for Malaysia Airlines Flight MH370 in the southern Indian Ocean may have missed the jet, the ATSB’s Chief Commissioner Martin Dolan has told News Online. He went on to point out the uncertainties involved, the difficulty of terrain that could mask the signature of wreckage, and that problematic areas would therefore need to be re-surveyed. Despite all that the Commissioner was confident that the wreckage site would be found by June. Me, I’m not so sure.

The long gone, but not forgotten, second issue of the UK MoD’s safety management standard DEFSTAN 00-56 introduced the concept of a qualitative likelihood of Incredible. This is, however, not just another likelihood category. The intention of the standard writers was that it would be used to capture risks that were deemed effectively impossible, given the assumptions about the domain and system. The category was to be applied to those scenarios where the hazard had been designed out, where the design concept had been assessed and the posited hazard turned out to be simply not applicable, or where some non-probabilistic technique was used to verify the safety of the system (think mathematical proof). Such a category records that yes, it’s effectively impossible, while retaining the record of assessment should it become necessary to revisit it, a useful mechanism.

A.1.19 Incredible. Believed to have a probability of occurrence too low for expression in meaningful numerical terms.

DEFSTAN 00-56 Issue 2

I’ve seen this approach mangled in a number of hazard analyses where the disjoint nature of the Incredible category was not recognised and it was thereafter assigned a specific likelihood that followed on in a decadal fashion from the next highest category. Yes, difficulties ensued. The key is that Incredible is not the next likelihood bin after Improbable; it is in fact beyond the end of the line, where we park those hazards that we have judged to have an immeasurably small likelihood of occurrence. This, we are asserting, will not happen and we are as confident of that fact as one can ever be.

“Incredible” may be exceptionally defined in terms of reasoned argument that does not rely solely on numerical probabilities.

DEFSTAN 00-56 Issue 2

To put it another way, the category reflects a statement of our degree of belief that an event will not occur rather than an assertion as to its frequency of occurrence, as the other subjective categories do. What the standard writers have unwittingly done is introduce a superset, in which the ‘no hazard exists’ set is represented by Incredible and the other likelihoods form the ‘a hazard exists’ set. All of which starts to sound like a mashup of frequentist probabilities with Dempster-Shafer belief structures. Promising, it’s a pity the standard committee didn’t take the concept further.


The other pity is that the standard committee didn’t link this idea of “incredible” to Borel’s law. Had they done so we would have a mechanism to make explicit what I call the infinite monkeys safety argument.

Crowley (Image source: Warner Bros. TV)

The psychological basis of uncertainty

There’s a famous psychological experiment conducted by Ellsberg, called eponymously the Ellsberg paradox, in which he showed that people overwhelmingly prefer a betting scenario in which the probabilities are known, rather than one in which the odds are actually ambiguous, even if the potential for winning might be greater.


One of the problems that we face in estimating risk is that as our uncertainty increases our ability to express it in a precise fashion (e.g. numerically) weakens, to the point where for deep uncertainty (1) we definitionally cannot make a direct estimate of risk in the classical sense.

Perusing the FAA’s system safety handbook while doing some research for a current job, I came upon an interesting definition of severities. What’s interesting is that the FAA introduces the concept of safety margin reduction as a specific form of severity (loss).

Here’s a summary of Table 3-2 from the handbook:

  • Catastrophic – ‘Multiple fatalities and/or loss of system’
  • Major – ‘Significant reduction in safety margin…’
  • Minor – ‘Slight reduction in safety margin…’

If we think about safety margins for a functional system they represent a system state that’s a precursor to a mishap, with the margin representing some intervening set of states. But a system state of reduced safety margin (let’s call it a hazard state) is causally linked to a mishap state, else we wouldn’t care, and must therefore inherit its severity. The problem is that in the FAA’s definition they have arbitrarily assigned severity levels to specific hazardous degrees of safety margin reduction, yet all these could still be linked causally to a catastrophic event, e.g. a mid-air collision.

What the FAA’s Systems Engineering Council (SEC) has done is conflate severity with likelihood; as a result their severity definition is actually a risk definition, at least when it comes to safety margin hazards. The problem with this approach is that we end up under-treating risks as per classical risk theory. For example, say we have a potential reduction in safety margin which is also causally linked to a catastrophic outcome. Now per Table 3-2, if the reduction was classified as ‘slight’ then we would assess the probability and, given the minor severity, decide to do nothing, even though in reality the severity is still catastrophic. If, on the other hand, we decided to make decisions based on severity alone, we would still end up making a hidden risk judgement depending on what the likelihood of propagation from hazard state to accident state was (undefined in the handbook). So basically the definitions set you up for trouble even before you start.

My guess is that the SEC decided to fill in the lesser severities with hazard states because for an ATM system true mishaps tend to be invariably catastrophic, and they were left scratching their heads for lesser severity mishap definitions. Enter the safety margin reduction hazard. The take home from all this is that severity needs to be based on the loss event; introducing intermediate hybrid hazard/severity state definitions leads inevitably to incoherence in your definition of risk. Oh and (as far as I am aware) this malformed definition has spread everywhere…


With much pomp and circumstance the Attorney General and our top state security mandarins have rolled out the brand new threat level advisory system. Congrats to us, we are now the proud owners of a five runged ladder of terror. There’s just one small teeny tiny insignificant problem, it just doesn’t work. Yep that’s right, as a tool for communicating it’s completely void of meaning, useless in fact, a hopelessly vacuous piece of security theatre.

You see the levels of this scale are based on likelihood. But whoever designed the scale forgot to include over what duration they were estimating the likelihood. And without that duration it’s just a meaningless list of words. 

Here’s how likelihood works. Say you ask me whether it’s likely to rain tomorrow, I say ‘unlikely’, now ask me whether it will rain in the next week, well that’s a bit more likely isn’t it? OK, so next you ask me whether it’ll rain in the next year? Well unless you live in Alice Springs the answer is going to be even more likely, maybe almost certain isn’t it? So you can see that the duration we’re thinking of affects the likelihood we come up with because it’s a cumulative measure. 

Now ask me whether a terrorist threat was going to happen tomorrow? I’d probably say it was so unlikely that it was ‘Not expected’. But if you asked me whether one might occur in the next year I’d say (as we’re accumulating exposure) it’d be more likely, maybe even ‘Probable’, while if the question was asked about a decade of exposure I’d almost certainly say it was ‘Certain’. So you see how a scale without a duration means absolutely nothing, in fact it’s much worse than nothing, it actually causes misunderstanding because I may be thinking of threats across the next year, while you may be thinking about threats occurring in the next month. So it actually communicates negative information.
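The cumulative effect of exposure can be sketched numerically. The snippet below assumes, purely for illustration, a constant and independent daily probability of an event (the value 0.001 is hypothetical, not an estimate of any real threat) and computes the chance of at least one occurrence over different durations.

```python
def prob_at_least_one(p_per_day, days):
    """Probability of at least one event in `days` days.

    Assumes each day is independent with the same probability p_per_day,
    which is a simplification made for illustration.
    """
    return 1.0 - (1.0 - p_per_day) ** days


P_DAILY = 0.001  # hypothetical daily probability, for illustration only

for label, days in [("day", 1), ("week", 7), ("month", 30), ("year", 365), ("decade", 3650)]:
    print(f"Next {label:<7}: {prob_at_least_one(P_DAILY, days):.3f}")
```

The same underlying daily rate reads as anything from ‘not expected’ to ‘almost certain’ depending on the exposure period, which is why a likelihood word without a stated duration carries no information.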

And this took years of consideration according to the Attorney General, man we are governed by second raters. Puts head in hands. 

Screwtape (Image source: end time info)

How to deal with those pesky high risks without even trying

Screwtape here,

One of my clients recently came to me with what seemed to be an insurmountable problem in getting his facility accepted despite the presence of an unacceptably high risk of a catastrophic accident. The regulator, not happy, likewise all those mothers with placards outside his office every morning. Most upsetting. Not a problem said I, let me introduce you to the Screwtape LLC patented cut and come again risk refactoring strategy. Please forgive me now dear reader for without further ado we must do some math.

Risk is defined as the loss times probability of loss or R = L x P (1), which is the reverse of expectation, now interestingly if we have a set of individual risks we can add them together to get the total risk, for our facility we might say that total risk is R_f = (R_1 + R_2 + R_3 … + R_n). ‘So what Screwtape, this will not pacify those angry mothers!’ I hear you say? Ahh, now bear with me as I show you how we can hide, err I mean refactor, our unacceptable risk in plain view. Let us also posit that we have a number of systems S_1, S_2, S_3 and so on in our facility… Well instead of looking at the total facility risk, let’s go down inside our facility and look at risks at the system level. Given that the probability of each subsystem causing an accident is (by definition) much less, why then per system the risk must also be less! If you don’t get an acceptable risk at the system level then go down to the subsystem, or equipment level.

The coup de grâce is to present this ensemble of subsystem risks as a voluminous and comprehensive list (2), thereby convincing everyone of the earnestness of your endeavours, but omit any consideration of ensemble risk (3). Of course one should be scrupulously careful that the numbers add up, even though you don’t present them. After all there’s no point in getting caught for stealing a penny while engaged in purloining the Bank of England! For extra points we can utilise subjective measures of risk rather than numeric, thereby obfuscating the proceedings further.

Needless to say my client went away a happy man, the facility was built and the total risk of operation was hidden right there in plain sight… ah how I love the remorseless bloody hand of progress.

Infernally yours,



1. Where R = Risk, L = Loss, and P = Probability after De Moivre. I believe Screwtape keeps De Moivre’s heart in a jar on his desk. (Ed.).

2. The technical term for this is a Preliminary Hazard Analysis.

3. Screwtape omitted to note that total risk remains the same, all we’ve done is budgeted it out across an ensemble of subsystems, i.e. R_f = R_s1 + R_s2 + R_s3 (Ed.).
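(Ed.) For readers who would like to see the sleight of hand in numbers, here is a minimal sketch. The loss, probability, acceptance threshold and number of subsystems are all hypothetical, and it assumes the facility’s accident probability can be split evenly and additively across subsystems (a rare event approximation).

```python
import math


def risk(loss, probability):
    """Classical risk: loss multiplied by probability of loss (R = L x P)."""
    return loss * probability


# All values below are hypothetical and chosen only to illustrate the point.
LOSS = 1000.0       # loss associated with the catastrophic accident
P_FACILITY = 1e-3   # probability of the accident at facility level
THRESHOLD = 0.5     # maximum 'acceptable' risk for any single assessed item
N_SUBSYSTEMS = 10   # number of subsystems the probability is budgeted across

facility_risk = risk(LOSS, P_FACILITY)
subsystem_risks = [risk(LOSS, P_FACILITY / N_SUBSYSTEMS) for _ in range(N_SUBSYSTEMS)]
ensemble_risk = math.fsum(subsystem_risks)

print(f"Facility risk:          {facility_risk:.2f} (acceptable: {facility_risk <= THRESHOLD})")
print(f"Largest subsystem risk: {max(subsystem_risks):.2f} "
      f"(acceptable: {max(subsystem_risks) <= THRESHOLD})")
print(f"Sum of subsystem risks: {ensemble_risk:.2f}")
```

Every subsystem passes the threshold on its own, yet the sum of the subsystem risks is exactly the facility risk that failed it. Nothing has been reduced, only budgeted out of sight.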







Why probability is not corroboration

The IEC’s 61508 standard on functional safety  assigns a series of Safety Integrity Levels (SIL) that correlate to the achievement of specific hazardous failure rates. Unfortunately this definition of SILs, that ties SILs to a probabilistic metric of failure, contains a fatal flaw.


Meltwater river Greenland icecap (Image source: Ian Joughin)

Memes, media and drug dealers

In honour of our Prime Minister’s use of the drug dealer’s argument to justify (at least to himself) why it’s OK for Australia to continue to sell coal, when we know we really have to stop, here’s an update of a piece I wrote on the role of the media in propagating denialist memes. Enjoy, there’s even a public health tip at the end.

PS. You can find Part I and II of the series here.


Technical debt


St Briavels Castle Debtors Prison (Image source: Public domain)

Paying down the debt

A great term that I’ve just come across, technical debt is a metaphor coined by Ward Cunningham to reflect on how a decision to act expediently for an immediate reason may have longer term consequences. This is a classic problem during design and development where we have to balance various ‘quality’ factors against cost and schedule. The point of the metaphor is that this debt doesn’t go away, the interest on that sloppy or expedient design solution keeps on getting paid every time you make a change and find that it’s harder than it should be. Turning around and ‘fixing’ the design in effect pays back the principal that you originally incurred. Failing to pay off the principal? Well such tales can end darkly.

Inspecting Tacoma Narrows (Image source: Public domain)

We don’t know what we don’t know

The Tacoma Narrows bridge stands, or rather falls, as a classic example of what happens when we run up against the limits of our knowledge. The failure of the bridge due to a then unknown torsional aeroelastic flutter mode, to which the bridge with its high span to width ratio was particularly vulnerable, is almost a textbook example of ontological risk. Continue Reading…

Icicles on the launch tower (Image source: NASA)

An uneasy truth about the Challenger disaster

The story of Challenger in the public imagination could be summed up as “‘heroic’ engineers versus ‘wicked’ managers”, which is a powerful myth but unfortunately just a myth. In reality? Well the reality is more complex, and the causes of the decision to launch rest in part upon the failure of the engineers participating in the launch decision to clearly communicate the risks involved. Yes that’s right, the engineers screwed up in the first instance. Continue Reading…

Risk managers are the historians of futures that may never be.

Matthew Squair

I’ve rewritten my post on epistemic, aleatory and ontological risk pretty much completely, enjoy.


A tale of another two reactors

There’s been much debate over the years as to whether various tolerance of risk approaches actually satisfy the legal principle of reasonable practicability. But there hasn’t to my mind been much consideration of the value of simply adopting the legalistic approach in situations where we have a high degree of uncertainty regarding the likelihood of adverse events. In such circumstances basing our decisions upon what can turn out to be very unreliable estimates of risk can have extremely unfortunate consequences. Continue Reading…


The current Workplace Health and Safety (WHS) legislation of Australia formalises the common law principle of reasonable practicability in regard to the elimination or minimisation of risks associated with industrial hazards. Having had the advantage of going through this with a couple of clients the above flowchart is my interpretation of what reasonable practicability looks like as a process, annotated with cross references to the legislation and guidance material. What’s most interesting is that the process is determinedly not about tolerance of risk but instead firmly focused on what can reasonably and practicably be done. Continue Reading…

Germanwings crash

MH370 underwater search area map (Image source- Australian Govt)

Bayes and the search for MH370

We are now approximately 60% of the way through searching the MH370 search area, and so far nothing. Which is unfortunate because as the search goes on the cost continues to go up for the taxpayer (and yes I am one of those). What’s more unfortunate, and not a little annoying, is that through all this the ATSB continues to stonily ignore the use of a powerful search technique that’s been used to find everything from lost nuclear submarines to the wreckage of passenger aircraft. Continue Reading…

Here’s an interesting graph that compares Class A mishap rates for USN manned aviation (pretty much from float plane to Super-Hornet) against the USAF’s drone programs. Interesting that both programs steadily track down decade by decade, even in the absence of formal system safety programs for most of the time (1).

USN Manned Aviation vs USAF Drones

The USAF drone program started out at a rate of around 60 mishaps per 100,000 flight hours (equivalent to the USN transitioning to fast jets at the close of the 1940s) and maintains a steeper rate of decrease than the USN aviation program. As a result, while the USAF drone program is tail chasing the USN, it still looks like it’ll hit parity with the USN sometime in the 2040s.

So why is the USAF drone program doing better in pulling down the accident rate, even when they don’t have a formal MIL-STD-882 safety program?

Well for one, a higher degree of automation does have comparative advantages. Although the USN’s carrier aircraft can do auto-land, they generally choose not to, as pilots need to keep their professional skills up, and human error during landing/takeoff inevitably drives the mishap rate up. Therefore a simple thing like implementing an auto-land function for drones (landing a drone is as it turns out not easy) has a comparatively greater bang for your safety buck. There are also inherently higher risks of loss of control and mid-air collision when air combat manoeuvring, or of running into things when flying helicopters at low level, which are operational hazards that drones generally don’t have to worry about.

For another, the development cycle for drones tends to be quicker than for manned aviation, and drones have a ‘somewhat’ looser certification regime, so improvements from the next generation of drone design tend to roll into an expanding operational fleet more quickly. Having a higher cycle rate also helps retain and sustain the corporate memory of the design teams.

Finally there’s the lessons learned effect. With drones the hazards usually don’t need to be identified and then characterised. In contrast with the early days of jet age naval aviation, the hazards drones face are usually well understood, with well understood solutions, and whether these are addressed effectively has more to do with programmatic cost concerns than a lack of understanding. Conversely when it actually comes time to do something like put de-icing onto a drone, there’s a whole lot of experience that can be brought to bear with a very good chance of first time success.

A final question. Looking at the above do we think that the application of rigorous ‘FAA like’ processes or standards like ARP 4761, ARP 4754 and DO-178 would really improve matters?

Hmmm… maybe not a lot.


1. As a historical note while the F-14 program had the first USN aircraft system safety program (it was a small scale contractor in house effort) it was actually the F/A-18 which had the first customer mandated and funded system safety program per MIL-STD-882. USAF drone programs have not had formal system safety programs, as far as I’m aware.
Continue Reading…

For those of you interested in such things, there’s an interesting thread running over on the Safety Critical Mail List at Bielefeld on software failure, sparked off by Peter Ladkin’s post over on Abnormal Distribution on the same subject. Whether software can be said to fail and whether you can use the term reliability to describe it is one of those strange attractors about which the list tends to orbit. An interesting discussion, although at times I did think we were playing a variant of Wittgenstein’s definition game.

And my opinion? Glad you asked.

Yes of course software fails. That its failure is not the same as the pseudo-random failure that we posit for hardware components is neither here nor there. Continue Reading…

Why we should take the safety performance of small samples with a grain of salt

Safety, when expressed quantitatively as the probability of a loss over some unit of exposure, is in effect a proportional rate. This is useful as we can compare the performance of different systems or operations when one has lots of operating hours, and potentially lots of accidents, while another has only a few operating hours and therefore fewer accidents. Continue Reading…
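As a rough illustration of why exposure matters when comparing such rates, here is a minimal sketch that computes a one-sided upper confidence bound on an accident rate. It assumes accidents arrive as a Poisson process, which is my assumption rather than something stated above, and the function names are hypothetical.

```python
import math


def poisson_cdf(k, mu):
    """Return P(X <= k) for X ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def upper_rate_bound(events, exposure_hours, confidence=0.95):
    """One-sided upper confidence bound on the event rate per hour.

    Assumes a Poisson model (an illustrative assumption). Finds, by
    bisection, the expected count mu at which seeing `events` or fewer
    would have probability 1 - confidence, then divides by exposure.
    """
    target = 1.0 - confidence
    lo, hi = 0.0, 10.0 * (events + 1)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(events, mid) > target:
            lo = mid  # mu too small: observation still too likely
        else:
            hi = mid
    return hi / exposure_hours


# Same observed rate (1 per 10,000 hours), very different exposure.
small_fleet = upper_rate_bound(1, 10_000.0)
large_fleet = upper_rate_bound(100, 1_000_000.0)
print(small_fleet, large_fleet)
```

Under these assumptions the two operations have the same point estimate, but the bound for the small sample sits well above the bound for the large one, which is the grain of salt in question.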


I’ll give you a hint it’s not pretty

Current Australian rail and workplace safety legislation requires that safety risks be either eliminated, or if that’s not possible be reduced, ‘so far as is reasonably practicable’. The intent is to ensure that all reasonably practicable precautions are in place, not to achieve some target level of risk.

There are two elements to what is ‘reasonably practicable’. A duty-holder must first consider what can be done – that is, what is possible in the circumstances for ensuring health and safety. They must then consider whether it is reasonable, in the circumstances to do all that is possible. This means that what can be done should be done unless it is reasonable in the circumstances for the duty-holder to do something less.

Worksafe  Australia

This is a real and intractable problem for standards that determine the degree of effort applied to treat a hazard using an initial assessment of risk (1). Nor can the legislation be put aside through appeals to such formalisms as the ALARP principle, or the invocation of a standard such as AS 61508 (2). In essence if you can do something, regardless of the degree of risk, then something should be done.  Continue Reading…

Seconds to Midnight



An interesting article from The Conversation on the semiotics of the Doomsday clock. Continue Reading…

Screwtape (Image source: end time info)

A short (and possibly evil) treatise on SILs from our guest blogger

May I introduce myself?

The name’s Screwtape, some of you might have heard of me from that short and nasty book by C.S. Lewis. All lies of course, and I would know, about lies that is… baboom tish! Anyway the world has moved on and I’m sure that you’d be completely unsurprised to hear that I’ve branched out into software consulting now. I do find the software industry one that is oh so over-ripe for the plucking of immortal souls, ah but I digress. Your good host has asked me here today to render a few words on the question of risk based safety integrity levels and how to turn such pesky ideals, akin in many ways to those other notions of Christian virtue, to your own ends. Continue Reading…

Sharks (Image source: Darren Pateman)

Practical risk management, or why I love living in Australia

We’re into the ninth day of closed beaches here with two large great whites spotted ‘patrolling our shores’, whatever that means. Of course in Australia closed doesn’t actually mean the beaches are padlocked, not yet anyway. We just put a sign up and people can make their own minds up as to whether they wish to run the risk of being bitten. In my books a sensible approach to the issue, one that balances societal responsibility with personal freedom. I mean it’s not like they’re as dangerous as bicycles. Continue Reading…

I was cleaning out my (metaphorical) sock drawer and came across this rough guide to the workings of the Australian Defence standard on software safety DEF(AUST) 5679. The guide was written around 2006 for Issue 1 of the standard, although many of the issues it discussed persisted into Issue 2, which hit the streets in 2008.

DEF (AUST) 5679 is an interesting standard, one can see that the authors, Tony Cant amongst them, put a lot of thought into the methodology behind the standard, unfortunately it’s suffered from a failure to achieve large scale adoption and usage.

So here’s my thoughts at the time on how to actually use the standard to best advantage, I also threw in some concepts on how to deal with xOTS components within the DEF (AUST) 5679 framework.

Enjoy 🙂


Or how do we measure the unknown?

The problem is that as our understanding and control of known risks increases, the remaining risk in any system becomes increasingly dominated by the ‘unknown’. The higher the integrity of our systems the more uncertainty we have over the unknown and unknowable residual risk. What we need is a way to measure, express and reason about such deep uncertainty, and I don’t mean tools like Pascalian calculus or Bayesian prior belief structures, but a way to measure and judge ontological uncertainty.

Even if we can’t measure ontological uncertainty directly perhaps there are indirect measures? Perhaps there’s a way to infer something from the platonic shadow that such uncertainty casts on the wall, so to speak. Nassim Taleb would say no, the unknowability of such events is the central thesis of his Ludic Fallacy after all. But I still think it’s worthwhile exploring, because while he might be right, he may also be wrong.

*With apologies to Nassim Taleb.


Well if news from the G20 is anything to go by we may be on the verge of a seismic shift in how the challenge of climate change is treated. Our Prime Minister’s denial notwithstanding 🙂


A report issued by the US Chemical Safety Board on Monday entitled “Regulatory Report: Chevron Richmond Refinery Pipe Rupture and Fire,” calls on California to make changes to the way it manages process safety.

The report is worth a read as it looks at various regulatory regimes in a fairly balanced fashion. A strong, independent, competent regulator is seen as a key factor for success by the report’s authors, regardless of the regulatory mechanisms. I don’t however think the evidence is as strong as the report makes out that safety case/goal based safety regimes perform ‘all that better’ than other regulatory regimes. It would also have been nice if they’d compared and contrasted against other industries, like aviation.

Well it was either Crowley or Kylie Minogue given the title of the post, so think yourselves lucky (Image source: Warner Brothers TV)

Sometimes it’s just a choice between bad and worse

If we accept that different types of uncertainty create different types of risk then it follows that we may in fact be able to trade one type of risk for another, and in certain circumstances this may be a preferable option.

Continue Reading…

Midlands hotel

A quick report from sunny Manchester, where I’m attending the IET’s annual combined conference on system safety and cyber security. Day one of the conference proper and I got to lead off with the first keynote. I was thinking about getting everyone to do some Tai Chi to limber up (maybe next year). Thanks once again to Dr Carl Sandom for inviting me over, it was a pleasure. I just hope the audience felt the same way. 🙂

Continue Reading…

Interesting article on old school rail safety and lessons for the modern nuclear industry. As a somewhat ironic addendum the early nuclear industry safety studies also overlooked the risks posed by large inventories of fuel rods on site, the then assumption being that they’d be shipped off to a reprocessing facility as soon as possible, it’s hard to predict the future. 🙂

Small world update


And in news just to hand, the first Ebola case is reported in the US. It’ll be very interesting to see what happens next, and how much transmission rate is driven by cultural and socio-economic effects…


Dear AGL,

I realise that you are not directly responsible for the repeal of the carbon tax by the current government, and I also realise that we the voting public need to man up and shoulder the responsibility for the government and their actions. I even appreciate that if you did wish to retain the carbon tax as a green surcharge, the current government would undoubtedly act to force your hand.

But really, I have to draw the line at your latest correspondence. Simply stamping the latest bill with “SAVINGS FROM REMOVING THE CARBON TAX” scarcely does the benefits of this legislative windfall justice. You have, I fear, entirely undersold the comprehensive social, moral and economic benefits that accrue through the return of this saving to your customers. I submit therefore for your corporate attention some alternative slogans:

  • “Savings from removing the carbon tax…you’ll pay for it later”
  • “Savings from removing the carbon tax…buy a bigger air conditioner, you’ll need it”
  • “Savings from removing the carbon tax…we also have a unique coal seam investment opportunity”
  • “Savings from removing the carbon tax, invest in climate change!”
  • “Savings from removing the carbon tax, look up the word ‘venal’, yep that’s you”
  • “Savings from removing the carbon tax, because a bigger flatscreen TV is worth your children’s future”
  • “Savings from removing the carbon tax, disinvesting in the future”

So be brave and take advantage of this singular opportunity to fully invest your corporate reputation in the truly wonderful outcomes of this prescient and clear sighted decision by our federal government.

Yours respectfully


An interesting post by Mike Thicke over at Cloud Chamber on the potential use of prediction markets to predict the location of MH370. Prediction markets integrate ‘diffused’ knowledge using a market mechanism to derive a predicted likelihood: essentially, market prices are assigned to various outcomes and are treated as analogs of their likelihood. Market trading then establishes what the market ‘thinks’ is the value of each outcome. The technique has a long and colourful history, but it does seem to work. As an aside, prediction markets are still predicting a No vote in the upcoming referendum on Scottish Independence despite recent polls to the contrary.

Returning to the MH370 saga, if the ATSB is not intending to use a Bayesian search plan then one could in principle crowd source the effort through such a prediction market. One could run the market in a dynamic fashion with the market prices updating as new information comes in from the ongoing search. Any investors out there?
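For the curious, the updating step at the heart of a Bayesian search can be sketched in a few lines. This is a minimal illustration only: the cell names, the prior values and the detection probability are all hypothetical, and using normalised market prices as the prior is my assumption about how such a market might feed a search plan, not a description of any actual scheme.

```python
def update_after_failed_search(prior, searched, p_detect):
    """Bayes update of location probabilities after an unsuccessful search.

    prior     -- dict mapping search cell name to prior probability (sums to 1),
                 e.g. normalised prediction market prices (an assumption)
    searched  -- name of the cell that was searched without success
    p_detect  -- probability the search finds the wreckage if it is in that cell
    """
    # Probability of coming up empty handed in the searched cell.
    p_miss = 1.0 - prior[searched] * p_detect
    posterior = {}
    for cell, p in prior.items():
        if cell == searched:
            # Wreckage is there but was missed.
            posterior[cell] = p * (1.0 - p_detect) / p_miss
        else:
            # Unsearched cells gain probability proportionally.
            posterior[cell] = p / p_miss
    return posterior


# Hypothetical three cell search area with illustrative numbers.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}
posterior = update_after_failed_search(prior, "A", p_detect=0.8)
next_cell = max(posterior, key=posterior.get)
print(posterior, next_cell)
```

Each failed search pushes probability away from the searched cell and onto the others, so the next place to look can change as the search goes on, which is the dynamic updating described above.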

Enshrined in Australia’s current workplace health and safety legislation is the principle of ‘So Far As Is Reasonably Practicable’. In essence SFAIRP requires you to eliminate risk, or to reduce it to as negligible a level as is (surprise) reasonably practicable. While there’s been a lot of commentary on the increased requirements for diligence (read industry moaning and groaning) there’s been little or no consideration of what ‘theory of risk’ backs this legislative principle and how it shapes the current legislation, let alone whether for good or ill. So I thought I’d take a stab at it. 🙂 Continue Reading…