In case you’ve been wondering what I’ve been doing since I said finis to this blog, the answer is, I’ve been writing. In fact I’ve been writing a book, titled Critical Uncertainties. After several false starts, and putting it down for a bit, I’ve got it to a stage where I’m ready to share a draft. There are still a couple of the practice chapters to go, I’m still working on human factors, and I may add a safety case chapter, but at this point I’d be interested in feedback.

Update 15 June: Draft chapter on Off the Shelf added.

Update 4 Oct: First draft completed, phew! Now onto the first edit.

Here in one handy post are the various criticisms I’ve made of ATAGI’s performance:

Really quite depressing that a blue ribbon committee such as this couldn’t see its way to getting expert advice on the one thing it was there to provide expert advice on: risk.

ATAGI provided an estimate of the risk associated with TTS events (bad post-vaccination events), based on the evidence available at the time. The problem with this is that we run into de Moivre’s ‘law of large numbers’, which in essence says we can expect small samples to have much greater variability than large ones. As a result, when tracking vaccine side effects you can expect the estimated rate of occurrence to be all over the shop initially, because one incident against a very low number of vaccinations (our sample) can skew the incident rate a lot.

To give you a feel for this effect, our friend de Moivre found that the size of a typical discrepancy (the standard deviation) goes up in proportion to the square root of the number of samples. As we divide that discrepancy by the total number of samples to get our proportional rate, the proportional discrepancy increases as our sample gets smaller. It’s a bit like throwing a small rock into a small pool (proportionally a big splash) and then retrieving it and tossing it into a big lake (proportionally a small splash).
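To put some numbers on de Moivre’s point, here’s a quick simulation (my own sketch, nothing to do with ATAGI’s actual method; the ‘true’ rate of 2.5 per 100,000 is purely illustrative):

```python
import numpy as np

TRUE_RATE = 2.5e-5   # hypothetical: 2.5 events per 100,000 doses
TRIALS = 10_000      # number of simulated surveillance runs

def rate_spread(n_doses, seed=1):
    """Simulate TRIALS surveillance runs of n_doses each and return the
    5th and 95th percentile of the observed rate (per 100,000)."""
    rng = np.random.default_rng(seed)
    events = rng.binomial(n_doses, TRUE_RATE, size=TRIALS)
    rates = events / n_doses * 1e5
    return np.percentile(rates, 5), np.percentile(rates, 95)

for n in (40_000, 4_000_000):
    lo, hi = rate_spread(n)
    print(f"{n:>9} doses: estimates range from {lo:.1f} to {hi:.1f} per 100,000")
```

At forty thousand doses the estimate swings wildly; at four million it settles down near the true rate. Exactly the behaviour a surveillance program needs to account for before quoting raw frequencies.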

So what ATAGI did wrong with their initial estimate of TTS was not considering that for a very rare event like TTS the population needed to dial back the variance associated with a proportionally small sample was going to be big, and that they needed to account for it. But they didn’t. Instead they just threw the raw frequencies into the risk assessment. Of course the world has moved on and it turns out that the initial estimates greatly overestimated the rate of TTS. Well, I guess an 18th century mathematician still has a few things to teach ATAGI.

So in a previous post I outlined why the risk comparison that ATAGI purported to perform was fatally flawed. But unfortunately it’s worse than that.

ATAGI’s risk comparison is based on a side by side comparison of TTS occurrence rates against Covid 19 deaths for a specific age cohort. But this is only valid if all TTS events result in death. On the face of it that didn’t seem right, so I went and pulled the prior ATAGI advice and, as it turns out, based on their data the risk of death is 3% if you get a clot. So the actual risk of death due to TTS is 0.081 per 100,000 in the age group 50-59, compared to the estimated risk of Covid 19 death for that age group in a moderate outbreak of 0.1 per 100,000. Close enough to say the risks are equivalent.
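The arithmetic here is simple enough to check in a couple of lines. Note that the 2.7 per 100,000 TTS incidence is my back-calculation from the 0.081 figure, not a number ATAGI quote directly:

```python
# Figures as quoted above; the 2.7 per 100,000 TTS incidence for the
# 50-59 cohort is inferred from the 0.081 death-risk figure (my inference).
tts_rate_per_100k = 2.7      # TTS incidence, age 50-59
death_given_tts = 0.03       # ~3% of TTS cases are fatal, per ATAGI's own data

tts_death_rate = tts_rate_per_100k * death_given_tts
covid_death_rate = 0.1       # Covid 19 deaths per 100,000, moderate outbreak

print(f"TTS death risk:   {tts_death_rate:.3f} per 100,000")
print(f"Covid death risk: {covid_death_rate:.3f} per 100,000")
```

Comparing the raw TTS event rate against Covid deaths, rather than deaths against deaths, is where the factor-of-thirty-plus distortion creeps in.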

Putting it simply and bluntly, ATAGI managed to overestimate the risk of TTS by two orders of magnitude. That overestimate runs all the way through the rest of their risk estimates, heavily skewing them against Astra Zeneca. Why did such a basic error occur? Well, it’s pretty much an open secret that there have been ongoing internal factional fights within ATAGI over Pfizer versus Astra Zeneca, and I’d surmise that whoever put this risk assessment together got the number they expected (wanted to see) and didn’t bother to check it.

If ATAGI had just checked the damn numbers they wouldn’t have made this stupid error, and they wouldn’t have issued the advice that sent the Commonwealth Government into a flat spin and led to the trashing of Astra Zeneca, the only vaccine we had much of at the time. Slow hand clap for ATAGI.

In an alternate universe of course someone did pick up the error, the advice remained unchanged, that limousine driver in Sydney got his Astra Zeneca jab and the Delta outbreak in Sydney never took off. I’d really like to live in that universe, wouldn’t you?

### Or how not to do risk assessments

To set the scene: here in Australia there is a group called the Australian Technical Advisory Group on Immunisation (ATAGI) that provides advice to the federal government’s Health Minister on the safety and use of vaccines. Part of their job during the pandemic has been advising on how the various vaccines should be rolled out, to which age groups and so on. ATAGI originally recommended that for those 50 years and under Pfizer was preferred over Astra Zeneca, due to clotting risks in younger individuals, and that, “given there is currently no or limited community transmission in Australia”, we could afford to wait for Pfizer to turn up. On 17 June 2021 they updated their advice and extended their recommendation to the 50 to 69 age cohort due to several incidents of thrombosis with thrombocytopenia syndrome, TTS for short. They again reaffirmed that this was based on no or limited transmission in Australia.

So what is wrong with this advice? Well, quite a lot. One of the principal rules of risk assessment is that you should always be careful to compare risks that are equivalent, technically this is called commensurability. So can we compare the risk of getting Covid 19 with that of having a blood clot as a result of a vaccine? Let’s start with another example drawn from earlier in this pandemic. Early on, the lockdown sceptics were pointing out that your risk of drowning in a pool, in California, was much higher than that of dying from Covid 19, so why worry? If you feel this is intuitively wrong, in fact wronger than wrong, then yes, you’re on the right track.

The two probabilities that underpin these risks are in fact radically different. In the first case, if I drown in my pool it’s not going to have any effect on the probability of my neighbour drowning; we can say these events are independent, and so are their probabilities. But on the other hand, if I contract Covid 19 you’ll find that the probability of my neighbour also getting Covid 19 is dependent on the probability that others (including me) are infected. In the first scenario we can truck along with a constant rate (and risk) of drowning events, each unaffected by the others, but in the second scenario the events can affect each other’s probabilities and the risk can suddenly blow up. Thus these two risks are fundamentally incommensurable because of the underlying difference between dependent and independent probabilities.

In their estimate ATAGI compared exactly these two different types of probability, and risk, and made exactly the same error. The risk of clotting is clearly independent while that of dying from Covid 19 is very much dependent; just like our prior example these are not the same sort of risks, and just like our prior example you cannot compare them.
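If you’d rather see the difference than take my word for it, here’s a toy simulation (all numbers illustrative, mine not ATAGI’s) contrasting an independent-event process with a dependent, epidemic-style one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent events: each person drowns with a fixed tiny probability,
# regardless of what happens to anyone else (numbers purely illustrative).
POP, P_DROWN = 1_000_000, 1e-5
drownings = [int(rng.binomial(POP, P_DROWN)) for _ in range(10)]

# Dependent events: each infection seeds roughly R new infections in the
# next generation, so today's count depends on yesterday's.
R, infected = 2.5, [10]
for _ in range(9):
    infected.append(int(rng.poisson(R * infected[-1])))

print("drownings per period: ", drownings)   # hovers around ten, period after period
print("infections per period:", infected)    # compounds and blows up
```

The drowning rate trucks along at a steady level; the infection count compounds. Two processes, two fundamentally different kinds of risk.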

So could ATAGI retrieve their risk assessment? To do that they would have to meaningfully assess an individual’s risk while they wait, patiently, for Pfizer. We can in theory apply what’s called Pascal’s parallel worlds approach to the problem. That is, we imagine all the different parallel worlds (or scenarios) in which an individual might be infected, and the consequences, as well as those in which they are not infected, and based on the relative proportions of outcomes we assign probabilities. This all sounds fine except for one small problem, called the ergodic fallacy: we don’t actually live in these parallel worlds. You and I and everyone else live our lives on a single timeline where getting Covid can kill us, and it doesn’t matter that a thousand other ‘us’ in other timelines have not. Where there is risk of ruin these sorts of exercises can and do deliver alarming underestimates of risk. We might for example consider the young woman who recently died of Covid 19 in Sydney and whether, knowing what lay ahead, she would have preferred to have taken Astra Zeneca (2). I think perhaps so. So no, from an individual’s perspective such prognostication about the future cannot rescue ATAGI’s assessment. As N. Taleb points out, when there’s risk of ruin in the house we shouldn’t try to estimate the probability of such uncertain and unknowable events, we should just focus on eliminating the risk.

Adding further to the problem, when ATAGI actually got around to performing a cost versus benefit study (ATAGI advice 21 June 2021) they compared the rate of TTS against the rates of hospitalisation and death for Covid 19. But this makes no sense, as a TTS is an event that can have a range of outcomes, so their risk assessment is misleading in that it compares a naive event rate on the one hand with various rates of loss outcomes on the other. Nowhere in their study do they identify exactly what the probabilities of death or an ICU stay are if a TTS does occur, which on the face of it skews the risk assessment and makes TTS appear to be a bigger risk than it is (2). Further compounding the confusion, ATAGI then go on to point out that their probability estimates for TTS are uncertain as they’re based on the small number of people under 50 vaccinated with Astra Zeneca in Australia (3). Well OK, but in that case would it be too much to ask for a confidence interval on their estimate? Or to explicitly compare it to international data?
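For what it’s worth, putting a confidence interval on a rare-event rate is a one-function job. Here’s a sketch using the standard Wilson score interval, with made-up numbers (2 events in 70,000 doses, purely illustrative) to show just how wide the interval gets at small sample sizes:

```python
import math

def wilson_interval(events, n, z=1.96):
    """95% Wilson score interval for a proportion; a standard,
    well-behaved choice for rare-event rates."""
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Illustrative only: suppose 2 TTS events observed in 70,000 doses.
lo, hi = wilson_interval(2, 70_000)
print(f"point estimate {2/70_000*1e5:.1f} per 100,000, "
      f"95% CI {lo*1e5:.1f} to {hi*1e5:.1f} per 100,000")
```

An interval spanning an order of magnitude is itself useful information, which is rather the point: saying ‘it’s uncertain’ without quantifying the uncertainty tells the reader nothing.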

The result of ATAGI’s sloppy thinking and analysis is that it encouraged the ‘waiting for Pfizer’ mindset even in those who were eligible for Astra Zeneca according to ATAGI’s own recommendations. And as we all now know it was one unvaccinated limousine driver, who’d declined Astra Zeneca because he was waiting for his Pfizer, that sparked off the current Delta outbreak in Australia.

Here we are, thanks in no small measure to ATAGI.

Notes

1. Also, in the age group 18-49 they estimate zero deaths due to Covid 19 for a moderate outbreak. Now that can’t be right, as we’ve already had deaths in this age cohort. It’s rare, perhaps not so rare with Delta, but it does happen. One meta-study estimates a median Infection Fatality Rate of 0.002% at age 10 and 0.01% at age 25. How they arrived at zero is a mystery; risk doesn’t just magically go to zero, even for a moderate outbreak. So again it appears that ATAGI are underplaying the risks of Covid 19 for the younger cohorts.

2. I went and pulled the prior ATAGI advice, and the risk of death is 3% if you get a clot. So the risk of death due to TTS is 0.081 per 100,000 in the age group 50-59, compared to their risk of Covid 19 death in a moderate outbreak of 0.1 per 100,000. I mean, really ATAGI?

3. Where there is uncertainty in an estimator we should do more than just point at it and say, look, it’s uncertain. There are well-tried statistical techniques that can do just that.

Imperial College London just updated their report on interventions (other than pharmacological) to reduce death rates and prevent the health care system being overwhelmed. The news is not good.

They first modelled traditional mitigation strategies that seek to slow but not stop the spread, e.g. flatten the curve, for Great Britain and the United States. For an unmitigated epidemic they found you’d end up with 510,000 deaths in Great Britain, without even accounting for the effect on mortality of health systems being overwhelmed. Even with a fully optimised set of mitigations in place they found that this would only reduce peak critical care demand by two-thirds and halve the number of deaths. Yet this ‘optimal’ scenario would still result in an 8-fold higher peak demand on critical care beds over available capacity.

Their conclusion? That epidemic suppression is the only viable strategy at the current time. This has profound implications for Australia which still appears to be on a mitigation path. First, even if we do our very best there will be a reduction of at best 50% in the death rate. This translates to on the order of at least 100,000 deaths. To put that in context that’s more than Australia lost in two world wars. The associated number of sick patients would undoubtedly also overwhelm our critical care system.
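For transparency, here’s the back-of-envelope arithmetic behind that 100,000 figure (my own scaling of the Imperial College number by rounded population ratios, not their calculation):

```python
# Scale Great Britain's unmitigated death estimate to Australia by
# population (both populations rounded), then apply the best-case ~50%
# reduction from mitigation. Crude, but it shows the order of magnitude.
gb_deaths_unmitigated = 510_000
gb_population = 66.5e6
au_population = 25.5e6

au_unmitigated = gb_deaths_unmitigated * au_population / gb_population
au_mitigated = au_unmitigated * 0.5     # best-case halving of deaths
print(f"~{au_mitigated:,.0f} deaths")   # on the order of 100,000
```

Crude as it is, the scaling lands squarely on the order of 100,000, which is the point.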

The only viable alternative Imperial College identified was to act to suppress the epidemic, i.e. to reduce R (the reproduction number) to close to 1 or below. To do so would need a combination of strict case isolation, population level social distancing, household quarantine and/or school and university closures. This suppression would need to be in place for at least five months. Having suppressed the epidemic, a combination of rigorous case isolation and contact tracing would (hopefully) then be able to deal with subsequent outbreaks.
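The leverage of getting R below 1 is easy to see with a toy geometric model (mine, not Imperial College’s; a ‘generation’ is roughly five days):

```python
# Case numbers multiply by R each generation, so small differences in R
# compound into enormous differences in outcome (all numbers illustrative).
def cases_after(r, generations, seed_cases=100):
    return seed_cases * r ** generations

for r in (2.5, 1.1, 0.9):
    print(f"R={r}: {cases_after(r, 10):,.0f} cases after 10 generations")
```

Above 1 the epidemic compounds; below 1 it withers. That threshold behaviour is why suppression, not mitigation, is the load-bearing distinction.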

However Australia is not doing this; the Prime Minister has made it very plain that he’s not going to ‘turn Australia off and then back on again’. We also seem to be underestimating the numbers (see this Guardian article on NSW Health’s estimates). So in the absence of the State Governments breaking ranks we are now on an express train ride (see chart below) to a national disaster of epic proportions. Jesus.

This will be the last post on this website, so if you want to grab some of the media available under useful stuff feel free.

Well, as someone said, because it’s the worst of social media combined with the worst of corporate culture and the worst of website design. Because dealing with it regularly is about as interesting as cleaning out my sock drawer, and because the tone, like an endless Ritalin-fuelled Rotary meeting, is just plain unhealthy. The philosopher Kant once said that you should always treat human beings as ends in themselves, and never as just the means to an end. Well LinkedIn, for crimes against the categorical imperative alone, you have to go…

So ten years on from the Black Saturday fires we’re taking a national moment to remember the unstinting heroism displayed in the face of the hell that was Black Saturday, and all that was lost. Of course we’re not quite so diligent in remembering that we haven’t prevented people rebuilding in high risk areas, nor improved the fire resistance of people’s homes, nor yet managed down the risk through realistic burn-off programs; nor for that matter have we noticed that the fuel burden in 2018 is back up to the same level it was ten years ago. So perhaps instead we should reflect on how we’ve squandered the opportunity for reform. And perhaps we should remember that a fire will come again, to burn our not so clever country.

## David Collins

The deadline for you to opt out of the government’s ill advised national health record system is rapidly approaching, and for the record, yes, I have opted out. I’ll give you a concrete example of what I’m talking about when I say ‘ill advised’: currently it’s assumed that you’re OK to share your anonymised medical data for research purposes, because sharing is set as the default. This is despite it being shown time and time again that the anonymisation of such data just doesn’t work. You might share my concern at this level of indifference to the idea of informed consent. What the agencies of the state clearly don’t get is that this information belongs to you and me; it doesn’t belong to your doctor, your medical data is yours and your doctor holds it in trust for you. Until the state demonstrates a clear and unequivocal understanding of that point I say no thanks, and I’d invite you all to do the same. My Health Record? Not so much.

PS. The architect of My Health Record is Tim Kelsey, yes the same Tim Kelsey who presided over the UK Government’s Care.data program, which tanked over sharing data without explicit consent. And unfortunately for us that attitude is baked into My Health Record’s DNA.

PPS. To me the carelessness of the government in this whole affair is indicative of the increasingly totalitarian relationship between the government and the people.

## Dr Phil Koopman (on driverless cars)

Here’s a view from inside Tesla by one of its former employees. Taking the report at face value, which is of course an arguable proposition, you can see how technical debt can build up to a point where it’s near impossible to pay down. That in turn can have significant effects on the safety performance of the organisation; see the Toyota spaghetti code case as another example. The take-home is that for any software safety effort it’s a good idea to check whether the company/team is measuring technical debt in a meaningful fashion and actively retiring it, for example by alternating capability and maintenance updates.

Tesla and technical debt.

## Earl Wiener (1980)

If you want to know where Crew Resource Management as a discipline started, then you need to read NASA Technical Memorandum 78482, “A Simulator Study of the Interaction of Pilot Workload With Errors, Vigilance, and Decisions” by H.P. Ruffell Smith, the British-born physician and pilot. Before this study it was hours in the seat and line seniority that mattered when things went to hell. After it the aviation industry started to realise that crews rose or fell on the basis of how well they worked together, and that a good captain got the best out of his team. Today, whether crews get it right, as they did on QF72, or terribly wrong, as they did on AF447, the lens we view their performance through has been irrevocably shaped by the work of Ruffell Smith. From little seeds great oaks grow indeed.

When you look at the safety performance of industries with a consistent focus on safety as part of their social permit, nuclear and aviation being the canonical examples, you see that over time increases in safety tend to plateau out. This looks like some form of learning curve, but what is the mechanism, or mechanisms, that actually drives this process? I believe there are two factors at play: firstly the increasing marginal cost of improvement, and secondly the problem of learning from events that we are trying to prevent.

Increasing marginal cost is simply an economist’s way of stating that it will cost more to achieve the next increment in performance. For example, airbags are more expensive than seat belts by roughly an order of magnitude (based on replacement costs), yet airbags only deliver an 8% reduction in mortality when used in conjunction with seat belts, see Crandall (2001). As a result the next increment in safety takes longer and costs more (1).

The learning factor is in some ways an informational version of the marginal cost rule. As we reduce accident rates, accidents become rarer. One of the traditional ways in which safety improvements occur is by studying accidents when they occur and then eliminating or mitigating the identified causal factors. Obviously, as the accident rate decreases, the opportunity for improvement likewise decreases. When accidents do occur we have a further problem, because (definitionally) the cause of the accident will comprise a highly unlikely combination of factors needed to defeat the existing safety measures. Corrective actions for such rare combinations of events are therefore highly specific to that event’s context and conversely have far less universal applicability. For example, the lessons of metal fatigue learned from the Comet airliner disasters have had universal applicability to all aircraft designs ever since. But the QF72 automation upset off Learmonth? Those lessons, relating to the specific fault tolerance architecture of the A330, are much harder to generalise and therefore have less epistemic strength.

In summary, not only does each increasing increment of safety cost more, but our opportunity to learn from accidents steadily reduces as their arrival rate and individual epistemic value (2) decline.

#### Notes

1. In some circumstances we may also introduce other risks; see for example the deaths and severe injuries caused to small children by air bag deployments.

2. In a Popperian sense.

#### References

1. Crandall, C.S., Olson, L.M., Sklar, D.P., Mortality Reduction with Air Bag and Seat Belt Use in Head-on Passenger Car Collisions, American Journal of Epidemiology, Volume 153, Issue 3, 1 February 2001, Pages 219–224, https://doi.org/10.1093/aje/153.3.219.

Update to the MH-370 hidden lesson post just published, in which I go into a little more detail on what I think could be done to prevent another such tragedy.

The search for MH370 will end next Tuesday with the question of its fate no closer to resolution. There is perhaps one lesson that we can glean from this mystery: when we have a two-man crew behind a terrorist-proof door, there is a real possibility that disaster is check-riding the flight. As Kenedi et al. note in a 2016 study, five of the six recorded murder-suicide events by pilots of commercial airliners occurred after one pilot was left alone in the cockpit; in the case of both Germanwings 9525 and LAM 470 this was enabled by one of the crew being able to lock the other out of the cockpit. So while we don’t know exactly what happened onboard MH370, we do know that the aircraft was flown deliberately to some point in the Indian Ocean, and on the balance of probabilities that was done by one of the crew, with the other crew member unable to intervene, probably because they were dead.

As I’ve written before, the combination of small crew sizes to reduce costs and a secure cockpit to reduce hijacking risk increases the probability of one crew member being able to successfully disable the other and then do exactly whatever they like. Thus the increased hijacking security measures act as a perverse incentive for pilot murder-suicide and may, over the long run, turn out to kill more people than the risk of terrorism (1). Or to put it more brutally, murder and suicide are much more likely to succeed with small crew sizes, so these scenarios, however dark they may be, need to be guarded against in an effective fashion (2).

One way to guard against such common mode failures of the human is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to our affect-driven processing, with a sufficient grasp of theory of mind and the subtleties of human psychology and group dynamics to make usefully accurate predictions of what the crew will do next. With that insight goes the requirement for autonomy in vetoing illogical and patently hazardous crew actions, e.g. “I’m sorry Captain, but I’m afraid I can’t let you reduce the cabin air pressure to hazardous levels”. The really difficult problem is of course building something sophisticated enough to understand ‘hinky’ behaviour and then intervene. There are however other scenarios where some form of lesser AI would be of use. The Helios Airways depressurisation is a good example of an incident where both flight crew were rendered incapacitated, so a system that does the equivalent of “Dave! Dave! We’re depressurising, unless you intervene in 5 seconds I’m descending!” would be useful. Then there’s the good old scenario of both pilots falling asleep, as likely happened at Minneapolis, where something like “Hello Dave, I can’t help but notice that your breathing indicates that you and Frank are both asleep, so WAKE UP!” would be helpful. Oh, and someone to punch out a quick “Mayday” while the pilots are otherwise engaged would also help tremendously, as aircraft going down without a single squawk recurs again and again.

I guess I’ve slowly come to the conclusion that two-man crews, while optimised for cost, are distinctly sub-optimal when it comes to dealing with a number of human factors issues, and likewise sub-optimal when dealing with major ‘left field’ emergencies that aren’t in the QRH. Fundamentally, a dual redundant design pattern for people doesn’t really address the likelihood of what we might call common mode failures. While we probably can’t get another human crew member back in the cockpit, working to make the cockpit automation more collaborative and less ‘strong but silent’ would be a good start. And of course if the aviation industry wants to keep making improvements in aviation safety then these are the sort of issues they’re going to have to tackle. Where is a good AI, or even an uninterruptible autopilot, when you really need one?

Notes

1. Kenedi (2016) found that from 1999 to 2015 there were 18 cases of homicide-suicide involving 732 deaths.

2. ‘No go alone’ rules are unfortunately only partially effective.

References

Kenedi, C., Friedman, S.H., Watson, D., Preitner, C., Suicide and Murder-Suicide Involving Aircraft, Aerospace Medicine and Human Performance, Aerospace Medical Association, 2016.

## Bath Iron Works Corporation Report 1995

One of the things that they don’t teach you at university is that as an engineer you will never have enough time. There’s never the time in the schedule to execute that perfect design process in your head, the one that will answer all questions, satisfy all stakeholders and optimise your solution to three decimal places. Worse yet, you’re going to be asked to commit to parts of your solution in detail before you’ve finished the overall design because, for example, ‘we need to order the steel bill now because there’s a 6 month lead time, so where’s the steel bill?’. Then there’s dealing with the ‘internal stakeholders’ in the design process, who all have competing needs and agendas. You generally end up with the electrical team hating the mechanicals, nobody talking to structures and everybody hating manufacturing (1).

So good engineering managers (2) spend a lot of their time managing the risk of early design commitment and the inherent concurrency of the program, disentangling design snarls and adjudicating turf wars over scarce resources (3). Get it right and you’re only late by the usual amount; get it wrong and very bad things can happen. Strangely you’ll not find a lot of guidance on these issues in traditional engineering education, but for my part what I’ve found helpful is a pragmatic design process that actually supports you in doing the tough stuff (4). Oh, and being able to manage outsourcing of design would also be great (5). This all gets even more difficult when you’re trying to do vehicle design, which I liken to trying to fit 5 litres of stuff into a 4 litre container. So at the link below is my architecting approach to managing at least part of the insanity. I doubt that there’ll ever be a perfect answer to this, far too many constraints, competing agendas and the just plain cussedness of human beings. But if your last project was anarchy dialled up to eleven it might be worth considering some of these approaches. Hope it helps, and good luck!

### Notes

1. It is a truth universally acknowledged that engineers are notoriously bad at communicating.

3. Such as cable and piping routes, whose sensor goes at the top of the mast, mass budgets, and power constraints. I’m sure we’ve all been there.

4. My observation is that (some) engineers tend to design their processes to be perfect and conveniently ignore the ugly messiness of the world, because they are uncomfortable with being accountable for decisions made under uncertainty. Of course when you can’t both follow these processes and get the job done, these same engineers will use this as a shield from all blame, e.g. ‘If you’d only followed our process…’ say they; ‘sure…’ say I.

5. A traditional management ploy to reduce costs, but rarely does management consider that you then need to manage that outsourced effort, which takes a particular mix of skills. Yet another kettle of dead fish, as Boeing found out on the B787.

### Reference

So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?

In slightly longer form: if (for example) air data is so unreliable that your automation needs to automatically drop out of its primary mode, and your QRH procedure is then to manually fly pitch and thrust (1), then why not also automatically present a display page that provides only the data that pilots can trust and need to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’, where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software, it’s over to you brave aviator” (3). This response is however something of a cop-out, as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:

1. Redesign the attitude display with articulated pitch ladders, or a Malcolm’s horizon, to improve situational awareness.
2. Provide a fallback AoA source using an AoA estimator.
3. Provide actual direct access to flight data parameters such as mach number and AoA to support troubleshooting (5).
4. Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
5. Use non-Aristotelian logic to better model the trustworthiness of air data.
6. Provide the current master/slave hierarchy status amongst voting channels to aircrew.
7. Provide an obvious and intuitive way to remove a faulted channel, allowing flight under reversionary laws (7).
8. Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).
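As a sketch of what items 6 and 7 might look like in software terms, here’s a toy mid-value voter (my own illustration, not Airbus’s actual design, and the channel names and values are made up) that picks the median of redundant channels and flags the dissenter for the crew:

```python
def vote(readings, tolerance):
    """Return the median of the channel readings, plus any channels
    that disagree with the median by more than the tolerance."""
    ordered = sorted(readings.values())
    median = ordered[len(ordered) // 2]
    suspects = [ch for ch, v in readings.items() if abs(v - median) > tolerance]
    return median, suspects

# Hypothetical AoA readings in degrees: one channel is spiking.
aoa = {"ADR1": 2.1, "ADR2": 2.0, "ADR3": 14.7}
value, faulted = vote(aoa, tolerance=1.0)
print(f"voted AoA {value:.1f} deg, suspect channels: {faulted}")
```

The voted value masks the fault, but surfacing the suspect channel is the part that matters for the crew: it tells them which ADR to deselect and why.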

As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish; in fact it is likely to increase, if the past history of automation is any guide to the future.

#### Notes

1. The BEA report on the AF447 disaster surveyed Airbus pilots on their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps, as the aircraft was already in a safe state, and waited for the icing-induced condition to clear.

2. Although the Airbus Back Up Speed Display (BUSS) does use angle-of-attack data to provide a speed range, and GPS height data to replace barometric altitude, it has problems at high altitude, where mach number rather than speed becomes significant and the stall threshold changes with mach number (which it does not know). As a result its use is (as per Airbus manuals) restricted to below FL 250.

3. What system designers do, in the abstract, is decompose and allocate system level behaviours to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except, ‘apparently’, if the component in question is a human and therefore considered to be ‘outside’ your system.

4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools would allow the human and the automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way, so as to develop a shared understanding of the situation (6).

5. For example, in the Airbus design, although AoA and mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second, they are not directly available to aircrew.

6. Yet another way of looking at the problem is that the principles of ecological design need to be applied to the aircrew task of dealing with contingency situations.

7. For example, in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel and deselect two ADRs… which ones, and the criterion for choosing them, are not however detailed by the manufacturer.

8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.

One of the perennial problems we face in a system safety program is how to come up with a convincing proof of the proposition that a system is safe. Because it’s hard to prove a negative (in this case the absence of future accidents) the usual approach is to pursue a proof by contradiction: develop the negative proposition that the system is unsafe, then prove that this is not true, normally by showing that the set of identified specific propositions of ‘un-safety’ has been eliminated or controlled to an acceptable level. Enter the term ‘hazard’, which in this context is simply shorthand for a specific proposition about the unsafeness of a system. Now interestingly, when we parse the set of definitions of hazard we find the recurring use of terms like ‘condition’, ‘state’, ‘situation’ and ‘event’ that, should they occur, will inevitably lead to an ‘accident’ or ‘mishap’. So broadly speaking a hazard is an explanation, based on a defined set of phenomena, which argues that if those phenomena are present, and given there exists some relevant domain source (1) of hazard, an accident will occur. All of which seems to indicate that hazards belong to a class of explanatory models called covering laws. As an explanatory class, covering law models were developed by the philosophers of science Hempel and Popper because of what they saw as problems with an over-reliance on inductive arguments about causality.

As a covering law explanation of unsafeness, a hazard posits phenomenological facts (system states, human errors, hardware/software failures and so on) that confer what’s called nomic expectability on the accident (the thing being explained). That is, the phenomenological facts, combined with some covering law (natural or logical), require the accident to happen, and this is what we call a hazard. We can see an archetypal example in Swallom’s Source-Mechanism-Outcome model, i.e. if we have both a source and a set of mechanisms in that model then we may expect an accident (Ericson 2005). While logical positivism had the last nails driven into its coffin by Kuhn and others in the 1960s, and it’s true, as they pointed out, that covering law explanations have their fair share of problems, so too do other methods (2). The one advantage that covering law models do possess over other explanatory models, however, is that they largely avoid the problems of causal arguments. Which may well be why they persist in engineering arguments about safety.

### Notes

1. The source in this instance is the ‘covering law’.

2. Such as counterfactual, statistical relevance or causal explanations.

### References

Ericson, C.A. Hazard Analysis Techniques for System Safety, page 93, John Wiley and Sons, Hoboken, New Jersey, 2005.

Here’s a working draft of the introduction and first chapter of my book…. Enjoy 🙂

We are hectored on an almost daily basis about the imminent threat of Islamic extremism and how we must respond firmly to this real and present danger. Indeed we have proceeded far enough along the escalation-of-response ladder that this, presumably existential, threat is now being used to justify talk of internment without trial. So what is the probability that, if you were murdered, the murderer would be an immigrant terrorist?

In NSW in 2014 there were 86 homicides, of which 1 was directly related to the act of a homegrown Islamist terrorist (1). So there’s a 1 in 86 chance that, if you were murdered in that year, it was at the hands of a mentally disturbed asylum seeker (2). Hmm, sounds risky, but is it? Well, there were approximately 7.5 million people in NSW in 2014, so the likelihood of being murdered (in that year) is in the first instance 86/7,500,000, or about 1.1e-5. To figure out the likelihood of being murdered and that murder being committed by a terrorist, we just multiply this base rate by the probability that it was at the hands of a ‘terrorist’, ending up with about 1.3e-7, or roughly one chance in 7.5 million that year. If we consider subsequent and prior years, where nothing happened, that likelihood becomes smaller still.
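The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope check only; the homicide counts come from the text, and the 2014 NSW population of roughly 7.5 million is an assumed round figure. Note that the joint probability reduces to 1/population whatever the homicide count, since the count cancels.

```python
# Back-of-the-envelope check of the murder-by-terrorist figures.
# Inputs (assumptions from the text): 86 homicides in NSW in 2014,
# 1 attributable to terrorism, population roughly 7.5 million.
homicides = 86
terrorist_homicides = 1
population = 7_500_000

p_murdered = homicides / population               # base rate of being murdered that year
p_by_terrorist = terrorist_homicides / homicides  # share of murders by a 'terrorist'
p_both = p_murdered * p_by_terrorist              # murdered AND by a terrorist

print(f"P(murdered)                 = {p_murdered:.2e}")
print(f"P(murdered by a terrorist)  = {p_both:.2e}")
```

The homicide count cancels out of the product, so the joint probability is simply 1/population: the more vivid the framing, the smaller the number turns out to be.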

Based on this 4 in 10 million chance the NSW government intends to build a super-max 2 prison in NSW, and fill it with ‘terrorists’ while the Federal government enacts more anti-terrorism laws that take us down the road to the surveillance state, if we’re not already there yet. The glaring difference between the perception of risk and the actuality is one that politicians and commentators alike seem oblivious to (3).

Notes

1. One death during the Lindt chocolate siege that could be directly attributed to the `terrorist’.

2. Sought and granted in 2001 by the then Liberal National Party government.

3. An action that also ignores the role of prisons in converting inmates to Islam as a route to recruiting their criminal, anti-social and violent sub-populations in the service of Sunni extremists.

The Sydney Morning Herald published an article this morning that recounts the QF72 midair accident from the point of view of the crew and passengers; you can find the story at this link. I’ve previously covered the technical aspects of the accident here, the underlying integrative architecture program that brought us to this point here, and the consequences here. So it was interesting to reflect on the event from the human perspective. Karl Weick points out in his influential paper on the Mann Gulch fire disaster that small organisations, for example the crew of an airliner, are vulnerable to what he termed a cosmology episode, that is, a moment in which one abruptly and deeply feels that the universe is no longer a rational, orderly system. In the case of QF72 this was initiated by the simultaneous stall and overspeed warnings, followed by the abrupt pitch over of the aircraft as the flight protection laws engaged for no reason.

Weick further posits that what makes such an episode so shattering is that both the sense of what is occurring and the means to rebuild that sense collapse together. In the Mann Gulch blaze the fire team’s organisation attenuated and finally broke down as the situation eroded until at the end they could not comprehend the one action that would have saved their lives, to build an escape fire. In the case of air crew they implicitly rely on the aircraft’s systems to `make sense’ of the situation, a significant failure such as occurred on QF72 denies them both understanding of what is happening and the ability to rebuild that understanding. Weick also noted that in such crises organisations are important as they help people to provide order and meaning in ill defined and uncertain circumstances, which has interesting implications when we look at the automation in the cockpit as another member of the team.

“The plane is not communicating with me. It’s in meltdown. The systems are all vying for attention but they are not telling me anything…It’s high-risk and I don’t know what’s going to happen.”

Capt. Kevin Sullivan (QF72 flight)

From this Weickian viewpoint we see the aircraft’s automation as both part of the situation, ‘what is happening?’, and as a member of the crew, ‘why is it doing that, can I trust it?’ Thus the crew of QF72 were faced with both a vu jàdé moment and the allied disintegration of the human-machine partnership that could have helped them make sense of the situation. The challenge the QF72 crew faced was not to form a decision based on clear data and well rehearsed procedures from the flight manual; instead they faced a much more unnerving loss of meaning as the situation outstripped their past experience.

“Damn it! We’re going to crash. It can’t be true!” (copilot #1)

“But, what’s happening?” (copilot #2)

AF447 CVR transcript (final words)

Nor was this an isolated incident: one study of other such ‘unreliable airspeed’ events found that errors in understanding were both far more likely to occur than other error types and, when they did occur, much more likely to end in a fatal accident. In fact the study found that all accidents with a fatal outcome involved an error in detection or understanding, with the majority being errors of understanding. From Weick’s perspective, then, the collapse of sensemaking is the knock-out blow in such scenarios, as the last words of the Air France AF447 crew so grimly illustrate. Luckily, in the case of QF72 the aircrew were able to contain this collapse and rebuild their sense of the situation; in the case of other such failures, such as AF447, they were not.

For those of you who might be wondering at the lack of recent posts I’m a little pre-occupied at the moment as I’m writing a book. Hope to have a first draft ready in July. ; )

With the NSW Rural Fire Service fighting more than 50 fires across the state and the unprecedented hellish conditions set to deteriorate even further with the arrival of strong winds the question of the day is, exactly how bad could this get? The answer is unfortunately, a whole lot worse. That’s because we have difficulty as human beings in thinking about and dealing with extreme events… To quote from this post written in the aftermath of the 2009 Victorian Black Saturday fires.

So how unthinkable could it get? The likelihood of a fire versus its severity can be credibly modelled as a power law, a particular type of heavy tailed distribution (Clauset et al. 2007). This means that extreme events in the tail of the distribution are far more likely than a gaussian (the classic bell curve) distribution would predict. So while a mega-fire ten times the size of the Black Saturday fires is far less likely, it is nowhere near as improbable as our intuitive availability heuristic would indicate. In fact it’s much worse than we might think: for heavy tailed distributions you need to apply what’s called the mean excess heuristic, which really translates to ‘the next worst event is almost always going to be much worse’…
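The gap between heavy tailed and gaussian intuitions can be made concrete with a small sketch. This is purely illustrative, not the actual fire data: the Pareto shape parameter and the gaussian mean and spread are invented for the comparison, with both distributions centred on a ‘typical’ event size of one unit.

```python
import math

# Compare the chance of an event ten times 'typical' size under a heavy
# tailed power law versus a gaussian. Parameters are illustrative only.
def pareto_tail(x, x_min=1.0, alpha=2.0):
    """P(X > x) for a Pareto (power law) distribution with scale x_min."""
    return (x_min / x) ** alpha

def gaussian_tail(x, mu=1.0, sigma=1.0):
    """P(X > x) for a normal distribution, via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

big = 10.0  # an event ten times the typical size
print(f"power law P(X > {big}) = {pareto_tail(big):.2e}")   # about 1 in 100
print(f"gaussian  P(X > {big}) = {gaussian_tail(big):.2e}")  # astronomically smaller
```

Under the power law the ten-times event retains a probability of about one in a hundred, while the gaussian assigns it an essentially impossible probability, which is exactly the trap the availability heuristic sets for us.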

So how did we get to this?  Simply put the extreme weather we’ve been experiencing is a tangible, current day effect of climate change. Climate change is not something we can leave to our children to really worry about, it’s happening now. That half a degree rise in global temperature? Well it turns out it supercharges the occurrence rate of extremely dry conditions and the heavy tail of bushfire severity. Yes we’ve been twisting the dragon’s tail and now it’s woken up…

2019 Postscript: Monday 11 November 2019 – NSW

And here we are in 2019 two years down the track from the fires of 2017 and tomorrow looks like being a beyond catastrophic fire day. Firestorms are predicted.

How algorithms can kill…

So apparently the Australian Government has been buying its software from Cyberdyne Systems, or at least you’d be forgiven for thinking so given the brutal treatment Centrelink’s autonomous debt recovery software has been handing out to welfare recipients who ‘it’ believes have been rorting the system. Yep, you heard right, it’s a completely automated compliance operation (well, at least the issuing part). Continue Reading…

A recent workplace health and safety case in Australia has emphasised that an employer does not have to provide training for tasks that are considered to be ‘relatively’ straightforward. The presiding judge also found that while changes to the workplace could in theory be made, in practice it would be unreasonable to demand that the employer make them. The judge’s decision was subsequently upheld on appeal.

What’s interesting is the close reasoning of the court (and the appellate court) to establish what is reasonable and practicable in the circumstances. While the legal system is not perfect it does have a long standing set of practices and procedures for getting at the truth. Perhaps we may be able to learn something from the legal profession when thinking about the safety of critical systems. More on this later.

Cowie v Gungahlin Veterinary Services Pty Ltd [2016] ACTSC 311 (25 October 2016)

Second part of the SBS documentary on line now. Looking at the IoT this episode.

Cyberwar documentary now running on SBS with a good breakdown of the Stuxnet malware courtesy of the boys at Symantec. Thank you NSA, once again, for the bounty of Stuxnet… Yes, indeed thank you.

Image source: AP/LM Otero

A Trump presidency in the wings who’d have thought! And what a total shock it was to all those pollsters, commentators and apparatchiks who are now trying to explain why they got it so wrong. All of which is a textbook example of what students of risk theory call a Black Swan event. Continue Reading…

## Iain Banks

### More infernal statistics

Well, here we are again. Given recent developments in the infernal region it seems like a good time for another post. Have you ever, dear reader, been faced with the problem of how to achieve an unachievable safety target? Well worry no longer! Herewith is Screwtape’s patented man based mitigation medicine.

The first thing we do is introduce the concept of ‘mitigation’, ah what a beautiful word that is. You see it’s saying that it’s OK that your system doesn’t meet its safety target, because you can claim credit for the action of an external mitigator in the environment. Probability wise, if the probability of an accident is P_a, then P_a equals the product of your system’s failure probability P_s and the probability that some external mitigation also fails P_m, or P_a = P_s × P_m.

So let’s use operator intervention as our mitigator, lovely and vague. But how to come up with a low enough P_m? Easy, we just look at the accident rate that has occurred for this or a like system and assume that these were due to operator mitigation being unsuccessful. Voila, we get our really small numbers.

Now, an alert reader might point out that this is totally bogus, and that P_m is actually the likelihood of operator failure given that the system has already failed. Operators failing, as those pestilential authors of the WASH-1400 study have pointed out, is actually quite likely. But I say, if your customer is so observant and on the ball then clearly you are not doing your job right. Try harder or I may eat your soul, yum yum.
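Screwtape’s sleight of hand can be sketched in a few lines. All the numbers here are invented for illustration: the trick is conflating a conveniently tiny ‘mitigation failure’ figure with the much higher conditional probability that a stressed operator fails to recover once the system has already failed.

```python
# Screwtape's mitigation arithmetic, with invented illustrative numbers.
p_s = 1e-4          # system failure probability (assumed)
p_m_claimed = 1e-3  # conveniently tiny 'mitigation failure' figure
p_m_actual = 0.5    # plausible chance a stressed operator fails to save the day

p_a_claimed = p_s * p_m_claimed  # meets the unachievable target!
p_a_actual = p_s * p_m_actual    # the less flattering conditional reality

print(f"claimed P_a = {p_a_claimed:.0e}")  # 1e-07
print(f"actual  P_a = {p_a_actual:.0e}")   # 5e-05
```

The formula P_a = P_s × P_m is only valid when P_m really is the conditional probability of mitigation failure given system failure, which, as WASH-1400 style human reliability data suggests, is usually orders of magnitude larger than the figure Screwtape would have you claim.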

Yours hungrily,

Screwtape.

## Graham Long

### A clank of botnets

More bad news for the Internet this week as a plague of BotNets launched a successful wave of denial of service attacks on Dyn, a dynamic domain name service provider. The attacks on Dyn propagated through to services such as Twitter (OK no great loss), Github, The Verge, Playstation Network, Box and Wix. Continue Reading…

Well hello there, it’s been a while hasn’t it?

In the absence of our good host I thought I’d just pop in and offer some advice on how to use statistics for requirements compliance. Now of course what I mean by requirements compliance is that ticklish situation where the customer has you over the proverbial barrel with an eye-gouger of a requirement. What to do, what to do. Well dear reader all is not lost, what one can do is subtly rework the requirement right in front of the customer without them even recognising it…

No! I hear you say, ‘how can this wonder be achieved Screwtape?’

Well it’s really quite simple, once one understands that requirements are, to a greater or lesser extent, ‘operationally’ defined by their method of verification. That means that just as requirements belong to the customer, so too should the method one uses to demonstrate that you’ve met them. Now if you’re in luck the customer doesn’t realise this, so you propose adopting a statistical proof of compliance, throwing in some weaselling about process variability, based on the median of a sample of tests. Using the median is important as it’s more resistant to outlier values, which is what we want to obfuscate (obviously). As the method of verification defines the requirement, all of a sudden you’ve taken the customer’s deterministic requirement and turned it into a weaker probabilistic one. Even better, you now have psychological control over half of the requirement, ah the beauty of psychological framing effects.
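The median trick is easy to demonstrate. The requirement and the test data below are invented for illustration: suppose the customer’s deterministic requirement is ‘response time shall not exceed 100 ms’, and the verification is quietly reworked to use the median of a test sample.

```python
import statistics

# Invented test sample: mostly compliant, with a few ugly outliers.
response_ms = [62, 71, 75, 80, 83, 88, 91, 140, 250, 900]

median = statistics.median(response_ms)  # resistant to outliers, as desired
worst = max(response_ms)                 # what the deterministic requirement cares about

print(f"median = {median} ms -> 'compliant' under the reworked requirement")
print(f"worst  = {worst} ms -> fails the customer's original requirement")
```

The median of 85.5 ms sails under the 100 ms limit while three individual tests, one of them nine times over, disappear from view, which is precisely the obfuscation Screwtape is selling.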

Now if you’ll excuse me, all this talk of statistics has reminded me that I have some souls to reap over at the Australian Bureau of Statistics*. Mmm, those statisticians, their souls are so dry and filled with tannin, just like a fine pinot noir.

Till the next time. Yours infernally,

Screwtape

*Downstairs senior management were not amused by having to fill out their name and then having a census checker turn up on their doorstep asking whether they were having a lend of the ABS.

Accidents of catastrophic potential pose a particular challenge to classical utilitarian theories of managing risk. A reader of this blog may be aware of how the possibility of irreversible catastrophic outcomes (i.e. non-ergodicity) undermines a key assumption on which classical risk assessment is based. But what to do about it? Well, one thing we can practically do is ensure that when we assess risk we take into account the irreversible (non-ergodic) nature of such catastrophes, and there are good reasons we should do so, as the law does not look kindly on organisations (or people) who make decisions about risk of death purely on the basis of frequency gambling.

A while ago I put together a classical risk matrix (1) that treated risk in accordance with de Moivre’s formulation, and I’ve now modified this matrix to explicitly address non-ergodicity. The modification is to the extreme (catastrophic) severity column, where I’ve shifted the boundary of unacceptable risk downwards to reflect that the (classical) iso-risk contour in the catastrophic case under-estimates the risk posed by catastrophic irreversible outcomes. The matrix now also imposes claim limits on risk where a SPOF may exist that could result in a catastrophic loss (2). We end up with something that looks a bit like the matrix below (3).

From a decision making perspective you’ll note that not only is the threshold for unacceptable risk reduced, but for catastrophic severity (one or more deaths) there is no longer an ‘acceptable’ threshold. This is an important consideration, reflecting as it does the law’s position that you cannot gamble away your duty of care, e.g. justify not taking an action purely on the basis of a risk threshold (4). The final outcome of this work, along with revised likelihood and severity definitions, can be found in hazard matrix V1.1 (5). I’m still thinking about how you might introduce more consideration of epistemic and ontological risks into the matrix; it’s a work in progress.
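The logic of the modified matrix can be sketched as follows. This is an illustrative toy, not the actual V1.1 matrix: the category names, the iso-risk scoring and the thresholds are all invented, but the two non-ergodic adjustments, a lowered unacceptable boundary for the catastrophic column and the removal of any ‘acceptable’ catastrophic cell, follow the scheme described above.

```python
# Illustrative sketch of a risk matrix with a non-ergodic catastrophic column.
# Category names, scores and thresholds are invented for illustration.
LIKELIHOOD = ["frequent", "probable", "occasional", "remote", "improbable"]
SEVERITY = ["minor", "major", "severe", "catastrophic"]

def risk_class(likelihood: str, severity: str) -> str:
    l = LIKELIHOOD.index(likelihood)  # 0 = most likely
    s = SEVERITY.index(severity)      # 3 = catastrophic
    score = (len(LIKELIHOOD) - 1 - l) + s  # classical iso-risk diagonal
    if severity == "catastrophic":
        # Non-ergodic adjustments: the unacceptable boundary is shifted
        # down, and no catastrophic cell is ever merely 'acceptable'.
        return "unacceptable" if score >= 5 else "tolerable"
    if score >= 6:
        return "unacceptable"
    return "tolerable" if score >= 4 else "acceptable"

print(risk_class("improbable", "catastrophic"))  # tolerable, never 'acceptable'
print(risk_class("remote", "minor"))             # acceptable
```

Note how a cell that a classical iso-risk contour would rate as acceptable, an improbable catastrophe, can only ever be tolerable here, which is the matrix’s way of saying that duty of care cannot be gambled away on a frequency argument alone.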

### Notes

1. Mainly to provide a canonical example of what a well constructed matrix should look like as there are an awful lot of bad ones floating around.

2. You have to either eliminate the SPOF or reduce the severity. There’s an implied treatment of epistemic uncertainty in such a claim limit that I find appealing.

3. The star represents a calibration point that’s used when soliciting subjective assessments of likelihood from SME.

4.  By the way you’re not going to find these sort of considerations in ISO 31000.

5. Important note. like all risk matrices it needs to be calibrated to the actual circumstances and risk appetite of the organisation. No warranty given and YMMV.