### Archives For Uncategorized

Latest drop of the work in progress. You can find a review copy at the link.

Here in one handy post are the various criticisms I’ve made of ATAGI’s performance:

Really quite depressing that a blue ribbon committee such as this couldn’t see its way to getting expert advice on the one thing it was there to provide expert advice on: risk.

ATAGI provided an estimate of the risk associated with TTS events (bad post vaccination events). This was at the time based on the evidence that was available. The problem with this is that we run into de Moivre’s ‘Law of Large Numbers’, which in essence says we can expect small samples to have much greater variability than larger ones. As a result, when tracking vaccine side effects you can expect the estimated rate of occurrence to be all over the shop initially, because one incident against a very low number of vaccinations (our sample) can skew the incident rate a lot.

To give you a feel for this effect, our friend de Moivre found that the size of a typical discrepancy (the standard deviation) goes up in proportion to the square root of the number of samples. As we divide that number by the total number of samples to get our proportional rate, the proportional discrepancy falls off as one over the square root of the sample size, so it grows as our sample gets smaller. It’s a bit like throwing a small rock into a small pool (big splash proportionally) and then retrieving it and tossing it into a big lake (small splash proportionally).
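A minimal sketch of the square root effect, assuming each vaccination is an independent trial with the same small chance of an adverse event (a simple binomial model). The event rate and dose counts below are hypothetical, chosen only for illustration, and are not ATAGI figures.

```python
import math

def relative_standard_error(rate, doses):
    """Typical proportional discrepancy of an observed rate under a binomial model."""
    # The standard deviation of the event count grows like sqrt(doses) ...
    count_sd = math.sqrt(doses * rate * (1.0 - rate))
    # ... while the expected count grows like doses, so the ratio shrinks like 1/sqrt(doses).
    return count_sd / (doses * rate)

ASSUMED_RATE = 2e-5  # hypothetical true rate: 2 events per 100,000 doses

for doses in (50_000, 500_000, 5_000_000):
    error = relative_standard_error(ASSUMED_RATE, doses)
    print(f"{doses:>9,} doses: typical proportional error of about {error:.0%}")
```

With only a handful of expected events the estimated rate can easily be out by its own size; a hundred times more doses cuts the proportional error by a factor of ten.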

So what ATAGI did wrong with their initial estimate of TTS was not considering that, for a very rare event like TTS, the population needed to dial back the variance associated with a proportionally small sample was going to be big, and that they needed to account for it. But they didn’t. Instead they just threw up the raw frequencies in the risk assessment. Of course the world has moved on and it turns out that the initial TTS estimates totally overestimated the rate of TTS. Well I guess a 17th century mathematician still has a few things to teach ATAGI.

So in a previous post I outlined why the risk comparison that ATAGI purported to perform was fatally flawed. But unfortunately it’s worse than that.

ATAGI’s risk comparison is based on a side by side comparison of TTS occurrence rates against Covid 19 deaths for a specific age cohort. But this is only valid if all TTS events result in death. On the face of it that didn’t seem right, so I went and pulled the prior ATAGI advice and, as it turns out, based on their data the risk of death is 3% if you get a clot. So the actual risk of death due to TTS is 0.081 per 100,000 in the age group 50-59, compared to the estimated risk of Covid 19 death for the same age group in a moderate outbreak of 0.1 per 100,000. Close enough to say the risk is equivalent.
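The arithmetic behind that comparison can be sketched as follows. The three input figures are the ones quoted in this post; the raw TTS event rate is not quoted here, so the sketch backs it out from the other two numbers, which is an assumption about how the 0.081 figure was reached.

```python
TTS_DEATHS_PER_100K = 0.081    # quoted above, age group 50-59
TTS_CASE_FATALITY = 0.03       # quoted above: 3% risk of death given a clot
COVID_DEATHS_PER_100K = 0.1    # quoted above, moderate outbreak, age group 50-59

def implied_event_rate(deaths_per_100k, case_fatality):
    """Event rate per 100,000 implied by a death rate and a case fatality rate."""
    return deaths_per_100k / case_fatality

tts_events_per_100k = implied_event_rate(TTS_DEATHS_PER_100K, TTS_CASE_FATALITY)

# Comparing raw TTS events against Covid deaths (events versus outcomes).
naive_ratio = tts_events_per_100k / COVID_DEATHS_PER_100K
# Comparing like with like: deaths against deaths.
adjusted_ratio = TTS_DEATHS_PER_100K / COVID_DEATHS_PER_100K

print(f"implied TTS events per 100,000: {tts_events_per_100k:.2f}")
print(f"events versus deaths ratio:     {naive_ratio:.1f}")
print(f"deaths versus deaths ratio:     {adjusted_ratio:.2f}")
```

The gap between the two ratios is simply one over the case fatality rate, which is why leaving that step out inflates the apparent TTS risk so badly.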

Putting it simply and bluntly, ATAGI managed to overestimate the risk of TTS by more than an order of magnitude (a factor of about 33, given the 3% fatality rate). That overestimate runs all the way through the rest of their risk estimates, heavily skewing them against Astra Zeneca. Why did such a basic error occur? Well it’s pretty much an open secret there’s been ongoing internal factional fighting within ATAGI regarding Pfizer versus Astra Zeneca, and I’d surmise that whoever put this risk assessment together got the number they expected (wanted to see) and didn’t bother to check it.

If ATAGI had just checked the damn numbers they wouldn’t have made the stupid error that they did, and they wouldn’t then have issued the advice that sent the Commonwealth Government into a flat spin and led to the trashing of Astra Zeneca, the only vaccine we had much of at that time. Slow hand clap for ATAGI.

In an alternate universe of course someone did pick up the error, the advice remained unchanged, that limousine driver in Sydney did get his Astra Zeneca jab and the Delta outbreak in Sydney never took off. I’d really like to live in that universe, wouldn’t you?

### Or how not to do risk assessments

To set the scene. Here in Australia there is a group called the Australian Technical Advisory Group on Immunisation (ATAGI) that provides advice to the federal government’s health Minister on the safety and use of vaccines. Part of their job during the pandemic has been advising on how the various vaccines should be rolled out, to what age groups and so on. ATAGI originally recommended that for those aged under 50 Pfizer was preferred over Astra Zeneca due to clotting risks in younger individuals and that, “given there is currently no or limited community transmission in Australia”, we could afford to wait for Pfizer to turn up. On the 17th June 2021 they updated their advice and extended their recommendation to the 50 to 59 age cohort due to several incidents of thrombosis with thrombocytopenia syndrome, TTS for short. Again they reaffirmed that it was based on no or limited transmission in Australia.

So what is wrong with this advice? Well quite a lot. One of the principal rules of risk assessment is that you should always be careful to compare risks that are equivalent; technically this is called commensurability. So can we compare the risk of getting Covid 19 with that of having a blood clot as a result of a vaccine? Well, let’s start with another example drawn from earlier in this pandemic. Early on the lockdown sceptics were pointing out that your risk of drowning in a pool, in California, was much higher than that of dying from Covid 19, so why worry? If you feel this is intuitively wrong, in fact wronger than wrong, then yes you’re on the right track.

The two probabilities that underpin these risks are in fact radically different. In the first case, if I drown in my pool it’s not going to have any effect on the probability of my neighbour drowning; we can say these events are independent and so are their probabilities. But on the other hand, if I contract Covid 19 you’ll find that the probability of my neighbour also getting Covid 19 is actually dependent on the probability that others (including me) are infected. In the first scenario we can truck along with a constant rate (and risk) of drowning events, each unaffected by the others, but in our second scenario the events can affect each other’s probabilities and the risk can suddenly blow up. Thus these two risks are fundamentally incommensurable because of the underlying difference between dependent and independent probabilities.

In the case of the ATAGI estimate they compared exactly these two different types of probability, and risk, and made exactly the same error. That is, the risk of clotting is clearly independent while that of dying from Covid 19 is very much dependent. Just like our prior example these are not the same sort of risks, and just like our prior example you cannot compare them.
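A toy illustration of why the two kinds of risk behave so differently over time. This is a deliberately crude sketch, not an epidemiological model: the weekly rate, the starting case count and the growth factor are all hypothetical numbers picked to show the shape of the curves.

```python
def independent_risk(rate_per_week, weeks):
    """Expected events per week when no event influences any other (e.g. drownings)."""
    return [float(rate_per_week) for _ in range(weeks)]

def dependent_risk(initial_cases, growth_per_week, weeks):
    """Expected cases per week when each case raises the chance of further cases."""
    cases = [float(initial_cases)]
    for _ in range(weeks - 1):
        # Next week's cases depend on how many people are infected this week.
        cases.append(cases[-1] * growth_per_week)
    return cases

# Hypothetical inputs: both processes start at the same weekly level.
drownings = independent_risk(rate_per_week=2, weeks=10)
infections = dependent_risk(initial_cases=2, growth_per_week=1.8, weeks=10)

for week, (d, c) in enumerate(zip(drownings, infections), start=1):
    print(f"week {week:2d}: independent {d:6.1f}   dependent {c:8.1f}")
```

Both series start from the same place, but only one of them stays there, which is why a single snapshot comparison of the two rates tells you so little.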

So could ATAGI retrieve their risk assessment? In order to do that they would have to meaningfully assess an individual’s risk while they wait, patiently, for Pfizer. We can in theory apply what’s called Pascal’s parallel worlds theory to the problem. That is, we imagine all the different parallel worlds (or scenarios) in which an individual might be infected and the consequences, as well as those in which they are not infected, and based on the relative proportions of outcomes we can assign probabilities. This all sounds fine except for the small problem, called the ergodic fallacy, that we don’t actually live in these parallel worlds. You and I and everyone else live our lives on a single timeline where getting Covid can kill us, and it doesn’t matter that a thousand other ‘us’ in other timelines have not. Where there is risk of ruin these sorts of exercises can and do deliver alarming underestimates of risk. We might for example consider the young woman who recently died of Covid 19 in Sydney and whether, if knowing what lay ahead, she would have preferred to have taken Astra Zeneca (2). I think perhaps so. So no, from an individual’s perspective such prognostication about the future cannot rescue ATAGI’s assessment. As N. Taleb points out, when there’s risk of ruin in the house we shouldn’t try to estimate the probability of such uncertain and unknowable events, we should just focus on eliminating the risk.

Adding further to the problem, when ATAGI actually got around to performing a cost versus benefit study (ATAGI advice 21 June 2021) they compared the rate of TTS against the rate of hospitalisations and deaths for Covid 19. But this makes no sense, as a TTS is an event that can have a range of outcomes, so their risk assessment is misleading in that it’s comparing a naive event rate on the one hand with the various rates of loss outcomes on the other. Nowhere in their study do they identify exactly what the probabilities of death or ICU stays are if a TTS does occur, which on the face of it skews the risk assessment and makes TTS appear to be a bigger risk than it is (2). Further compounding the confusion, ATAGI then go on to point out that their probability estimates for TTS are uncertain as they’re based on a small number of people vaccinated with Astra Zeneca under 50 in Australia (3). Well OK, but in that case would it be too much to ask for a confidence interval on their estimate? Or to explicitly compare it to international data?

The result of ATAGI’s sloppy thinking and analysis is that it encouraged the ‘waiting for Pfizer’ mindset even in those who were eligible for Astra Zeneca according to ATAGI’s own recommendations. And as we all now know it was one unvaccinated limousine driver, who’d declined Astra Zeneca because he was waiting for his Pfizer, that sparked off the current Delta outbreak in Australia.

Here we are, thanks in no small measure to ATAGI.

Notes

1. Also in the age group 18-49 they’re estimating zero deaths due to Covid 19 for a moderate outbreak. Now that can’t be right as we’ve had deaths already in this age cohort. It’s rare, perhaps not so rare with Delta, but it does happen. One meta-study estimates a median Infection Fatality Rate of 0.002% at age 10 and 0.01% at age 25. How they arrived at zero is a mystery; risk doesn’t just magically go to zero even for a moderate outbreak. So again it appears that ATAGI are underplaying the risks of Covid 19 for the younger cohorts.

2. I went and pulled the prior ATAGI advice and the risk of death is 3% if you get a clot. So the risk of death due to TTS is 0.081 per 100,000 in the age group 50-59, compared to their risk of death in a moderate outbreak of 0.1 per 100,000. I mean really ATAGI?

3. Where there is uncertainty in an estimator we should do more than just point at it and say look, it’s uncertain. And there are well tried statistical techniques that can do just that.
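One such well tried technique is a confidence interval for a proportion. The sketch below uses the Wilson score interval, which is my choice for illustration because it behaves sensibly for rare events and small counts; nothing in ATAGI’s advice says which method (if any) they considered. The event and dose counts are hypothetical.

```python
import math

def wilson_interval(events, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = events / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical counts: the same observed rate from a small and a large sample.
for events, doses in ((2, 100_000), (200, 10_000_000)):
    low, high = wilson_interval(events, doses)
    print(f"{events} events in {doses:,} doses: "
          f"{low * 1e5:.2f} to {high * 1e5:.2f} per 100,000")
```

Reporting the interval alongside the point estimate makes it obvious when an early rate is little more than a guess.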

Imperial College London just updated their report on interventions (other than pharmacological) to reduce death rates and prevent the health care system being overwhelmed. The news is not good.

They first modelled traditional mitigation strategies that seek to slow but not stop the spread, e.g. flatten the curve, for Great Britain and the United States. For an unmitigated epidemic they found you’d end up with 510,000 deaths in Great Britain, not accounting for the effect on mortality of health systems being overwhelmed. Even with a fully optimised set of mitigations in place they found that this would only reduce peak critical care demand by two-thirds and halve the number of deaths. Yet this ‘optimal’ scenario would also still result in an 8-fold higher peak demand on critical care beds over available capacity.

Their conclusion? That epidemic suppression is the only viable strategy at the current time. This has profound implications for Australia which still appears to be on a mitigation path. First, even if we do our very best there will be a reduction of at best 50% in the death rate. This translates to on the order of at least 100,000 deaths. To put that in context that’s more than Australia lost in two world wars. The associated number of sick patients would undoubtedly also overwhelm our critical care system.

The only viable alternative Imperial College identified was to act to suppress the epidemic, i.e. to reduce R (the reproductive number) to close to 1 or below. To do so would need a combination of strict case isolation, population level social distancing, household quarantines and/or school and university closure. This suppression would need to be in place for at least five months. Having suppressed it, a combination of rigorous case isolation and contact tracing would (hopefully) then be able to deal with subsequent outbreaks.

However Australia is not doing this. The Prime Minister has made this very plain: he’s not going to ‘turn Australia off and then back on again’. We also seem to be underestimating the numbers (see this Guardian article on NSW Health’s estimates). So in the absence of the State Governments breaking ranks we are now on an express train ride (see chart below) to a national disaster of epic proportions. Jesus.

This will be the last post on this website, so if you want to grab some of the media available under useful stuff feel free.

If you want to know where Crew Resource Management as a discipline started, then you need to read NASA Technical Memorandum 78482, or “A Simulator Study of the Interaction of Pilot Workload With Errors, Vigilance, and Decisions”, by H.P. Ruffell Smith, the British-born physician and pilot. Before this study it was hours in the seat and line seniority that mattered when things went to hell. After it the aviation industry started to realise that crews rose or fell on the basis of how well they worked together, and that a good captain got the best out of his team. Today whether crews get it right, as they did on QF72, or terribly wrong, as they did on AF447, the lens that we view their performance through has been irrevocably shaped by the work of Ruffell Smith. From little seeds great oaks grow indeed.

Update to the MH-370 hidden lesson post just published, in which I go into a little more detail on what I think could be done to prevent another such tragedy.

## Bath Iron Works Corporation Report 1995

One of the things that they don’t teach you at University is that as an engineer you will never have enough time. There’s never the time in the schedule to execute that perfect design process in your head which will answer all questions, satisfy all stakeholders and optimise your solution to three decimal places. Worse yet, you’re going to be asked to commit to parts of your solution in detail before you’ve finished the overall design because, for example, ‘we need to order the steel bill now because there’s a 6 month lead time, so where’s the steel bill?’. Then there’s dealing with the ‘internal stakeholders’ in the design process, who all have competing needs and agendas. You generally end up with the electrical team hating the mechanicals, nobody talking to structure and everybody hating manufacturing (1).

So good engineering managers (2) spend a lot of their time managing the risk of early design commitment and the inherent concurrency of the program, disentangling design snarls and adjudicating turf wars over scarce resources (3). Get it right and you’re only late by the usual amount, get it wrong and very bad things can happen. Strangely you’ll not find a lot of guidance in traditional engineering education on these issues, but for my part what I’ve found to be helpful is a pragmatic design process that actually supports you in doing the tough stuff (4). Oh and being able to manage outsourcing of design would also be great (5). This all gets even more difficult when you’re trying to do vehicle design, which I liken to trying to stick 5 litres of stuff into a 4 litre container. So at the link below is my architecting approach to managing at least part of the insanity. I doubt that there’ll ever be a perfect answer to this: far too many constraints, competing agendas and just plain cussedness of human beings. But if your last project was anarchy dialled up to eleven it might be worth considering some of these possible approaches. Hope it helps, and good luck!

### Notes

1. It is a truth universally acknowledged that engineers are notoriously bad at communicating.

3. Such as cable and piping routes, whose sensor goes at the top of the mast, mass budgets, and power constraints. I’m sure we’ve all been there.

4. My observation is that (some) engineers tend to design their processes to be perfect and conveniently ignore the ugly messiness of the world, because they are uncomfortable with being accountable for decisions made under uncertainty. Of course when you can’t both follow these processes and get the job done these same engineers will use this as a shield from all blame, e.g. ‘If you’d only followed our process..’ say they, ‘sure…’ say I.

5. A traditional management ploy to reduce costs, but rarely does management consider that you then need to manage that outsourced effort, which takes a particular mix of skills. Yet another kettle of dead fish, as Boeing found out on the B787.

### Reference

So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?

In slightly longer form. If (for example) air data is so unreliable that your automation needs to automatically drop out of its primary mode, and your QRH procedure is then to manually fly pitch and thrust (1), then why not also automatically present a display page that only provides the data that pilots can trust and is needed to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’ where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software it’s over to you brave aviator” (3). This response is however something of a cop out, as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:

1. Redesign the attitude display with articulated pitch ladders, or a Malcolm horizon, to improve situational awareness.
2. Provide a fallback AoA source using an AoA estimator.
3. Provide actual direct access to flight data parameters such as Mach number and AoA to support troubleshooting (5).
4. Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
5. Use non-Aristotelian logic to better model the trustworthiness of air data.
6. Provide the current master/slave hierarchy status amongst voting channels to aircrew.
7. Provide an obvious and intuitive way to remove a faulted channel, allowing flight under reversionary laws (7).
8. Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).

As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish, in fact it is likely to increase if the past history of automation is any guide to the future.

#### Notes

1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps as the aircraft was already in a safe state and waited for the icing induced condition to clear.

2. Although the Airbus Back Up Speed Scale (BUSS) does use angle-of-attack data to provide a speed range and GPS height data to replace barometric altitude, it has problems at high altitude where Mach number rather than speed becomes significant and the stall threshold changes with Mach number (which it does not know). As a result its use is (as per Airbus manuals) limited to below FL250.

3. What system designers do, in the abstract, is decompose and allocate system level behaviors to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except ‘apparently’ if the component in question is a human and therefore considered to be ‘outside’ your system.

4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).

5. For example in the Airbus design although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second they are not directly available to aircrew.

6. Yet another way of looking at the problem is that the principles of ecological design need to be applied to the aircrew task of dealing with contingency situations.

7. For example in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel, and deselect two ADRs…which ones and the criterion to choose which ones are not however detailed by the manufacturer.

8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.

One of the perennial problems we face in a system safety program is how to come up with a convincing proof for the proposition that a system is safe. Because it’s hard to prove a negative (in this case the absence of future accidents) the usual approach is to pursue a proof by contradiction, that is, develop the negative proposition that the system is unsafe, then prove that this is not true, normally by showing that the set of identified specific propositions of ‘un-safety’ have been eliminated or controlled to an acceptable level. Enter the term ‘hazard’, which in this context is simply shorthand for a specific proposition about the unsafeness of a system. Now interestingly when we parse the set of definitions of hazard we find the recurring use of terms like ‘condition’, ‘state’, ‘situation’ and ‘events’ that should they occur will inevitably lead to an ‘accident’ or ‘mishap’. So broadly speaking a hazard is an explanation based on a defined set of phenomena, one that argues that if they are present, and given there exists some relevant domain source (1) of hazard, an accident will occur. All of which seems to indicate that hazards belong to a class of explanatory models called covering laws. As an explanatory class, covering law models were developed by the logical positivist philosophers Hempel and Popper because of what they saw as problems with an over reliance on inductive arguments as to causality.

As a covering law explanation of unsafeness a hazard posits phenomenological facts (system states, human errors, hardware/software failures and so on) that confer what’s called nomic expectability on the accident (the thing being explained). That is, the phenomenological facts combined with some covering law (natural and logical) require the accident to happen, and this is what we call a hazard. We can see an archetypal example in the Source-Mechanism-Outcome model of Swallom, i.e. if we have both a source and a set of mechanisms in that model then we may expect an accident (Ericson 2005). While logical positivism had the last nails driven into its coffin by Kuhn and others in the 1960s, and it’s true, as Kuhn and others pointed out, that covering model explanations have their fair share of problems, so too do other methods (2). The one advantage that covering models do possess over other explanatory models however is that they largely avoid the problems of causal arguments. Which may well be why they persist in engineering arguments about safety.

### Notes

1. The source in this instance is the ‘covering law’.

2. Such as counterfactual, statistical relevance or causal explanations.

### References

Ericson, C.A. Hazard Analysis Techniques for System Safety, page 93, John Wiley and Sons, Hoboken, New Jersey, 2005.

Here’s a working draft of the introduction and first chapter of my book…. Enjoy 🙂

We are hectored on an almost daily basis about the imminent threat of Islamic extremism and how we must respond firmly to this real and present danger. Indeed we have proceeded far enough along the escalation of response ladder that this, presumably existential threat, is now being used to justify talk of internment without trial. So what is the probability that if you were murdered, the murderer would be an immigrant terrorist?

In NSW in 2014 there were 86 homicides, of these 1 was directly related to the act of a homegrown Islamist terrorist (1). So there’s a 1 in 86 chance that in that year if you were murdered it was at the hands of a mentally disturbed asylum seeker (2). Hmm sounds risky, but is it? Well there were approximately 2.5 million people in NSW in 2014, so the likelihood of being murdered (in that year) is in the first instance 3.44e-5. To figure out the likelihood of being murdered and that murder being committed by a terrorist we just multiply this base rate by the probability that it was at the hands of a ‘terrorist’, ending up with 4e-7 or 4 chances in 10 million that year. If we consider subsequent and prior years where nothing happened that likelihood becomes even smaller.
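For anyone who wants to check the multiplication, here is the calculation spelled out. All three inputs are the figures used in this post, taken as given; swap in your own population estimate to see how the result moves.

```python
HOMICIDES_2014 = 86          # figure used in this post
TERRORIST_HOMICIDES = 1      # figure used in this post
POPULATION = 2_500_000       # population figure used in this post

# Base rate: chance of being murdered in that year.
p_murdered = HOMICIDES_2014 / POPULATION
# Conditional: chance the murderer was a 'terrorist', given you were murdered.
p_terrorist_given_murdered = TERRORIST_HOMICIDES / HOMICIDES_2014
# Joint probability: murdered AND by a 'terrorist'.
p_murdered_by_terrorist = p_murdered * p_terrorist_given_murdered

print(f"P(murdered)                = {p_murdered:.2e}")
print(f"P(terrorist | murdered)    = {p_terrorist_given_murdered:.4f}")
print(f"P(murdered by 'terrorist') = {p_murdered_by_terrorist:.1e}")
```

Note that the homicide count cancels out, so the joint probability is just the one death divided by the population.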

Based on this 4 in 10 million chance the NSW government intends to build a super-max 2 prison in NSW, and fill it with ‘terrorists’ while the Federal government enacts more anti-terrorism laws that take us down the road to the surveillance state, if we’re not already there yet. The glaring difference between the perception of risk and the actuality is one that politicians and commentators alike seem oblivious to (3).

Notes

1. One death during the Lindt chocolate siege that could be directly attributed to the `terrorist’.

2. Sought and granted in 2001 by the then Liberal National Party government.

3. An action that also ignores the role of prisons in converting inmates to Islam as a route to recruiting their criminal, anti-social and violent sub-populations in the service of Sunni extremists.

The Sydney Morning Herald published an article this morning that recounts the QF72 midair accident from the point of view of the crew and passengers; you can find the story at this link. I’ve previously covered the technical aspects of the accident here, the underlying integrative architecture program that brought us to this point here and the consequences here. So it was interesting to reflect on the event from the human perspective. Karl Weick points out in his influential paper on the Mann Gulch fire disaster that small organisations, for example the crew of an airliner, are vulnerable to what he termed a cosmology episode, that is, a moment when one abruptly and deeply feels that the universe is no longer a rational, orderly system. In the case of QF72 this was initiated by the simultaneous stall and overspeed warnings, followed by the abrupt pitch over of the aircraft as the flight protection laws engaged for no reason.

Weick further posits that what makes such an episode so shattering is that both the sense of what is occurring and the means to rebuild that sense collapse together. In the Mann Gulch blaze the fire team’s organisation attenuated and finally broke down as the situation eroded until at the end they could not comprehend the one action that would have saved their lives, to build an escape fire. In the case of air crew they implicitly rely on the aircraft’s systems to `make sense’ of the situation, a significant failure such as occurred on QF72 denies them both understanding of what is happening and the ability to rebuild that understanding. Weick also noted that in such crises organisations are important as they help people to provide order and meaning in ill defined and uncertain circumstances, which has interesting implications when we look at the automation in the cockpit as another member of the team.

“The plane is not communicating with me. It’s in meltdown. The systems are all vying for attention but they are not telling me anything…It’s high-risk and I don’t know what’s going to happen.”

Capt. Kevin Sullivan (QF72 flight)

From this Weickian viewpoint we see the aircraft’s automation as both part of the situation, ‘what is happening?’, and as a member of the crew, ‘why is it doing that, can I trust it?’ Thus the crew of QF72 were faced with both a vu jàdé moment and the allied disintegration of the human-machine partnership that could help them make sense of the situation. The challenge that the QF72 crew faced was not to form a decision based on clear data and well rehearsed procedures from the flight manual; instead they faced a much more unnerving loss of meaning as the situation outstripped their past experience.

“Damn-it! We’re going to crash. It can’t be true!” (copilot #1)

“But, what’s happening?” (copilot #2)

AF447 CVR transcript (final words)

Nor was this an isolated incident. One study of other such ‘unreliable airspeed’ events found that errors in understanding were both far more likely to occur than other error types and, when they did, much more likely to end in a fatal accident. In fact they found that all accidents with a fatal outcome were categorised as involving an error in detection or understanding, with the majority being errors of understanding. From Weick’s perspective then the collapse of sensemaking is the knock out blow in such scenarios, as the last words of the Air France AF447 crew so grimly illustrate. Luckily in the case of QF72 the aircrew were able to contain this collapse and rebuild their sense of the situation; in the case of other such failures, such as AF447, they were not.

For those of you who might be wondering at the lack of recent posts I’m a little pre-occupied at the moment as I’m writing a book. Hope to have a first draft ready in July. ; )

A recent workplace health and safety case in Australia has emphasised that an employer does not have to provide training for tasks that are considered to be ‘relatively’ straightforward. The presiding judge also found that while changes to the workplace could in theory be made, in practice it would be unreasonable to demand that the employer make such changes. The judge’s decision was subsequently upheld on appeal.

What’s interesting is the close reasoning of the court (and the appellate court) to establish what is reasonable and practicable in the circumstances. While the legal system is not perfect it does have a long standing set of practices and procedures for getting at the truth. Perhaps we may be able to learn something from the legal profession when thinking about the safety of critical systems. More on this later.

Cowie v Gungahlin Veterinary Services Pty Ltd [2016] ACTSC 311 (25 October 2016)

Second part of the SBS documentary on line now. Looking at the IoT this episode.

### More infernal statistics

Well, here we are again. Given recent developments in the infernal region it seems like a good time for another post. Have you ever, dear reader, been faced with the problem of how to achieve an unachievable safety target? Well worry no longer! Herewith is Screwtape’s patented man based mitigation medicine.

The first thing we do is introduce the concept of ‘mitigation’, ah what a beautiful word that is. You see it’s saying that it’s OK that your system doesn’t meet its safety target, because you can claim credit for the action of an external mitigator in the environment. Probability wise, if the probability of an accident is P_a then P_a equals the product of your system’s failure probability P_s and the probability that some external mitigation also fails P_m, or P_a = P_s × P_m.

So let’s use operator intervention as our mitigator, lovely and vague. But how to come up with a low enough P_m? Easy, we just look at the accident rate that has occurred for this or a like system and assume that these were due to operator mitigation being unsuccessful. Voila, we get our really small numbers.

Now, an alert reader might point out that this is totally bogus and that P_m is actually the likelihood of operator failure when the system fails. Operators failing, as those pestilential authors of the WASH-1400 study have pointed out, is actually quite likely. But I say, if your customer is so observant and on the ball then clearly you are not doing your job right. Try harder or I may eat your soul, yum yum.

Yours hungrily,

Screwtape.
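For the less infernally minded, here is a minimal sketch of why the alert reader is right. It assumes the system failure and the mitigation failure are independent, and every number in it is hypothetical, chosen only to make the arithmetic visible. The function name `accident_probability` is mine, not from any standard.

```python
import math


def accident_probability(p_system: float, p_mitigation_fails: float) -> float:
    """P_a = P_s x P_m, assuming the two failures are independent."""
    return p_system * p_mitigation_fails


# Hypothetical values, for illustration only.
p_s = 1e-3            # probability the system fails on demand
honest_p_m = 0.1      # probability the operator fails to recover, GIVEN the system has failed

# What history would actually record: accidents already include the system failing.
historical_accident_rate = accident_probability(p_s, honest_p_m)

# Screwtape's trick: treat the historical accident rate as if it were P_m.
screwtape_p_m = historical_accident_rate
screwtape_claim = accident_probability(p_s, screwtape_p_m)

# The claim counts P_s twice, so it understates the risk by a factor of 1 / P_s.
understatement = historical_accident_rate / screwtape_claim

print(f"historical P_a : {historical_accident_rate:.1e}")
print(f"Screwtape's P_a: {screwtape_claim:.1e}")
print(f"understated by : {understatement:.0f}x")
```

The point of the sketch is that the accident record is already P_s × P_m, so recycling it as P_m multiplies the system failure probability in twice.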

## Milton Friedman

About time I hear you say! 🙂

Yes, I’ve just rewritten a post on functional failure taxonomies to include how to use them to gauge the completeness of your analysis. This came out of a question I was asked in a workshop that went something like, ‘OK Mr big-shot consultant, tell us, exactly how do we validate that our analysis is complete?’. That’s actually a fair question; standards like EUROCONTROL’s SAM Handbook and ARP 4761 tell you that you ought to, but are not that helpful in the how-to-do-it department. Hence this post.

Using a taxonomy to determine the coverage of the analysis is one approach to determining completeness. The other is to perform at least two analyses using different techniques and then compare the overlap of hazards using a capture/recapture technique. If there’s a high degree of overlap you can be confident there’s only a small hidden population of hazards as yet unidentified. If there’s a very low overlap, you may have a problem.
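As a rough illustration of the capture/recapture idea, here is a sketch using the Lincoln-Petersen estimator, which is my choice of estimator rather than anything prescribed by the standards mentioned above. It assumes the two analyses are independent and that every hazard is equally likely to be found, neither of which holds cleanly in practice, so treat the output as a sanity check and not a measurement. The hazard identifiers are made up.

```python
from __future__ import annotations


def estimate_hazard_population(analysis_a: set[str], analysis_b: set[str]) -> tuple[float, float]:
    """Return (estimated total hazards, estimated hazards still hidden).

    Lincoln-Petersen estimator: N ~= (n_a * n_b) / overlap.
    """
    overlap = len(analysis_a & analysis_b)
    if overlap == 0:
        # No hazards in common: the estimator is undefined, which is itself a warning sign.
        raise ValueError("no overlap between analyses; cannot estimate the population")
    estimated_total = len(analysis_a) * len(analysis_b) / overlap
    found_so_far = len(analysis_a | analysis_b)
    return estimated_total, estimated_total - found_so_far


# Hypothetical results of two analyses using different techniques.
fha_hazards = {f"H{i}" for i in range(1, 9)}     # H1..H8
hazop_hazards = {f"H{i}" for i in range(5, 13)}  # H5..H12

total, hidden = estimate_hazard_population(fha_hazards, hazop_hazards)
print(f"estimated total: {total:.0f}, estimated still hidden: {hidden:.0f}")
```

With eight hazards in each analysis and four in common, the estimate is sixteen hazards in total, of which twelve have been found. As the overlap shrinks the estimate of the hidden population grows quickly, which is the ‘you may have a problem’ case.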

Apparently 2 million Australians trying to use the one ABS website because they’re convinced the government will fine them if they don’t is the freshly minted definition of a “Distributed Denial of Service (DDoS)” attack. 🙂

Alternative theory, who would have thought that foreign nationals (oh all right, we all know it’s the Chinese*) might try to disrupt the census in revenge for those drug cheat comments at the Olympics?

Interesting times. I hope the AEC is taking notes for their next go at electronic voting.

#censusfail

Currently enjoying watching the ABS Census website burn to the ground. Ah schadenfreude, how sweet you are.

Census time again, and those practical jokers at the Australian Bureau of Statistics have managed to spring a beauty on the Australian public. The joke being that, rather than collecting the data anonymously, you are now required to fill in your name and address, which the ABS will retain (1). This is a bad idea, in fact it’s a very bad idea, not quite as bad as say getting stuck in a never-ending land war in the Middle East, but certainly much worse than experiments in online voting. Continue Reading…

## Matthew Squair

#### Vale Challenger

The anniversary of the loss of Challenger passed by on Thursday. In memoriam, I’ve updated my post that deals with the failure of communication that I think lies at the heart of that disaster.

## Richard P. Feynman

Here’s a copy of the presentation that I gave at ASSC 2015 on how to use MIL-STD-882C to demonstrate compliance to the WHS Act 2011. The Model Australian Workplace Health and Safety (WHS) Act places new and quite onerous requirements upon manufacturers, suppliers and end-user organisations. These new requirements include the requirement to demonstrate due diligence in the discharge of individual and corporate responsibilities. Traditionally contracts have steered clear of invoking WHS legislation in anything other than a most abstract form; unfortunately such traditional approaches provide little evidence with which to demonstrate compliance with the WHS Act.

The presentation describes an approach to establishing compliance with the WHS Act (2011) using the combination of a contracted MIL-STD-882C system safety program and a compliance finding methodology. The advantages and effectiveness of this approach in terms of establishing compliance with the Act and the effective discharge of the responsibilities of both supplier and acquirer are illustrated using a case study of a major aircraft modification program. Limitations of the approach are then discussed given the significant difference between the decision making criteria of classic systems safety and the so far as is reasonably practicable principle.

More woes for OPM, and pause for thought for the proponents of centralized government data stores. If you build it they will come…and steal it.

Just attended the Australian System Safety Conference, the venue was the Customs House right on the river. Lots of excellent speakers and interesting papers, I particularly enjoyed Drew Rae’s on tribalism in system safety. The keynotes on resilience by John Bergstrom and cyber-security by Chris Johnson were also very good. I gave a presentation on the use of MIL-STD-882 as a tool for demonstrating compliance to the WHS Act, a subject that only a mother could love. Favourite moment? Watching the attendees’ faces when I told them that 61508 didn’t comply with the law. 🙂

Thanks again to Kate Thomson and John Davies for reviewing the legal aspects of my paper. Much appreciated guys.

Just added a short case study on the Patriot software timing error to the software safety micro course page. Peter Ladkin has also subjected the accident to a Why Because Analysis.

In case you’re wondering what’s going on dear reader, human factors can be a bit dry, and the occasional poster style blog posts you may have noted are my attempt to hydrate the subject a little. The continuing series can be found on the page imaginatively titled Human error in pictures, and who knows, someone may find it useful…

An interesting little exposition of the current state of the practice in information risk management using the metaphor of the bald tire on the FAIR wiki. The authors observe that there’s much more shamanistic ritual (dressed up as ‘best practice’) than we’d like to think in risk assessment. A statement that I endorse, actually I think it’s mummery for the most part, but ehem, don’t tell the kids.

Their point is twofold. First, that while experience and intuition are vital, on their own they give little grip to critical examination. Second, that if you want to manage you must measure, and to measure you need to define.

A disclaimer, I’m neither familiar with nor a proponent of the FAIR tool, and I strongly doubt whether we can ever put risk management onto a truly scientific footing; much like engineering there’s more art than artifice. But it’s an interesting commentary nonetheless.

I give it 011 out of 101 tooled up script kiddies.

The WordPress.com stats helper monkeys prepared a 2014 annual report for this blog.

Here's an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 32,000 times in 2014. If it were a concert at Sydney Opera House, it would take about 12 sold-out performances for that many people to see it.

Yep that’s right, due to popular demand I’m running ZEIT 8236 System Safety as an Intensive Delivery mode course in the second session at ADFA from the 13th to 17th of July 2015. If you want a flavour, here’s the introductory module. Remember, I love this stuff. 🙂

So I’ve been invited to give a talk on risk at the conference dinner. Should be interesting.

An interesting article in Forbes on human error in a very unforgiving environment, i.e. treating Ebola patients, and an excellent use of basic statistics to prove that cumulative risk tends to do just that, accumulate. As the number of patients being treated in the west is pretty low at the moment it also gives a good indication of just how infectious Ebola is. One might also infer that the western medical establishment is not quite so smart as it thought it was, at least when it comes to treating the big E safely.

Of course the moment of international zen in the whole story had to be the comment by the head of the CDC, Dr Frieden, that, and I quote, “clearly there was a breach in protocol”, a perfect example of affirming the consequent. As James Reason pointed out years ago there are two ways of dealing with human error, so I guess we know where the head of the CDC stands on that question. 🙂
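To make the ‘cumulative risk accumulates’ point concrete, here is a small sketch of the underlying arithmetic. It assumes each exposure is independent and carries the same small chance of a slip; the per-exposure probability below is purely illustrative and is not taken from the Forbes article.

```python
def cumulative_risk(p_per_exposure: float, n_exposures: int) -> float:
    """Probability of at least one failure in n independent, identical exposures.

    P(at least one) = 1 - (1 - p)^n
    """
    if not 0.0 <= p_per_exposure <= 1.0:
        raise ValueError("p_per_exposure must be a probability")
    if n_exposures < 0:
        raise ValueError("n_exposures must be non-negative")
    return 1.0 - (1.0 - p_per_exposure) ** n_exposures


# Hypothetical: a 1% chance of a protective equipment slip per patient contact.
for contacts in (1, 10, 100, 500):
    print(f"{contacts:>4} contacts -> {cumulative_risk(0.01, contacts):.3f}")
```

Even a one in a hundred chance per contact gives better than even odds of at least one slip after a hundred contacts, which is why ‘there was a breach in protocol’ explains very little.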

If you were wondering why the Outliers post was, ehem, a little rough, I accidentally posted an initial draft rather than the final version. I’ve now released the right one.

On Artificial Intelligence as ethical prosthesis

Out here in the grim meat-hook present of Reaper missions and Predator drone strikes we’re already well down track to a future in which decisions as to who lives and who dies are made less and less by human beings, and more and more by automation. Although there’s been a lot of ‘sexy’ discussion recently of the possibility of purely AI decision making, the current panic misses the real issue du jour, that is the question of how well current day hybrid human-automation systems make such decisions, and the potential for the incremental abrogation of moral authority by the human part of this cybernetic system as the automation in this synthesis becomes progressively more sophisticated and suasive.

As Dijkstra pointed out in the context of programming, one of the problems or biases humans have in thinking about automation is that because it ‘does stuff’, we find the need to imbue it with agency, and from there it’s a short step to treating the automation as a partner in decision making. From this very human misunderstanding it’s almost inevitable that the decision maker holding such a view will feel that the responsibility for decisions is shared, and responsibility diluted, thereby opening up potential for choice shift in decision making. As the degree of sophistication of such automation increases of course this effect becomes stronger and stronger, even though ‘on paper’ we would not recognise the AI as a rational being in the Kantian sense.

Even the design of decision support system interfaces can pose tricky problems when an ethical component is present, as the dimensions of ethical problem solving (time intensiveness, consideration, uncertainty, uniqueness and reflection) directly conflict with those that make for efficient automation (brevity, formulaic, simplification, certainty and repetition). This inherent conflict ensures that the interaction of automation and human ethical decision making becomes a tangled and conflicted mess. Technologists of course look at the way in which human beings make such decisions in the real world and believe, rightly or wrongly, that automation can do better. What we should remember is that such automation is still a proxy for the designer; if the designer has no real understanding of the needs of the user in forming such ethical decisions then, if the past is any guide, we are up for a future of poorly conceived decision support systems, with all the inevitable and unfortunate consequences that attend. In fact I feel confident in predicting that the designers of such systems will, once again, automate their biases about how humans and automation should interact, with unpleasant surprises for all.

In a broader sense what we’re doing with this current debate is essentially rehashing the old arguments between two world views on the proper role of automation. On the one side automation is intended to supplant those messy, unreliable humans, in the current context effecting an unintentional ethical prosthetic. On the other hand we have the view that automation can and should be used to assist and augment human capabilities, that is it should be used to support and develop people’s innate ethical sense. Unfortunately in this current debate it also looks like the prosthesis school of thought is winning out. My view is that if we continue in this approach of ‘automating out’ moral decision making we will inevitably end up with the amputation of ethical sense in the decision maker, long before killer robots stalk the battlefield, or the high street of your home town.

Just added a modified version of the venerable subjective 882 hazard risk matrix to my useful stuff page in which I fix a few issues that have bugged me about that particular tool, see Risk and the Matrix for a fuller discussion of the problems with risk matrices. For those of you with a strong interest in such I’ve translated the matrix into cartesian coordinates, revised the risk zone and definitions to make the matrix ‘De Moivre theorem’ compliant (and a touch more conservative), added the AIAA’s combinatorial probability thresholds, and introduced a calibration point.

Update. I subsequently revised the decision making criteria, adding the reasonably practicable principles to the ALARP principle to address the Australian legal concept of SFAIRP, and adjusted the risk curves to reflect how non-ergodic (catastrophic) risks should be treated differently.

I’ve put the original Def Stan 00-55 (both parts) onto my resources page for those who are interested in doing a compare and contrast between the old and the new (whenever its RFC is released). I’ll be interested to see whether the standard’s reluctance to buy into the whole safety by following a process argument is maintained in the next iteration. The problem of arguing from fault density to safety that they allude to also remains, I believe, insurmountable.

The justification of how the SRS development process is expected to deliver SRS of the required safety integrity level, mainly on the basis of the performance of the process on previous projects, is covered in 7.4 and annex E. However, in general the process used is a very weak predictor of the safety integrity level attained in a particular case, because of the variability from project to project. Instrumentation of the process to obtain repeatable data is difficult and enormously expensive, and capturing the important human factors aspects is still an active research area. Furthermore, even very high quality processes only predict the fault density of the software, and the problem of predicting safety integrity from fault density is insurmountable at the time of writing (unless it is possible to argue for zero faults).

Def Stan 00-55 Issue 2 Part 2 Cl. 7.3.1

Just as an aside, the original release of Def Stan 00-56 is also worth a look as it contains the methodology for the assignment of safety integrity levels. Basically for a single function or N>1 non-independent functions the SIL assigned to the function(s) is derived from the worst credible accident severity (much like DO-178). In the case of N>1 independent functions, one of these functions gets a SIL based on severity but the remainder have a SIL rating apportioned to them based on risk criteria. From which you can infer that the authors, just like the aviation community, were rather mistrustful of using estimates of probability in assuring a first line of defence. 🙂

Preamble

The following is a critique of a teleconference conducted on the 16 March between the UK embassy in Japan and the UK Government’s senior scientific advisor and members of SAGE, a UK government crisis panel formed in the aftermath of the Japanese tsunami to advise on the Fukushima crisis. These comments pertain specifically to the 16 March (UK time) teleconference with the British embassy and the minutes of SAGE meetings on the 15th and 16th that preceded that teleconference. Continue Reading…

I’ve just reread Peter Ladkin’s 2008 dissection of the conceptual problems of IEC 61508 here, and having just worked through a recent project in which 61508 SILs were applied, I tend to agree that SIL is still a bad idea, done badly… I’d also add that, the HSE’s opinion notwithstanding, I don’t actually see that the a priori application of a risk derived SIL level to a specific software development acquits one’s ‘so far as is reasonably practicable’ duty of care. Of course if your regulator says it does, why then smile bravely and compliment him on the wonderful cut of his new clothes. On the other hand if you’re designing the safety system for a nuclear plant maybe have a look at how the aviation industry do business with their Design Assurance Levels. 🙂

The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 24,000 times in 2013. If it were a concert at Sydney Opera House, it would take about 9 sold-out performances for that many people to see it.

And I’ve just updated the philosophical principles for acquiring safety critical systems. All suggestions welcome…

Enjoy 🙂

Risk as uncontrollability…

The venerable safety standard MIL-STD-882 introduced the concept of software hazard and risk in Revision C of that standard. Rather than using the classical definition of risk as a combination of severity and likelihood the authors struck off down quite a different, and interesting, path.

In an earlier post I had a look at the role played by design authorities in an organisation, which can have a major effect upon both safety and project success. My focus in that post was on the authority aspect.

However another perspective on the role that a design authority performs is that of someone who is able to understand both the operational requirements for a system (e.g. those that define a need) as well as the technical (those that define a solution) and most importantly be able to translate between them.

This is a role that is well understood in architecture, but one that has seemed to diminish and dwindle in engineering where projects of any complexity are more often undertaken by large bureaucratic organisations, which also traditionally fear assigning responsibility to one person.