Imperial College London has just updated its report on non-pharmaceutical interventions to reduce death rates and prevent the health care system being overwhelmed. The news is not good.

They first modelled traditional mitigation strategies that seek to slow but not stop the spread (i.e. flatten the curve) for Great Britain and the United States. For an unmitigated epidemic they found you’d end up with 510,000 deaths in Great Britain, without even accounting for the effect of an overwhelmed health system on mortality. Even with a fully optimised set of mitigations in place they found this would only reduce peak critical care demand by two-thirds and halve the number of deaths. Yet even this ‘optimal’ scenario would still result in peak demand on critical care beds eight times over the available capacity.

Their conclusion? That epidemic suppression is the only viable strategy at the current time. This has profound implications for Australia, which still appears to be on a mitigation path. First, even if we do our very best, mitigation reduces the death rate by at best 50%. That translates to on the order of at least 100,000 deaths. To put that in context, it’s more than Australia lost in two world wars. The associated number of sick patients would undoubtedly also overwhelm our critical care system.
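
For what it’s worth, here’s the back-of-the-envelope arithmetic behind that figure. It simply scales the Great Britain estimate by population and applies the report’s ‘at best 50%’ reduction, so treat the population figures and the linear scaling as my own rough assumptions, not something out of the Imperial College model.

```python
# Back-of-the-envelope scaling of the Imperial College GB estimate to Australia.
# Assumptions (mine, not the report's): deaths scale linearly with population,
# and mitigation at best halves the death toll.

GB_UNMITIGATED_DEATHS = 510_000   # Imperial College estimate for Great Britain
GB_POPULATION = 66_000_000        # approximate
AUS_POPULATION = 25_000_000       # approximate
MITIGATION_REDUCTION = 0.5        # 'optimal' mitigation roughly halves deaths

aus_unmitigated = GB_UNMITIGATED_DEATHS * (AUS_POPULATION / GB_POPULATION)
aus_mitigated = aus_unmitigated * (1 - MITIGATION_REDUCTION)

print(f"Unmitigated (scaled): ~{aus_unmitigated:,.0f} deaths")
print(f"With 'optimal' mitigation: ~{aus_mitigated:,.0f} deaths")
# Roughly 193,000 unmitigated and 97,000 mitigated -- on the order of 100,000.
```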

The only viable alternative Imperial College identified was to act to suppress the epidemic, i.e. to reduce R (the reproduction number) to close to 1 or below. To do so would need a combination of strict case isolation, population-level social distancing, household quarantine and/or school and university closures. This suppression would need to be in place for at least five months. Having suppressed the epidemic, a combination of rigorous case isolation and contact tracing would (hopefully) then be able to deal with subsequent outbreaks.

However Australia is not doing this. The Prime Minister has made this very plain: he’s not going to ‘turn Australia off and then back on again’. We also seem to be underestimating the numbers (see this Guardian article on NSW Health’s estimates). So in the absence of the State Governments breaking ranks we are now on an express train ride (see chart below) to a national disaster of epic proportions. Jesus.

[Chart: NSW COVID-19 projections]

 

Farewell

22/02/2019

This will be the last post on this website, so if you want to grab any of the media available under ‘useful stuff’, feel free.

Well, as someone said, because it’s the worst of social media combined with the worst of corporate culture and the worst of website design. Because dealing with it regularly is about as interesting as cleaning out my sock drawer, and because the tone, like an endless Ritalin-fuelled Rotary meeting, is just plain unhealthy. The philosopher Kant once said that you should always treat human beings as ends in themselves, and never as just the means to an end. Well LinkedIn, for crimes against the categorical imperative alone, you have to go…

Black Saturday

07/02/2019

So ten years on from the Black Saturday fires we’re taking a national moment to remember the unstinting heroism displayed in the face of the hell that was Black Saturday, and all that was lost. Of course we’re not quite so diligent in remembering that we haven’t prevented people rebuilding in high risk areas, nor improved the fire resistance of people’s homes, nor managed down the risk through realistic burn-off programs, nor for that matter have we noticed that the fuel burden in 2018 was back up to the same level it was ten years before. So perhaps instead we should reflect on how we’ve squandered the opportunity for reform. And perhaps we should remember that a fire will come again, to burn our not so clever country.

Spruiking zero harm or crusading safety ‘because you care’ raises as much suspicion as having a folder on your computer named ‘DEFINITELY NOT PORN’ – would you get on a plane that had “ZERO CRASH” emblazoned all over it?

David Collins

The deadline for you to opt out of the government’s ill-advised national health record system is rapidly approaching, and for the record, yes, I have opted out. I’ll give you a concrete example of what I’m talking about when I say ‘ill advised’: currently it’s assumed that you’ll be OK to share your anonymised medical data for research purposes, because sharing is set as the default. This is despite it being shown time and time again that the anonymisation of such data just doesn’t work. You might share my concerns about this level of indifference to the idea of informed consent. What the agencies of the state clearly don’t get is that this information belongs to you and me, it doesn’t belong to my doctor. Your medical data is yours; your doctor holds it in trust for you. And until the state demonstrates a clear and unequivocal understanding of that point I say no thanks, and I’d invite you all to do the same. My Health Record? Not so much.

PS. The architect of My Health Record is Tim Kelsey, yes that same Tim Kelsey who presided over the UK Government’s Care.data program, which tanked over sharing data without explicit consent. And unfortunately for us that attitude is baked into My Health Record’s DNA.

PPS. To me the carelessness of the government in this whole affair is indicative of the increasingly totalitarian relationship between the government and the people.

Facebook and Google back Labor changes to laws which break encryption

…it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.

Hans Moravec

Uber’s safety management woes

We shouldn’t be killing people in our haste to get to a safe future

Dr Phil Koopman (on driverless cars)

Here’s a view from inside Tesla by one of its former employees. Taking the report at face value, which is of course an arguable proposition, you can see how technical debt can build up to the point where it’s near impossible to pay it down. That in turn can have significant effects on the safety performance of the organisation; see the Toyota spaghetti code case as another example. The take-home is that in any software safety effort it’s a good idea to check whether the company/team is measuring technical debt in a meaningful fashion and actively retiring it, for example by alternating capability and maintenance updates.

Tesla and technical debt.

https://mobile.twitter.com/atomicthumbs/status/1032939617404645376

And the encryption law is passed…

A debate on tools, assurance and ethics

Simply put, it is possible to have convenience if you want to tolerate insecurity, but if you want security, you must be prepared for inconvenience.

Gen. Benjamin Chidlaw (1954)

It is highly questionable whether total system safety is always enhanced by allocating functions to automatic devices rather than human operators, and there is some reason to believe that flight-deck automation may have already passed its optimum point.

Earl Wiener (1980)

If you want to know where Crew Resource Management as a discipline started, then you need to read NASA Technical Memorandum 78482, “A Simulator Study of the Interaction of Pilot Workload With Errors, Vigilance, and Decisions” by H.P. Ruffell Smith, the British-born physician and pilot. Before this study it was hours in the seat and line seniority that mattered when things went to hell. After it the aviation industry started to realise that crews rose or fell on the basis of how well they worked together, and that a good captain got the best out of his team. Today, whether crews get it right, as they did on QF72, or terribly wrong, as they did on AF447, the lens that we view their performance through has been irrevocably shaped by the work of Ruffell Smith. From little seeds great oaks grow indeed.

When you look at the safety performance of industries which have a consistent focus on safety as part of their social licence to operate (nuclear power and aviation are the canonical examples) you see that over time increases in safety tend to plateau. This looks like some form of learning curve, but what mechanism, or mechanisms, actually drives this process? I believe there are two factors at play: firstly the increasing marginal cost of improvement, and secondly the problem of learning from events that we are trying to prevent.

Increasing marginal cost is simply an economist’s way of saying that it costs more to achieve each next increment in performance. For example, airbags are more expensive than seat belts by roughly an order of magnitude (based on replacement costs), yet airbags only deliver an 8% reduction in mortality when used in conjunction with seat belts (Crandall 2001). As a result the next increment in safety takes longer and costs more (1).
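
To put some rough numbers on that, here’s a sketch of the marginal cost arithmetic. The dollar figures are purely notional assumptions of mine; only the order-of-magnitude cost ratio and the 8% incremental benefit come from the discussion above.

```python
# Rough illustration of the increasing marginal cost of safety improvement.
# Dollar figures are notional assumptions; the order-of-magnitude cost ratio
# and the ~8% incremental mortality reduction for airbags over seat belts
# come from the discussion above (Crandall 2001).

SEATBELT_COST = 100.0          # notional replacement cost ($)
AIRBAG_COST = 1_000.0          # roughly an order of magnitude more ($)

SEATBELT_MORTALITY_REDUCTION = 0.50   # assumed baseline benefit of belts alone
AIRBAG_INCREMENTAL_REDUCTION = 0.08   # additional benefit of airbags with belts

cost_per_point_belt = SEATBELT_COST / (SEATBELT_MORTALITY_REDUCTION * 100)
cost_per_point_airbag = AIRBAG_COST / (AIRBAG_INCREMENTAL_REDUCTION * 100)

print(f"Seat belt: ~${cost_per_point_belt:.0f} per % mortality reduction")
print(f"Airbag:    ~${cost_per_point_airbag:.0f} per additional % reduction")
# The next increment of safety costs far more per unit of benefit delivered.
```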

The learning factor is in some ways an informational version of the marginal cost rule. As we reduce accident rates, accidents become rarer. One of the traditional ways in which safety improvements occur is by studying accidents when they do occur and then eliminating or mitigating the identified causal factors. Obviously as the accident rate decreases, so too does this opportunity for improvement. When accidents do occur we have a further problem, because (definitionally) the cause of the accident will comprise a highly unlikely combination of factors needed to defeat the existing safety measures. Corrective actions for such rare combinations of events are therefore highly specific to that event’s context and will have far less universal applicability. For example the lessons of metal fatigue learned from the Comet airliner disasters have had universal applicability to all aircraft designs ever since. But the QF72 automation upset off Learmonth? Well those lessons, relating to the specific fault tolerance architecture of the A330, are much harder to generalise and therefore have less epistemic strength.

In summary, not only does each increasing increment of safety cost more, but our opportunity to learn from accidents steadily reduces as their arrival rate and individual epistemic value (2) decline.

Notes

1. In some circumstances we may also introduce other risks, see for example the death and severe injury caused to small children from air bag deployments.

2. In a Popperian sense.

References

1. Crandall, C.S., Olson, L.M., Sklar, D.P., Mortality Reduction with Air Bag and Seat Belt Use in Head-on Passenger Car Collisions, American Journal of Epidemiology, Volume 153, Issue 3, 1 February 2001, Pages 219–224, https://doi.org/10.1093/aje/153.3.219.

AI winter

Update to the MH-370 hidden lesson post just published, in which I go into a little more detail on what I think could be done to prevent another such tragedy.

Piece of wing found on La Réunion Island: could it be a flap from #MH370? (Image credit: Réunion 1ère)

The search for MH370 will end next Tuesday with the question of its fate no closer to resolution. There is perhaps one lesson that we can glean from this mystery, and that is that when we have a two-man crew behind a terrorist-proof door there is a real possibility that disaster is check-riding the flight. As Kenedi et al. note in a 2016 study, five of the six recorded murder-suicide events by pilots of commercial airliners occurred after they were left alone in the cockpit; in the cases of both Germanwings 9525 and LAM 470 this was enabled by one of the crew being able to lock the other out of the cockpit. So while we don’t know exactly what happened onboard MH370, we do know that the aircraft was flown deliberately to some point in the Indian Ocean, and on the balance of probabilities that was done by one of the crew, with the other crew member unable to intervene, probably because they were dead.

As I’ve written before, the combination of small crew sizes to reduce costs and a secure cockpit to reduce hijacking risk increases the probability of one crew member being able to successfully disable the other and then do exactly whatever they like. Thus the increased hijacking security measures act as a perverse incentive for pilot murder-suicide and may over the long run turn out to kill more people than the risk of terrorism they guard against (1). Or to put it more brutally, murder and suicide are much more likely to be successful with small crew sizes, so these scenarios, however dark they may be, need to be guarded against in an effective fashion (2).

One way to guard against such common mode failures of the human is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to our affect-driven processing, with a sufficient grasp of theory of mind and the subtleties of human psychology and group dynamics to be able to make usefully accurate predictions of what the crew will do next. With that insight goes the requirement for autonomy in vetoing illogical and patently hazardous crew actions, e.g. ”I’m sorry Captain, but I’m afraid I can’t let you reduce the cabin air pressure to hazardous levels”. The really difficult problem is of course building something sophisticated enough to understand ‘hinky’ behaviour and then intervene. There are however other scenarios where some form of lesser AI would be of use. The Helios Airways depressurisation is a good example of an incident where both flight crew were rendered incapacitated, so a system that does the equivalent of “Dave! Dave! We’re depressurising, unless you intervene in 5 seconds I’m descending!” would be useful. Then there’s the good old scenario of both pilots falling asleep, as likely happened at Minneapolis, so something like “Hello Dave, I can’t help but notice that your breathing indicates that you and Frank are both asleep, so WAKE UP!” would be helpful here. Oh, and someone to punch out a quick Mayday while the pilots are otherwise engaged would also help tremendously, as aircraft going down without a single squawk recurs again and again and again.
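
To make the idea concrete, here’s a toy sketch of the simplest possible version of such a guardian: a rule-based monitor that vetoes a patently hazardous cabin altitude command and escalates when the crew are unresponsive. The interface, names and thresholds are all hypothetical, and of course the genuinely hard part, recognising ‘hinky’ behaviour, is exactly what a few if-statements can’t do.

```python
# Toy sketch of a rule-based cockpit 'guardian' -- hypothetical interface and
# thresholds throughout. A real cognitive agent would need far more than this;
# the point is simply to show the veto/alert/act pattern discussed above.

from dataclasses import dataclass

CABIN_ALT_HAZARD_FT = 14_000      # above this, crew hypoxia becomes a real risk
CREW_RESPONSE_TIMEOUT_S = 5       # seconds of no crew response before acting

@dataclass
class CabinState:
    commanded_cabin_alt_ft: float
    actual_cabin_alt_ft: float
    seconds_since_crew_input: float

def guardian(state: CabinState) -> str:
    """Return the guardian's action for the current cabin state."""
    # Veto a commanded cabin altitude that would incapacitate the crew.
    if state.commanded_cabin_alt_ft > CABIN_ALT_HAZARD_FT:
        return "VETO: commanded cabin altitude exceeds hypoxia threshold"
    # Unannunciated depressurisation with an unresponsive crew: alert, then act.
    if state.actual_cabin_alt_ft > CABIN_ALT_HAZARD_FT:
        if state.seconds_since_crew_input > CREW_RESPONSE_TIMEOUT_S:
            return "ACT: initiating emergency descent and broadcasting Mayday"
        return "ALERT: depressurisation detected, crew response required"
    return "OK: no intervention"

print(guardian(CabinState(commanded_cabin_alt_ft=16_000,
                          actual_cabin_alt_ft=8_000,
                          seconds_since_crew_input=1)))
```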

I guess I’ve slowly come to the conclusion that two-man crews, while optimised for cost, are distinctly sub-optimal when it comes to dealing with a number of human factors issues, and likewise sub-optimal when it comes to dealing with major ‘left field’ emergencies that aren’t in the QRH. Fundamentally a dual redundant design pattern for people doesn’t really address the likelihood of what we might call common mode failures. While we probably can’t get another human crew member back in the cockpit, working to make the cockpit automation more collaborative and less ‘strong but silent’ would be a good start. And of course if the aviation industry wants to keep making improvements in aviation safety then these are the sort of issues it’s going to have to tackle. Where is a good AI, or even an uninterruptible autopilot, when you really need one?

Notes

1. Kenedi (2016) found that from 1999 to 2015 there had been 18 cases of homicide-suicide involving 732 deaths.

2. ‘No go alone’ rules are unfortunately only partially effective.

References

Kenedi, C., Friedman, S.H., Watson, D., Preitner, C., Suicide and Murder-Suicide Involving Aircraft, Aerospace Medicine and Human Performance, Aerospace Medical Association, 2016.

People must retain control of autonomous vehicles

Talking to one another not intuitive for engineers…

Bath Iron Works Corporation Report 1995

One of the things that they don’t teach you at university is that as an engineer you will never have enough time. There’s never the time in the schedule to execute that perfect design process in your head, the one which will answer all questions, satisfy all stakeholders and optimise your solution to three decimal places. Worse yet, you’re going to be asked to commit to parts of your solution in detail before you’ve finished the overall design, because for example, ‘we need to order the steel now because there’s a 6 month lead time, so where’s the steel bill?’. Then there’s dealing with the ‘internal stakeholders’ in the design process, who all have competing needs and agendas. You generally end up with the electrical team hating the mechanicals, nobody talking to structures and everybody hating manufacturing (1).

So good engineering managers (2) spend a lot of their time managing the risk of early design commitment and the inherent concurrency of the program, disentangling design snarls and adjudicating turf wars over scarce resources (3). Get it right and you’re only late by the usual amount; get it wrong and very bad things can happen. Strangely you’ll not find a lot of guidance in traditional engineering education on these issues, but for my part what I’ve found helpful is a pragmatic design process that actually supports you in doing the tough stuff (4). Oh, and being able to manage outsourcing of design would also be great (5). This all gets even more difficult when you’re trying to do vehicle design, which I liken to trying to stick 5 litres of stuff into a 4 litre container. So at the link below is my architecting approach to managing at least part of the insanity. I doubt that there’ll ever be a perfect answer to this, there are far too many constraints, competing agendas and just plain cussedness of human beings. But if your last project was anarchy dialled up to eleven it might be worth considering some of these possible approaches. Hope it helps, and good luck!

Notes

1. It is a truth universally acknowledged that engineers are notoriously bad at communicating.

2. We don’t talk about the bad.

3. Such as cable and piping routes, whose sensor goes at the top of the mast, mass budgets, and power constraints. I’m sure we’ve all been there.

4. My observation is that (some) engineers tend to design their processes to be perfect and conveniently ignore the ugly messiness of the world, because they are uncomfortable with being accountable for decisions made under uncertainty. Of course when you can’t both follow these processes and get the job done, these same engineers will use this as a shield from all blame, e.g. ‘If you’d only followed our process…’ say they, ‘sure…’ say I.

5. A traditional management ploy to reduce costs, but rarely does management consider that you then need to manage that outsourced effort, which takes a particular mix of skills. Yet another kettle of dead fish, as Boeing found out on the B787.

Reference

1.  A Vehicle design process v1.0

So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?

In slightly longer form: if (for example) air data is so unreliable that your automation needs to automatically drop out of its primary mode, and your QRH procedure is then to manually fly pitch and thrust (1), then why not also automatically present a display page that provides only the data the pilots can trust and that is needed to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’, where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software, it’s over to you, brave aviator” (3). This response is however something of a cop-out, as what is needed is not a canned response to such events but rather a flexible decision-making and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:

  1. Redesign the attitude display with articulated pitch ladders, or a Malcolm’s horizon, to improve situational awareness.
  2. Provide a fallback AoA source using an AoA estimator.
  3. Provide actual direct access to flight data parameters such as mach number and AoA to support troubleshooting (5).
  4. Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
  5. Use non-Aristotelian logic to better model the trustworthiness of air data (a toy sketch of channel-trust flagging follows this list).
  6. Provide the current master/slave hierarchy status amongst voting channels to aircrew.
  7. Provide an obvious and intuitive way to remove a faulted channel, allowing flight under reversionary laws (7).
  8. Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).
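
On the subject of items 5 and 6, here’s a toy sketch of what exposing channel trust to the crew might look like: a median-based plausibility check across redundant ADR airspeed sources that flags the outlier rather than silently dropping the automation. The channel names, threshold and interface are my assumptions, not the actual Airbus voting scheme.

```python
# Minimal sketch of a median-based plausibility check across redundant air
# data channels. Channel names, the agreement threshold and the idea of
# reporting a per-channel trust flag to the crew are assumptions for
# illustration, not the actual Airbus voting scheme.

from statistics import median

AGREEMENT_THRESHOLD_KT = 15.0   # assumed: max tolerable deviation from median

def assess_channels(airspeeds_kt: dict) -> dict:
    """Flag each ADR airspeed as TRUSTED or SUSPECT relative to the median."""
    mid = median(airspeeds_kt.values())
    return {
        channel: "TRUSTED" if abs(value - mid) <= AGREEMENT_THRESHOLD_KT
        else f"SUSPECT (deviates {abs(value - mid):.0f} kt from median)"
        for channel, value in airspeeds_kt.items()
    }

# ADR1 iced over and under-reading: the crew see which channel is suspect
# rather than just losing their automation with no explanation.
print(assess_channels({"ADR1": 182.0, "ADR2": 251.0, "ADR3": 248.0}))
```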

As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish; in fact it is likely to increase, if the past history of automation is any guide to the future.

Notes

1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps, as the aircraft was already in a safe state, and waited for the icing-induced condition to clear.

2. Although the Airbus Back Up Speed Display (BUSS) does use angle-of-attack data to provide a speed range, and GPS height data to replace barometric altitude, it has problems at high altitude, where Mach number rather than speed becomes significant and the stall threshold changes with Mach number (which it does not know). As a result its use is (as per Airbus manuals) restricted to below FL 250.

3. What system designers do, in the abstract, is decompose and allocate system level behaviours to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except, ‘apparently’, if the component in question is a human and therefore considered to be ‘outside’ your system.

4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).

5. For example, in the Airbus design, although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second, they are not directly available to aircrew.

6. Yet another way of looking at the problem is that the principles of ecological design need to be applied to the aircrew task of dealing with contingency situations.

7. For example, in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel and deselect two ADRs… which ones, and the criteria for choosing them, are not however detailed by the manufacturer.

8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.

One of the perennial problems we face in a system safety program is how to come up with a convincing proof of the proposition that a system is safe. Because it’s hard to prove a negative (in this case the absence of future accidents) the usual approach is to pursue a proof by contradiction, that is, develop the negative proposition that the system is unsafe, then prove that this is not true, normally by showing that the set of identified specific propositions of ‘un-safety’ have been eliminated or controlled to an acceptable level. Enter the term ‘hazard’, which in this context is simply shorthand for a specific proposition about the unsafeness of a system. Now interestingly, when we parse the set of definitions of hazard we find the recurring use of terms like ‘condition’, ‘state’, ‘situation’ and ‘event’ that, should they occur, will inevitably lead to an ‘accident’ or ‘mishap’. So broadly speaking a hazard is an explanation, based on a defined set of phenomena, which argues that if those phenomena are present, and given there exists some relevant domain source (1) of hazard, an accident will occur. All of which seems to indicate that hazards belong to a class of explanatory models called covering laws. As an explanatory class, covering law models were developed by the logical positivist philosophers Hempel and Popper because of what they saw as problems with an over-reliance on inductive arguments as to causality.

As a covering law explanation of unsafeness, a hazard posits phenomenological facts (system states, human errors, hardware/software failures and so on) that confer what’s called nomic expectability on the accident (the thing being explained). That is, the phenomenological facts, combined with some covering law (natural and logical), require the accident to happen, and this is what we call a hazard. We can see an archetypal example in Swallom’s Source-Mechanism-Outcome model, i.e. if we have both a source and a set of mechanisms in that model then we may expect an accident (Ericson 2005). While logical positivism had the last nails driven into its coffin by Kuhn and others in the 1960s, and it’s true, as Kuhn and others pointed out, that covering law explanations have their fair share of problems, so too do other methods (2). The one advantage that covering law models do possess over other explanatory models, however, is that they largely avoid the problems of causal arguments. Which may well be why they persist in engineering arguments about safety.

Notes

1. The source in this instance is the ‘covering law’.

2. Such as counterfactual, statistical relevance or causal explanations.

References

Ericson, C.A. Hazard Analysis Techniques for System Safety, page 93, John Wiley and Sons, Hoboken, New Jersey, 2005.

Here’s a working draft of the introduction and first chapter of my book…. Enjoy 🙂

We are hectored on an almost daily basis about the imminent threat of Islamic extremism and how we must respond firmly to this real and present danger. Indeed we have proceeded far enough along the escalation-of-response ladder that this presumably existential threat is now being used to justify talk of internment without trial. So what is the probability that, if you were murdered, the murderer would be an immigrant terrorist?

In NSW in 2014 there were 86 homicides; of these, 1 was directly related to the act of a homegrown Islamist terrorist (1). So there’s a 1 in 86 chance that, in that year, if you were murdered it was at the hands of a mentally disturbed asylum seeker (2). Hmm, sounds risky, but is it? Well there were approximately 2.5 million people in NSW in 2014, so the likelihood of being murdered (in that year) is in the first instance 3.44e-5. To figure out the likelihood of being murdered and that murder being committed by a terrorist, we just multiply this base rate by the probability that it was at the hands of a ‘terrorist’, ending up with 4e-7, or 4 chances in 10 million for that year. If we consider subsequent and prior years where nothing happened, that likelihood becomes even smaller.
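
Here’s the arithmetic in a couple of lines, using the post’s own figures:

```python
# The post's arithmetic, using its own figures for NSW in 2014.

population = 2_500_000        # population figure used in the post
homicides = 86                # homicides in NSW, 2014
terrorist_homicides = 1       # deaths directly attributable to a terrorist

p_murdered = homicides / population                            # ~3.44e-5
p_terrorist_given_murdered = terrorist_homicides / homicides   # 1 in 86
p_murdered_by_terrorist = p_murdered * p_terrorist_given_murdered

print(f"P(murdered in 2014)                = {p_murdered:.2e}")
print(f"P(murdered by a terrorist in 2014) = {p_murdered_by_terrorist:.2e}")
# ~4e-7, i.e. about 4 chances in 10 million for that year.
```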

Based on this 4 in 10 million chance the NSW government intends to build a ‘super-max 2’ prison in NSW and fill it with ‘terrorists’, while the Federal government enacts more anti-terrorism laws that take us down the road to the surveillance state, if we’re not there already. The glaring difference between the perception of risk and the actuality is one that politicians and commentators alike seem oblivious to (3).

Notes

1. One death during the Lindt Café siege could be directly attributed to the ‘terrorist’.

2. Sought and granted in 2001 by the then Liberal National Party government.

3. An action that also ignores the role of prisons in converting inmates to Islam as a route to recruiting their criminal, anti-social and violent sub-populations in the service of Sunni extremists.

The Sydney Morning Herald published an article this morning that recounts the QF72 in-flight upset from the point of view of the crew and passengers; you can find the story at this link. I’ve previously covered the technical aspects of the accident here, the underlying integrative architecture program that brought us to this point here, and the consequences here. So it was interesting to reflect on the event from the human perspective. Karl Weick points out in his influential paper on the Mann Gulch fire disaster that small organisations, for example the crew of an airliner, are vulnerable to what he termed a cosmology episode, that is, a moment in which one abruptly and deeply feels that the universe is no longer a rational, orderly system. In the case of QF72 this was initiated by the simultaneous stall and overspeed warnings, followed by the abrupt pitch-over of the aircraft as the flight protection laws engaged for no reason.

Weick further posits that what makes such an episode so shattering is that both the sense of what is occurring and the means to rebuild that sense collapse together. In the Mann Gulch blaze the fire team’s organisation attenuated and finally broke down as the situation eroded, until at the end they could not comprehend the one action that would have saved their lives: to build an escape fire. In the case of aircrew, they implicitly rely on the aircraft’s systems to ‘make sense’ of the situation; a significant failure such as occurred on QF72 denies them both an understanding of what is happening and the ability to rebuild that understanding. Weick also noted that in such crises organisations are important as they help provide people with order and meaning in ill-defined and uncertain circumstances, which has interesting implications when we look at the automation in the cockpit as another member of the team.

“The plane is not communicating with me. It’s in meltdown. The systems are all vying for attention but they are not telling me anything…It’s high-risk and I don’t know what’s going to happen.”

Capt. Kevin Sullivan (QF72 flight)

From this Weickian viewpoint we see the aircraft’s automation as both part of the situation (‘what is happening?’) and as a member of the crew (‘why is it doing that, can I trust it?’). Thus the crew of QF72 were faced with both a vu jàdé moment and the allied disintegration of the human-machine partnership that could help them make sense of the situation. The challenge that the QF72 crew faced was not to form a decision based on clear data and well-rehearsed procedures from the flight manual; instead they faced a much more unnerving loss of meaning as the situation outstripped their past experience.

“Damn it! We’re going to crash. It can’t be true!” (copilot #1)

“But, what’s happening?” (copilot #2)

AF447 CVR transcript (final words)

Nor was this an isolated incident. One study of other such ‘unreliable airspeed’ events found that errors in understanding were both far more likely to occur than other error types and, when they did occur, much more likely to end in a fatal accident. In fact the study found that all accidents with a fatal outcome involved an error in detection or understanding, with the majority being errors of understanding. From Weick’s perspective, then, the collapse of sensemaking is the knock-out blow in such scenarios, as the last words of the Air France AF447 crew so grimly illustrate. Luckily in the case of QF72 the aircrew were able to contain this collapse and rebuild their sense of the situation; in the case of other such failures, such as AF447, they were not.

 

For those of you who might be wondering at the lack of recent posts I’m a little pre-occupied at the moment as I’m writing a book. Hope to have a first draft ready in July. ; )

And there goes net neutrality & privacy… Thanks Trump

With the NSW Rural Fire Service fighting more than 50 fires across the state, and the unprecedented hellish conditions set to deteriorate even further with the arrival of strong winds, the question of the day is: exactly how bad could this get? The answer is, unfortunately, a whole lot worse. That’s because we have difficulty as human beings in thinking about and dealing with extreme events… To quote from this post, written in the aftermath of the 2009 Victorian Black Saturday fires:

So how unthinkable could it get? The likelihood of a fire versus its severity can be credibly modelled as a power law, a particular type of heavy-tailed distribution (Clauset et al. 2007). This means that extreme events in the tail of the distribution are far more likely than a Gaussian (the classic bell curve) distribution would predict. So while a mega-fire ten times the size of the Black Saturday fires is far less likely, it is nowhere near as improbable as our intuitive availability heuristic would suggest. In fact it’s much worse than we might think: for heavy-tailed distributions you need to apply what’s called the mean excess heuristic, which really translates to ‘the next worst event is almost always going to be much worse’…
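
To see just how different the two assumptions are, here’s a small sketch comparing tail probabilities under a Gaussian and a power law. The parameters are illustrative picks of mine, not a fit to actual fire data.

```python
# Illustrative comparison of tail weight: Gaussian vs power law.
# Parameters are arbitrary choices for the sketch, not fitted to fire data.

import math

def gaussian_tail(x: float, mu: float, sigma: float) -> float:
    """P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

def powerlaw_tail(x: float, xmin: float, alpha: float) -> float:
    """P(X > x) for a Pareto (power law) distribution with exponent alpha."""
    return (x / xmin) ** (1 - alpha)

# Treat '1.0' as a Black-Saturday-sized fire and ask about ones far larger.
mu, sigma = 1.0, 0.5          # Gaussian centred on the 'typical large fire'
xmin, alpha = 1.0, 2.0        # power law with a heavy tail

for size in (2, 5, 10):
    g = gaussian_tail(size, mu, sigma)
    p = powerlaw_tail(size, xmin, alpha)
    print(f"fire {size:>2}x as severe: Gaussian P={g:.1e}, power law P={p:.1e}")
# The Gaussian says a 10x fire is essentially impossible (well below 1e-70);
# the power law says it's a one-in-ten kind of event.
```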

So how did we get to this? Simply put, the extreme weather we’ve been experiencing is a tangible, present-day effect of climate change. Climate change is not something we can leave to our children to worry about; it’s happening now. That half a degree rise in global temperature? Well it turns out it supercharges the occurrence rate of extremely dry conditions and the heavy tail of bushfire severity. Yes, we’ve been twisting the dragon’s tail and now it’s woken up…

2019 Postscript: Monday 11 November 2019 – NSW

And here we are in 2019 two years down the track from the fires of 2017 and tomorrow looks like being a beyond catastrophic fire day. Firestorms are predicted.

Matrix (Image source: The Matrix film)

How algorithms can kill…

So apparently the Australian Government has been buying its software from Cyberdyne Systems, or at least you’d be forgiven for thinking so given the brutal treatment Centrelink’s autonomous debt recovery software has been handing out to welfare recipients who ‘it’ believes have been rorting the system. Yep, you heard right, it’s a completely automated compliance operation (well, at least the issuing part). Continue Reading…

A recent workplace health and safety case in Australia has emphasised that an employer does not have to provide training for tasks that are considered to be ‘relatively’ straightforward. The presiding judge also found that while changes to the workplace could in theory be made, in practice it would be unreasonable to demand that the employer make such changes. The judge’s decision was subsequently upheld on appeal.

What’s interesting is the close reasoning of the court (and the appellate court) in establishing what is reasonable and practicable in the circumstances. While the legal system is not perfect, it does have a long-standing set of practices and procedures for getting at the truth. Perhaps we may be able to learn something from the legal profession when thinking about the safety of critical systems. More on this later.

Cowie v Gungahlin Veterinary Services Pty Ltd [2016] ACTSC 311 (25 October 2016)

Second part of the SBS documentary online now, looking at the IoT this episode.

Cyberwar documentary now running on SBS with a good breakdown of the Stuxnet malware courtesy of the boys at Symantec. Thank you NSA, once again, for the bounty of Stuxnet… Yes, indeed thank you. 

Donald Trump

Image source: AP/LM Otero

A Trump presidency in the wings, who’d have thought! And what a total shock it was to all those pollsters, commentators and apparatchiks who are now trying to explain why they got it so wrong. All of which is a textbook example of what students of risk theory call a Black Swan event. Continue Reading…

An Outside Context Problem was the sort of thing most civilisations encountered just once, and which they tended to encounter rather in the same way a sentence encountered a full stop. 

Iain Banks

To err is inhuman

10/11/2016

Screwtape (Image source: end time info)

More infernal statistics

Well, here we are again. Given recent developments in the infernal region it seems like a good time for another post. Have you ever, dear reader, been faced with the problem of how to achieve an unachievable safety target? Well worry no longer! Herewith is Screwtape’s patented man-based mitigation medicine.

The first thing we do is introduce the concept of ‘mitigation’, ah what a beautiful word that is. You see, it’s saying that it’s OK that your system doesn’t meet its safety target, because you can claim credit for the action of an external mitigator in the environment. Probability-wise, if the probability of an accident is P_a, then P_a equals the product of your system’s failure probability P_s and the probability that some external mitigation also fails P_m, that is P_a = P_s × P_m.

So let’s use operator intervention as our mitigator, lovely and vague. But how to come up with a low enough P_m? Easy, we just look at the accident rate that has occurred for this or a like system and assume that those accidents were due to operator mitigation being unsuccessful. Voilà, we get our really small numbers.

Now, an alert reader might point out that this is totally bogus and that P_m is actually the likelihood of operator failure when the system fails. Operators failing, as those pestilential authors of the WASH-1400 study have pointed out, is actually quite likely. But I say, if your customer is so observant and on the ball then clearly you are not doing your job right. Try harder or I may eat your soul, yum yum.

Yours hungrily, 

Screwtape.
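
For the sceptical reader, here’s a small sketch of why Screwtape’s trick is bogus, contrasting the ‘backed out’ mitigation credit with an honest conditional human error probability. All the figures are invented for illustration.

```python
# Why Screwtape's 'mitigation' trick is bogus -- all figures invented for
# illustration.

P_SYSTEM_FAILURE = 1e-3        # probability the system fails (per demand)

# Screwtape's move: back out P_m from the target accident rate and claim
# 'independent' operator mitigation covers the gap.
P_ACCIDENT_TARGET = 1e-6
p_m_claimed = P_ACCIDENT_TARGET / P_SYSTEM_FAILURE       # a convenient 1e-3

# The honest version: P_m is the probability the operator fails to intervene
# *given* the system has already failed -- a stressful, time-limited task.
p_operator_fails_given_failure = 0.1   # WASH-1400-ish human error territory
p_accident_honest = P_SYSTEM_FAILURE * p_operator_fails_given_failure

print(f"Claimed operator failure probability  : {p_m_claimed:.0e}")
print(f"Realistic operator failure probability: {p_operator_fails_given_failure:.0e}")
print(f"Honest accident probability           : {p_accident_honest:.0e}")
# Two orders of magnitude shy of the 1e-6 target Screwtape 'achieved'.
```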

The internet goes nuclear

Never confuse volume with authority.

Graham Long


A clank of botnets

More bad news for the Internet this week as a plague of botnets launched a successful wave of denial of service attacks on Dyn, a dynamic domain name service provider. The attacks on Dyn propagated through to services such as Twitter (OK, no great loss), GitHub, The Verge, PlayStation Network, Box and Wix. Continue Reading…

Screwtape (Image source: end time info)

Well hello there, it’s been a while hasn’t it?

In the absence of our good host I thought I’d just pop in and offer some advice on how to use statistics for requirements compliance. Now of course what I mean by requirements compliance is that ticklish situation where the customer has you over the proverbial barrel with an eye-gouger of a requirement. What to do, what to do. Well, dear reader, all is not lost; what one can do is subtly rework the requirement right in front of the customer without them even recognising it…

No! I hear you say, ‘how can this wonder be achieved Screwtape?’

Well it’s really quite simple, when one understands that requirements are to a greater or lesser extent ‘operationally’ defined by their method of verification. That means that just as requirements belong to the customer, so too should the method one uses to demonstrate that you’ve met them. Now if you’re in luck the customer doesn’t realise this, so you propose adopting a statistical proof of compliance, throw in some weaselling about process variability, based on the median of a sample of tests. Using the median is important, as it’s more resistant to outlier values, which is what we want to obfuscate (obviously). As the method of verification defines the requirement, all of a sudden you’ve taken the customer’s deterministic requirement and turned it into a weaker probabilistic one. Even better, you now have psychological control over half of the requirement, ah the beauty of psychological framing effects.

Now if you’ll excuse me, all this talk of statistics has reminded me that I have some souls to reap over at the Australian Bureau of Statistics*. Mmm, those statisticians, their souls are so dry and filled with tannin, just like a fine pinot noir.

Till the next time. Yours infernally,

Screwtape

*Downstairs senior management were not amused by having to fill out their name and then having a census checker turn up on their doorstep asking whether they were having a lend of the ABS.
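
For the record, here’s a small sketch of the median trick Screwtape describes, with invented test data: the same sample passes ‘verification by median’ while failing the worst-case limit the customer actually asked for.

```python
# Screwtape's requirement-weakening trick, with invented test data: the
# customer's limit is 100 ms worst case, but 'verification by median' passes
# anyway.

from statistics import median

LIMIT_MS = 100.0                                   # the customer's requirement
response_times_ms = [62, 71, 68, 75, 180, 66, 73]  # one nasty outlier

passes_deterministic = max(response_times_ms) <= LIMIT_MS   # what was asked for
passes_by_median = median(response_times_ms) <= LIMIT_MS    # what Screwtape offers

print(f"Worst case {max(response_times_ms)} ms -> compliant: {passes_deterministic}")
print(f"Median     {median(response_times_ms)} ms -> compliant: {passes_by_median}")
# Same data, same requirement text, very different claims of compliance.
```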

Accidents with potentially catastrophic outcomes pose a particular challenge to classical utilitarian theories of managing risk. A reader of this blog might be aware of how the possibility of irreversible catastrophic outcomes (i.e. non-ergodicity) undermines a key assumption on which classical risk assessment is based. But what to do about it? Well one thing we can practically do is to ensure that when we assess risk we take into account the irreversible (non-ergodic) nature of such catastrophes, and there are good reasons that we should do so, as the law does not look kindly on organisations (or people) who make decisions about risk of death purely on the basis of frequency gambling.

A while ago I put together a classical risk matrix (1) that treated risk in accordance with De Moivre’s formulation, and I’ve now modified this matrix to explicitly address non-ergodicity. The modification is to the extreme (catastrophic) severity column, where I’ve shifted the boundary of unacceptable risk downwards to reflect that the (classical) iso-risk contour in the catastrophic case under-estimates the risk posed by catastrophic irreversible outcomes. The matrix now also imposes claim limits on risk where a SPOF may exist that could result in a catastrophic loss (2). We end up with something that looks a bit like the matrix below (3).

[Figure: modified risk matrix]

From a decision-making perspective you’ll note that not only is the threshold for unacceptable risk reduced, but that for catastrophic severity (one or more deaths) there is no longer an ‘acceptable’ threshold. This is an important consideration, reflecting as it does the law’s position that you cannot gamble away your duty of care, e.g. justify not taking an action purely on the basis of a risk threshold (4). The final outcome of this work, along with revised likelihood and severity definitions, can be found in hazard matrix V1.1 (5). I’m still thinking about how you might introduce more consideration of epistemic and ontological risks into the matrix; it’s a work in progress.
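
As an illustration of the sort of decision rule the modified matrix encodes, here’s a minimal sketch. The category boundaries and labels are placeholders of mine, not the calibrated values in the linked matrix.

```python
# Minimal sketch of the modified matrix's decision logic: in the catastrophic
# column the acceptability threshold is shifted down, there is no plain
# 'acceptable' cell, and a claim limit applies where a single point of failure
# (SPOF) could cause the loss. Boundaries and labels are placeholders, not the
# calibrated values of the actual matrix.

SEVERITIES = ["minor", "major", "severe", "catastrophic"]
LIKELIHOODS = ["improbable", "remote", "occasional", "probable", "frequent"]

def assess(severity: str, likelihood: str, spof: bool = False) -> str:
    s, l = SEVERITIES.index(severity), LIKELIHOODS.index(likelihood)
    if severity == "catastrophic":
        if spof:
            return "UNACCEPTABLE (claim limit: SPOF with catastrophic loss)"
        # Threshold shifted down, and nothing in this column is ever simply
        # 'acceptable': a duty-of-care argument is always required.
        return "UNACCEPTABLE" if l >= 2 else "TOLERABLE (duty-of-care argument required)"
    # Classical iso-risk treatment for the non-catastrophic columns.
    score = s + l
    if score >= 6:
        return "UNACCEPTABLE"
    return "TOLERABLE" if score >= 3 else "ACCEPTABLE"

print(assess("catastrophic", "improbable"))             # never plain 'acceptable'
print(assess("catastrophic", "improbable", spof=True))  # claim limit bites
print(assess("minor", "remote"))                        # classical treatment
```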

Notes

1. Mainly to provide a canonical example of what a well constructed matrix should look like as there are an awful lot of bad ones floating around.

2. You have to either eliminate the SPOF or reduce the severity. There’s an implied treatment of epistemic uncertainty in such a claim limit that I find appealing.

3. The star represents a calibration point that’s used when soliciting subjective assessments of likelihood from SMEs.

4. By the way, you’re not going to find these sorts of considerations in ISO 31000.

5. Important note: like all risk matrices it needs to be calibrated to the actual circumstances and risk appetite of the organisation. No warranty given and YMMV.

One of the great mistakes is to judge policies and programs by their intentions rather than their results

Milton Friedman

Whiteboard work

28/09/2016

Working through a spot of introductory work on FFBDs with the HMAS Cerberus class, coffee after all is important! 🙂

Simple sabotage…

15/09/2016

Earlier this year the US Government declassified a WWII OSS field manual on sabotage. Now the Simple Sabotage Field Manual is not what you might think. No, it’s not a 101 on blowing up bridges, nor is it a cookbook for how to conduct Operation Kutschera; rather it’s aimed at a lower-key sabotage of ordinary working practices inside the organisation. For example, using conferences and meetings to strategically delay decision making. Nobody gets killed, but that new Panzer design with the Porsche turret? Well, sorry Reichsmarschall, it’ll be buried in design committee until about 1948. Charlie Stross went on to Twitter asking for modern updates to the OSS manual; I’m not sure whether that exercise increased or decreased the net sum of human happiness, but hey, it was amusing.

Which got me to thinking: if you read the OSS manual and find that every working day seems like a textbook play courtesy of the boys from Prince William Park, then shouldn’t you logically conclude that you are sitting in the middle of a war? If you see folk in your organisation regularly using moves out of the OSS playbook they may not be just haplessly incompetent. If nothing else this should make you look at your daily fare of corporate hooey in a new light. So stay frosty people, and remember: three times is enemy action.

🙂

About time I hear you say! 🙂

Yes, I’ve just rewritten a post on functional failure taxonomies to include how to use them to gauge the completeness of your analysis. This came out of a question I was asked in a workshop that went something like, ‘OK, Mr big-shot consultant, tell us, exactly how do we validate that our analysis is complete?’. That’s actually a fair question; standards like EUROCONTROL’s SAM Handbook and ARP 4761 tell you you ought to, but are not that helpful in the how-to-do-it department. Hence this post.

Using a taxonomy to determine the coverage of the analysis is one approach to determining completeness. The other is to perform at least two analyses using different techniques and then compare the overlap of hazards using a capture/recapture technique. If there’s a high degree of overlap you can be confident there’s only a small hidden population of as yet unidentified hazards. If there’s a very low overlap, you may have a problem.
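
For the capture/recapture approach, here’s a minimal sketch using the Lincoln–Petersen estimator; the hazard counts are invented for illustration.

```python
# Capture/recapture (Lincoln-Petersen) estimate of the total hazard
# population from two independent analyses. Counts are invented.

def lincoln_petersen(n1: int, n2: int, overlap: int) -> float:
    """Estimate total population given two samples and their overlap."""
    if overlap == 0:
        raise ValueError("No overlap: the hidden population may be very large.")
    return n1 * n2 / overlap

hazards_fha = 40        # hazards found by, say, a functional hazard analysis
hazards_hazop = 35      # hazards found by a HAZOP-style review
common = 28             # hazards found by both

estimated_total = lincoln_petersen(hazards_fha, hazards_hazop, common)
estimated_hidden = estimated_total - (hazards_fha + hazards_hazop - common)

print(f"Estimated total hazards : {estimated_total:.0f}")
print(f"Estimated still hidden  : {estimated_hidden:.0f}")
# High overlap -> small estimated hidden population; low overlap -> worry.
```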

The 15 commandments

04/09/2016

10cmdts

The 15 commandments of the god of the machine

Herewith are the 15 commandments for thy safety-critical software, as spoken by the machine god unto his prophet Kopetz.

  1. Thou shalt regard the system safety case as thy tabernacle of safety and derive thine critical software failure modes and requirements from it.
  2. Thou shalt adopt a fundamentally safe architecture and define thy fault tolerance hypothesis as part of this. Even unto the definition of fault containment regions, their modes of failure and likelihood.
  3. Thine fault tolerance shall include start-up, operating and shutdown states.
  4. Thine system shall be partitioned to ‘divide and conquer’ the design. Yea, such partitioning shall include the precise specification of component interfaces by time and value, such that all manner of men shall comprehend them.
  5. Thine project team shall develop a consistent model of time and state, for even unto the concept of states and fault recovery by voting is the definition of time important.
  6. Yea even though thou hast selected a safety architecture pleasing to the lord, yet it is but a house built upon the sand, if no ‘programming in the small’ error detection and fault recovery is provided.
  7. Thou shalt ensure that errors are contained and do not propagate through the system, for an error idly propagated to a service interface is displeasing to the lord god of safety and invalidates your righteous claims of independence.
  8. Thou shalt ensure independent channels and components do not have common mode failures, for it is said that homogeneous redundant channels protect only from random hardware failures, neither from a common external cause such as EMI or power loss, nor from the common software design fault.
  9. Thine voting software shall follow the self-confidence principle for it is said that if the self-confidence principle is observed then a correct FCR will always make the correct decision under the assumption of a single faulty FCR, and only a faulty FCR will make false decisions.
  10. Thou shall hide and separate thy fault-tolerance mechanisms so that they do not introduce fear, doubt and further design errors unto the developers of the application code.
  11. Thou shalt design thy system for diagnosis, for it is said that even a righteously designed fault tolerant system may hide such faults from view, yet thy system’s maintainers must still replace the affected LRU.
  12. Thine interfaces shall be helpful and forgive the operator his errors; neither shall thine system dump the problem in the operator’s lap without prior warning of impending doom.
  13. Thine software shall record every single anomaly, for your lord god requires that every anomaly observed during operation be investigated until a root cause is defined (a sketch of such an anomaly log follows this list).
  14. Thou shalt mitigate further hazards introduced by thy design decisions, for better it is that thou not program in C++, yet still is it righteous to prevent the dangling of thine pointers and memory leaks.
  15. Thou shalt develop a consistent fault recovery strategy such that even in the face of violations of thy fault hypothesis thine system shall restart and never give up.
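
And as a concrete illustration of commandment 13, here’s a minimal sketch of an anomaly log that records every observed anomaly for later root cause investigation; the record structure and field names are my own invention, not Kopetz’s.

```python
# Minimal sketch of commandment 13: record every anomaly observed in
# operation so that each one can be investigated to a root cause. The record
# structure and field names are invented for illustration.

import json
import time

ANOMALY_LOG = "anomaly_log.jsonl"

def record_anomaly(component: str, description: str, context: dict) -> None:
    """Append an anomaly record; nothing observed in operation is discarded."""
    record = {
        "timestamp": time.time(),
        "component": component,
        "description": description,
        "context": context,          # raw values needed for later investigation
        "investigated": False,       # closed only when a root cause is defined
    }
    with open(ANOMALY_LOG, "a") as log:
        log.write(json.dumps(record) + "\n")

record_anomaly(
    component="ADR2",
    description="transient airspeed disagreement with ADR1/ADR3",
    context={"adr1_kt": 250.0, "adr2_kt": 212.0, "adr3_kt": 249.0},
)
```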


Dispatches from the cyber-front

Interesting episode on the ABC’s Four Corners program this Monday that discloses more about the ongoing attacks against government computer networks. Four Corners’ sources confirmed that, as I predicted at the time, the Bureau of Meteorology infiltration was a beachhead operation to allow further attacks on higher value government targets (such as the Australian Geospatial-Intelligence Organisation and intelligence/surveillance assets such as the JORN system). OK, smug mode off. Continue Reading…