Out of this nettle – danger, we pluck this flower – safety.
A pertinent article by Bruce Schneier on the toxicity of long-stored data. Perhaps David Kalisch, head of the ABS, will read this and have a long hard think about what Bruce is saying, but probably not.
Side note. There may be a more direct and specific reason why the Feds have kyboshed the sale of NSW's power poles to the Chinese than woolly national security concerns…
Apparently 2 million Australians trying to use the one ABS website, because they're convinced the government will fine them if they don't, is the freshly minted definition of a “Distributed Denial of Service” (DDoS) attack. 🙂
Alternative theory, who would have thought that foreign nationals (oh all right, we all know it’s the Chinese*) might try to disrupt the census in revenge for those drug cheat comments at the Olympics?
Interesting times. I hope the AEC is taking notes for their next go at electronic voting.
Currently enjoying watching the ABS Census website burn to the ground. Ah schadenfreude, how sweet you are.
Census time again, and those practical jokers at the Australian Bureau of Statistics have managed to spring a beauty on the Australian public. The joke being that, rather than collecting the data anonymously, you are now required to fill in your name and address, which the ABS will retain (1). This is a bad idea; in fact it's a very bad idea, not quite as bad as, say, getting stuck in a never-ending land war in the Middle East, but certainly much worse than experiments in online voting. Continue Reading…
After millions of dollars and years of effort the ATSB has suspended its search for the wreck of MH370. There are some bureaucratic weasel words, but we are done, people. Of course had the ATSB applied Bayesian search techniques, as the USN did in its successful search for the missing submarine USS Scorpion, we might actually know where it is.
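For the curious, the core of Bayesian search theory is just Bayes' rule applied to an unsuccessful search. Here's a minimal sketch, with an invented one-dimensional grid of search cells and an assumed probability of detection; real searches use far richer priors built from drift and trajectory models.

```python
# Toy sketch of Bayesian search theory, as used in the USS Scorpion search.
# Assumptions (illustrative only): a 1-D grid of search cells, a prior over
# where the wreck lies, and a probability of detection (pod) if we search
# the right cell. Searching a cell and finding nothing shifts belief away
# from that cell via Bayes' rule.

def bayes_search_update(prior, searched_cell, pod):
    """Posterior over cells after an unsuccessful search of one cell."""
    posterior = prior.copy()
    # P(miss | wreck in searched cell) = 1 - pod; elsewhere a miss is certain.
    posterior[searched_cell] *= (1.0 - pod)
    total = sum(posterior)
    return [p / total for p in posterior]

prior = [0.1, 0.2, 0.4, 0.2, 0.1]   # belief the wreck is in each of 5 cells
after = bayes_search_update(prior, searched_cell=2, pod=0.8)
# The searched cell's probability drops sharply; the rest rise proportionally.
print(after)
```

Each fruitless pass lowers the searched cell's posterior and raises everyone else's, which is how the Scorpion searchers knew where to look next.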
One can construct convincing proofs quite readily of the ultimate futility of exhaustive testing of a program and even of testing by sampling. So how can one proceed? The role of testing, in theory, is to establish the base propositions of an inductive proof. You should convince yourself, or other people, as firmly as possible, that if the program works a certain number of times on specified data, then it will always work on any data. This can be done by an inductive approach to the proof.
The first fatality involving the use of Tesla's autopilot* occurred last May. The Guardian reported that the autopilot sensors on the Model S failed to distinguish a white tractor-trailer crossing the highway against a bright sky, and the car promptly tried to drive under the trailer, with decapitating results. What's emerged is that the driver had a history of driving at speed and of using the automation beyond the maker's intent, e.g. operating the vehicle hands-off rather than hands-on, as the screen grab above indicates. Indeed recent reports indicate that immediately prior to the accident he was travelling fast (maybe too fast) whilst watching a Harry Potter DVD. There also appears to be a community of like-minded spirits out there who are intent on seeing how far they can push the automation… sigh. Continue Reading…
The NBN, an example of degraded societal resilience?
But the virtues we get by first exercising them, as also happens in the case of the arts as well. For the things we have to learn before we can do them, we learn by doing them, e.g., men become builders by building and lyreplayers by playing the lyre; so too we become just by doing just acts, temperate by doing temperate acts, brave by doing brave acts.
Why writing a safety case might (actually) be a good idea
Frequent readers of my blog would probably realise that I'm a little sceptical of safety cases. As Scrooge remarked to Marley's ghost, “There's more of gravy than of grave about you, whatever you are!” So too for safety cases, oft more gravy than gravitas about them in my opinion, regardless of what their proponents might think.
An argument is defined by what it ignores and the perspectives it opposes (explicitly or implicitly)
Added in the system safety planning module of my system safety course to the freeware available on this site. As Eisenhower remarked it’s all about the planning. 🙂
Update to the safety case module of my UNSW course. Just added a little bit more on how to structure a physical safety case report.
Inertia has a thousand fathers…
Yep, my annual teaching gig at UNSW's Canberra campus is coming up, from July 18th to 22nd inclusive, to be precise. A one week intensive, no-holds-barred tour de force of system safety, and amazingly we still have a few seats left. Yes you too can be thrilled, awed and amused by my pedagogical skills, and if you're still interested in catching a show then check out the reviews.
Of course as this is the 21st century you can also peruse the online course material here, but hey if you want to listen to me, you need to pay. Sarcasm as always is free. 🙂
Here’s a short presentation I gave on the ramifications of the Australian model WHS Act (2011) for engineers when they engage in, or oversee design. The act is a complex beast, and the ramifications of it have not yet fully sunk into the engineering community.
One particularly contentious area is the application of the act to plant and materials that are imported. While the guidance material for the act gives the example of a supplier performing additional testing of the goods to demonstrate that they meet Australian Standards, the reality is, well, a little different.
Safety cases and that room full of monkeys
Back in 1943, the French mathematician Émile Borel published a book titled Les probabilités et la vie, in which he stated what has come to be called Borel’s law which can be paraphrased as, “Events with a sufficiently small probability never occur.” Continue Reading…
Just updated the course notes for safety cases and argument to include more on how to represent safety cases if you are not graphically inclined. All in preparation for the next system safety course in July 2016 at ADFA, places still open folks! A tip o' the hat to Chris Holloway, whose work prompted the additional material. 🙂
[The designers] had no intention of ignoring the human factor. But the technological questions became so overwhelming that they commanded the most attention.
A requirements checklist that can be used to evaluate the adequacy of HMI designs in safety critical applications. Based on the work of Nancy Leveson and Matthew Jaffe.
It is a very sobering feeling to be up in space and realize that one’s safety factor was determined by the lowest bidder on a government contract.
The anniversary of the loss of Challenger passed by on Thursday. In memoriam, I've updated my post that deals with the failure of communication that I think lies at the heart of that disaster.
Once more with feeling
Sonar vessels searching for Malaysia Airlines Flight MH370 in the southern Indian Ocean may have missed the jet, the ATSB's Chief Commissioner Martin Dolan has told News Online. He went on to point out the uncertainties involved, the difficulty of terrain that could mask the signature of wreckage, and that problematic areas would therefore need to be re-surveyed. Despite all that, the Commissioner was confident that the wreckage site would be found by June. Me, I'm not so sure.
A very late draft
I originally gave a presentation at the 2015 ASSC Conference, but never got around to finishing the companion paper. Not my most stellar title either. The paper is basically how to leverage off the very close similarities between the objectives of the WHS Act (2011) and those of MIL-STD882C, yes that standard. You might almost think the drafters of the legislation had a system safety specialist advising them… Recently I’ve had the necessity to apply this approach on another project (a ground system this time) so I took the opportunity to update the draft paper as an aide memoire, and here it is!
The long gone, but not forgotten, second issue of the UK MoD's safety management standard DEFSTAN 00-56 introduced the concept of a qualitative likelihood of Incredible. This is, however, not just another likelihood category. The intention of the standard writers was that it would be used to capture risks that were deemed effectively impossible, given the assumptions about the domain and system. The category was to be applied to those scenarios where the hazard had been designed out, where the design concept had been assessed and the posited hazard turned out simply not to be applicable, or where some non-probabilistic technique had been used to verify the safety of the system (think mathematical proof). Such a category records that yes, it's effectively impossible, while retaining the record of assessment should it become necessary to revisit it, a useful mechanism.
A.1.19 Incredible. Believed to have a probability of occurrence too low for expression in meaningful numerical terms.
DEFSTAN 00-56 Issue 2
I've seen this approach mangled in a number of hazard analyses where the disjoint nature of the Incredible category was not recognised and it was simply assigned a specific likelihood that followed on in a decadal fashion from the next highest category. Yes, difficulties ensued. The key is that Incredible is not the next likelihood bin after Improbable; it is in fact beyond the end of the line, where we park those hazards that we have judged to have an immeasurably small likelihood of occurrence. This, we are asserting, will not happen, and we are as confident of that fact as one can ever be.
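To make the disjointness concrete, here's a toy sketch in Python. The bin names echo typical qualitative scales, but the bin edges are invented for illustration and the function is mine, not the standard's.

```python
# Sketch of the point above (bin edges invented): Improbable and friends
# are frequency bins, while Incredible is a disjoint category recording a
# reasoned belief of 'will not happen', not another decade down the scale.

LIKELIHOOD_BINS = {            # events per operating hour, lower bounds
    "Frequent":   1e-2,
    "Probable":   1e-3,
    "Occasional": 1e-4,
    "Remote":     1e-5,
    "Improbable": 1e-6,
}

INCREDIBLE = "Incredible"      # no number: justified by argument or proof

def classify(rate_or_none):
    """rate_or_none is None when the hazard is argued out of existence."""
    if rate_or_none is None:
        return INCREDIBLE      # disjoint: not reached by any finite rate
    for name, bound in LIKELIHOOD_BINS.items():
        if rate_or_none >= bound:
            return name
    return "Improbable"        # measurably small but still a frequency claim

print(classify(5e-4))   # Occasional
print(classify(1e-9))   # Improbable -- still an assertion about frequency
print(classify(None))   # Incredible -- an assertion of belief, not frequency
```

The point of the sketch is that no frequency, however small, ever lands you in Incredible; only a reasoned argument does.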
“Incredible” may be exceptionally defined in terms of reasoned argument that does not rely solely on numerical probabilities.
DEFSTAN 00-56 Issue 2
To put it another way, the category reflects a statement of our degree of belief that an event will not occur, rather than an assertion as to its frequency of occurrence as the other subjective categories do. What the standard writers have unwittingly done is introduce a superset, in which the 'no hazard exists' set is represented by Incredible and the other likelihoods form the 'a hazard exists' set. All of which starts to sound like a mashup of frequentist probabilities with Dempster-Shafer belief structures. Promising; it's a pity the standards committee didn't take the concept further.
The other pity is that the standards committee didn't link this idea of “incredible” to Borel's law. Had they done so we would have a mechanism to make explicit what I call the infinite monkeys safety argument.
[I]t is better to think of a problem of understanding disasters as a ‘socio-technical’ problem with social organization and technical processes interacting to produce the phenomena to be studied.
The psychological basis of uncertainty
There’s a famous psychological experiment conducted by Ellsberg, called eponymously the Ellsberg paradox, in which he showed that people overwhelmingly prefer a betting scenario in which the probabilities are known, rather than one in which the odds are actually ambiguous, even if the potential for winning might be greater. Continue Reading…
The first principle is that you must not fool yourself and you are the easiest person to fool.
One of the problems that we face in estimating risk is that as our uncertainty increases, our ability to express it in a precise fashion (e.g. numerically) weakens, to the point where under deep uncertainty (1) we definitionally cannot make a direct estimate of risk in the classical sense. Continue Reading…
Just finished updating the Functional Hazard Analysis course notes (to V1.2) to expand and clarify the section on complex interaction style functional failures. To my mind complex interactions are where accidents actually occur, and where the guidance provided by the various standards (see SAMS or ARP 4754) is also at its weakest.
In breaking news the Australian Bureau of Meteorology has been hacked by the Chinese. Government sources are quoted by the ABC as stating that the BoM has definitely been compromised and that this may in turn mean the compromise of other government departments.
We're probably now in the Chinese operation's end game, as their first priority would have been to expropriate (read: steal) as much of the Bureau's intellectual property as they could, given that follow-up exploits of other information systems naturally carry a higher likelihood of detection. The intruders running afoul of someone else who was not quite so asleep at the switch may well be how the breach was eventually detected.
The first major problem is that the Bureau provides services to a host of government and commercial entities, so it's just about as good a platform as you could want from which to launch follow-on campaigns. The second major problem is that you just can't turn the services the Bureau provides off; critical infrastructure is, well, critical. That means in turn that the Bureau's servers can't just go dark while they hunt down the malware. As a result it's going to be very difficult and expensive to root the problem out, and to be sure that it truly has been. Well played PLA unit 61398, well played.
As to how this happened? Well, unfortunately the idea that data is as much critical national infrastructure as, say, a bridge or highway just doesn't seem to resonate with management in most Australian organisations, or at least not enough to ensure there's what the trade calls 'advanced persistent diligence' to go round, or even sufficient situational awareness among management to guard against such evolving high-end threats.
Perusing the FAA’s system safety handbook while doing some research for a current job, I came upon an interesting definition of severities. What’s interesting is that the FAA introduces the concept of safety margin reduction as a specific form of severity (loss).
Here's a summary of Table 3-2 from the handbook:
- Catastrophic – ‘Multiple fatalities and/or loss of system’
- Major – ‘Significant reduction in safety margin…’
- Minor – ‘Slight reduction in safety margin…’
If we think about safety margins for a functional system, they represent a system state that's a precursor to a mishap, with the margin representing some intervening set of states. But a system state of reduced safety margin (let's call it a hazard state) is causally linked to a mishap state, else we wouldn't care, and must therefore inherit its severity. The problem is that in the FAA's definition they have arbitrarily assigned severity levels to specific hazardous degrees of safety margin reduction, yet all of these could still be linked causally to a catastrophic event, e.g. a mid-air collision.
What the FAA's Systems Engineering Council (SEC) has done is conflate severity with likelihood; as a result their severity definition is actually a risk definition, at least when it comes to safety-margin hazards. The problem with this approach is that we end up under-treating risks, as per classical risk theory. For example, say we have a potential reduction in safety margin which is also causally linked to a catastrophic outcome. Now per Table 3-2, if the reduction was classified as 'slight', then we would assess the probability and, given the minor severity, decide to do nothing, even though in reality the severity is still catastrophic. If, on the other hand, we decided to make decisions based on severity alone, we would still end up making a hidden risk judgement, depending on what the likelihood of propagation from hazard state to accident state was (undefined in the handbook). So basically the definitions set you up for trouble even before you start.
My guess is that the SEC decided to fill in the lesser severities with hazard states because, for an ATM system, true mishaps tend to be invariably catastrophic, and they were left scratching their heads for lesser severity mishap definitions. Enter the safety margin reduction hazard. The take-home from all this is that severity needs to be based on the loss event; introducing intermediate hybrid hazard/severity state definitions leads inevitably to an incoherent definition of risk. Oh, and (as far as I am aware) this malformed definition has spread everywhere…
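A toy calculation, with entirely invented numbers, shows how much risk the hybrid definition can hide:

```python
# Illustrative sketch (all figures invented) of why classifying a
# safety-margin reduction by its own 'severity' hides risk. In classical
# risk theory the severity belongs to the ultimate loss event (e.g. a
# mid-air collision) and the likelihood chain runs hazard -> mishap.

p_hazard = 1e-4        # P(slight safety-margin reduction), per flight hour
p_propagation = 1e-3   # P(hazard state propagates to a mid-air collision)
loss_catastrophic = 1e9   # notional cost of the catastrophic mishap
loss_minor = 1e3          # notional cost if we (wrongly) label it 'Minor'

# Classical risk: severity of the loss event x total likelihood of loss.
classical_risk = loss_catastrophic * (p_hazard * p_propagation)

# Handbook-style: the 'Minor' label caps the severity, so the propagation
# to catastrophe silently drops out of the assessment altogether.
hybrid_risk = loss_minor * p_hazard

print(classical_risk, hybrid_risk)   # roughly 100 vs 0.1: orders hidden
```

Three orders of magnitude of risk vanish purely through the choice of definition, which is the incoherence complained of above.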
With much pomp and circumstance the attorney general and our top state security mandarins have rolled out the brand new threat level advisory system. Congrats to us, we are now the proud owners of a five-runged ladder of terror. There's just one small, teeny tiny, insignificant problem: it just doesn't work. Yep, that's right, as a tool for communicating it's completely devoid of meaning, useless in fact, a hopelessly vacuous piece of security theatre.
You see the levels of this scale are based on likelihood. But whoever designed the scale forgot to include over what duration they were estimating the likelihood. And without that duration it’s just a meaningless list of words.
Here's how likelihood works. Say you ask me whether it's likely to rain tomorrow, and I say 'unlikely'. Now ask me whether it will rain in the next week; well, that's a bit more likely isn't it? OK, so next you ask me whether it'll rain in the next year? Well, unless you live in Alice Springs the answer is going to be even more likely, maybe almost certain, isn't it? So you can see that the duration we're thinking of affects the likelihood we come up with, because likelihood is a cumulative measure.
Now ask me whether a terrorist attack is going to happen tomorrow? I'd probably say it was so unlikely as to be 'Not expected'. But if you asked me whether one might occur in the next year I'd say (as we're accumulating exposure) it'd be more likely, maybe even 'Probable', while if the question covered a decade of exposure I'd almost certainly say 'Certain'. So you see how a scale without a duration means absolutely nothing. In fact it's much worse than nothing, it actively causes misunderstanding: I may be thinking of threats across the next year, while you may be thinking of threats occurring in the next month. It communicates negative information.
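The arithmetic behind this is just the cumulative probability of at least one event over n independent periods; the daily probability below is invented purely for illustration.

```python
# Why a likelihood scale without a duration is meaningless: the chance of
# at least one event accumulates with exposure. Daily probability invented.

def p_at_least_one(p_per_period, n_periods):
    """P(at least one event in n independent periods)."""
    return 1.0 - (1.0 - p_per_period) ** n_periods

p_day = 0.001  # assumed daily probability of an attack (illustrative only)
print(p_at_least_one(p_day, 1))      # tomorrow: ~0.001, 'Not expected'
print(p_at_least_one(p_day, 365))    # next year: ~0.306, maybe 'Probable'
print(p_at_least_one(p_day, 3650))   # next decade: ~0.974, nearly 'Certain'
```

Same threat, same daily probability, three different rungs of the ladder, depending entirely on the unstated duration.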
And this took years of consideration according to the Attorney-General. Man, we are governed by second-raters. Puts head in hands.
Governance is a lot like being a fireman. You’re either checking smoke alarms or out attending a fire.
How to deal with those pesky high risks without even trying
One of my clients recently came to me with what seemed to be an insurmountable problem in getting his facility accepted despite the presence of an unacceptably high risk of a catastrophic accident. The regulator, not happy, likewise all those mothers with placards outside his office every morning. Most upsetting. Not a problem said I, let me introduce you to the Screwtape LLC patented cut and come again risk refactoring strategy. Please forgive me now dear reader for without further ado we must do some math.
Risk is defined as the loss times the probability of loss, or R = L x P (1), which is just the expectation of loss. Now, interestingly, if we have a set of individual risks we can add them together to get the total risk; for our facility we might say that the total risk is R_f = (R_1 + R_2 + R_3 + … + R_n). 'So what Screwtape, this will not pacify those angry mothers!' I hear you say? Ahh, now bear with me as I show you how we can hide, err I mean refactor, our unacceptable risk in plain view. Let us also posit that we have a number of systems S_1, S_2, S_3 and so on in our facility… Well, instead of looking at the total facility risk, let's go down inside our facility and look at risks at the system level. Given that the probability of each system causing an accident is (by definition) much less, why then, per system, the risk must also be less! If you don't get an acceptable risk at the system level then go down to the subsystem, or equipment, level.
The fin de coup is to present this ensemble of subsystem risks as a voluminous and comprehensive list (2), thereby convincing everyone of the earnestness of your endeavours, but omit any consideration of ensemble risk (3). Of course one should be scrupulously careful that the numbers add up, even though you don’t present them. After all there’s no point in getting caught for stealing a pence while engaged in purloining the Bank of England! For extra points we can utilise subjective measures of risk rather than numeric, thereby obfuscating the proceedings further.
Needless to say my client went away a happy man, the facility was built and the total risk of operation was hidden right there in plain sight… ah how I love the remorseless bloody hand of progress.
1. Where R = Risk, L = Loss, and P = Probability, after de Moivre. I believe Screwtape keeps de Moivre's heart in a jar on his desk. (Ed.).
2. The technical term for this is a Preliminary Hazard Analysis.
3. Screwtape omitted to note that total risk remains the same, all we’ve done is budgeted it out across an ensemble of subsystems, i.e. R_f = R_s1 + R_s2 + R_s3 (Ed.).
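To make the Editor's point in numbers, here's a toy sketch of the Screwtape refactoring, with all figures invented:

```python
# Splitting the facility probability across subsystems shrinks every line
# item, but the ensemble risk is unchanged. All figures invented.

loss = 1_000_000       # loss from the catastrophic accident, in dollars
p_facility = 1e-3      # annual probability of that accident, facility-wide

facility_risk = loss * p_facility             # R_f = L x P

# The 'refactor': budget the same probability across ten subsystems, then
# present only the per-subsystem risks in the voluminous list.
subsystem_ps = [p_facility / 10] * 10
subsystem_risks = [loss * p for p in subsystem_ps]

print(facility_risk)           # the number the mothers should see
print(max(subsystem_risks))    # the biggest number anyone is shown
print(sum(subsystem_risks))    # equals the facility risk, quietly omitted
```

Each line item is a tenth of the facility risk and so slips under the acceptance threshold, while the total never went anywhere.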
Deconstructing a tail strike incident
On August 1 last year, a Qantas 737-838 (VH-VZR) suffered a tail-strike while taking off from Sydney airport, and this week the ATSB released its report on the incident. The ATSB narrative is essentially that when working out the plane's Takeoff Weight (TOW) on a notepad, the captain forgot to carry the '1', which resulted in an erroneous weight of 66,400kg rather than 76,400kg. Subsequently the co-pilot made a transposition error when carrying out the same calculation on the Qantas iPad-resident on-board performance tool (OPT), in this case transposing 6 for 7 in the fuel weight, thereby also entering 66,400kg into the OPT. A cross-check of the OPT-calculated Vref40 speed against that calculated by the FMC (which uses the aircraft Zero Fuel Weight (ZFW) input rather than TOW to calculate Vref40) would have picked the error up, but the crew misinterpreted the check and so it was not performed correctly. Continue Reading…
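To see why the Vref40 cross-check is such an effective defence, here's a toy sketch. Only the 66,400kg and 76,400kg figures come from the report narrative; the performance relation, tolerance, and function name are invented stand-ins.

```python
# Why the Vref40 cross-check works: the OPT takes the crew-entered TOW
# while the FMC derives its weight from ZFW, an independent input, so a
# slip in either path surfaces as a disagreement between the two speeds.
# The linear 'performance lookup' below is an invented stand-in.

def vref40(tow_kg):
    """Toy stand-in for the real performance lookup (assumed monotonic)."""
    return 100 + tow_kg / 2000.0   # knots, invented relation

correct_tow = 76_400               # the actual takeoff weight, kg
entered_tow = 66_400               # the mis-carried / transposed figure

opt_speed = vref40(entered_tow)    # from the erroneous crew entry
fmc_speed = vref40(correct_tow)    # from the ZFW-derived weight

# A properly performed cross-check flags any disagreement beyond tolerance.
mismatch = abs(opt_speed - fmc_speed) > 1.0
print(opt_speed, fmc_speed, mismatch)   # the speeds disagree, flagging it
```

The defence only works because the two calculations start from independent inputs; both pilots starting from the same erroneous TOW is exactly the common-cause failure the ZFW path is there to break.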
Why probability is not corroboration
The IEC's 61508 standard on functional safety assigns a series of Safety Integrity Levels (SIL) that correlate to the achievement of specific hazardous failure rates. Unfortunately this definition, which ties SILs to a probabilistic metric of failure, contains a fatal flaw.
System safety course, now with more case studies and software safety!
Have just added a couple of case studies and some course notes of software hazards and integrity partitioning, because hey I know you guys love that sort of stuff 🙂
I have finally got around to putting my safety course notes up, enjoy. You can also find them off the main menu.
Feel free to read and use under the terms of the associated creative commons license. I’d note that these are course notes so I use a large amount of example material from other sources (because hey, a good example is a good example right?) and where I have a source these are acknowledged in the notes. If you think I’ve missed a citation or made an error, then let me know.
To err is human, but to really screw it up takes a team of humans and computers…
How did a state of the art cruiser operated by one of the world's superpowers end up shooting down an innocent passenger aircraft? To answer that question (at least in part) here's a case study, part of the system safety course I teach, that looks at some of the causal factors in the incident.
In the immediate aftermath of this disaster there was a lot of reflection, and work done, on how humans and complex systems interact. However one question that has so far gone unasked is simply this. What if the crew of the USS Vincennes had just used the combat system as it was intended? What would have happened if they’d implemented a doctrinal ruleset that reflected the rules of engagement that they were operating under and simply let the system do its job? After all it was not the software that confused altitude with range on the display, or misused the IFF system, or was confused by track IDs being recycled… no, that was the crew.
Consider the effect that the choice of a single word can have upon the success or failure of a standard. The standard is DO-278A, and the word is ‘approve’. DO-278 is the ground world's version of the aviation community's DO-178 software assurance standard, intended to bring the same level of rigour to the software used for navigation and air traffic management. There's just one tiny difference: while DO-178 uses the word ‘certify’, DO-278 uses the word ‘approve’, and in that one word lies a vast difference in the effectiveness of these two standards.
DO-178C has traditionally been applied in the context of an independent certifier (such as the FAA or JAA) who does just that: certifies that the standard has been applied appropriately and that the design produced meets it. Certification is independent of the supplier/customer relationship, which has a number of clear advantages. First, the certifying body is indifferent as to whether the applicant meets the requirements of DO-178C, so it has greater credibility when certifying, being much less likely to suffer from any conflict of interest. Second, because there is one certifying agency there is consistent interpretation of the standard, and corporate knowledge is fostered and disseminated across the industry through advice from the regulator.
Turning to DO-278A we find that the term ‘approve’ has mysteriously (1) replaced the term ‘certify’. So who, you may ask, can approve? In fact, what does approve even mean? Well, long answer short, anyone can approve and it means whatever you make of it. What usually happens is that the standard is invoked as part of a contract between supplier and customer, with the customer then acting as the ‘approver’ of the standard's application. This has obvious and significant implications for the degree of trust that we can place in the approval given by the customer organisation. Unlike an independent certifying agency, the customer clearly has a corporate interest in acquiring the system, which may well conflict with the objective of fully complying with the requirements of the standard. Given that ‘approval’ is granted on a contract basis between two organisations, and often cloaked in non-disclosure agreements, there is also little to no opportunity for the dissemination of useful learnings as to how to meet the standard. Finally, when dealing with previously developed software the question becomes not just ‘did you apply the standard?’, but also ‘who was it that actually approved your application?’ and ‘how did they actually interpret the standard?’.
So what to do about it? To my mind the unstated success factor for the original DO-178 standard was in fact the regulatory environment in which it was used. If you want DO-278A to be more than just a paper tiger then you should also put in place a mechanism for independent certification. In these days of smaller government this is unlikely to involve a government regulator, but there's no reason why (for example) the independent safety assessor concept embodied in IEC 61508 could not be applied, with appropriate checks and balances (2). Until that happens though, don't set too much store by pronouncements of compliance with DO-278A.
Final thought, I’m currently renovating our house and have had to employ an independent certifier to sign off on critical parts of the works. Now if I have to do that for a home renovation, I don’t see why some national ANSP shouldn’t have to do it for their bright and shiny toys.
1. Perhaps Screwtape consultants were advising the committee. 🙂
2. One of the problems with how 61508 implements the ISA is that they're still paid by the customer, which leads in turn to the agency problem. A better scheme would be an industry fund into which all players contribute and from which the ISA is paid.