The point of an investigation is not to find where people went wrong; it is to understand why their assessments and actions made sense at the time.
…for my boat is so small and the ocean so huge
For a small, close-knit community like the submarine service, the loss of a boat and its crew can strike doubly hard. The USN’s response to this disaster was both effective and long lasting, which is doubly impressive given that it was implemented at the height of the Cold War. As part of the course that I teach on system safety I use the Thresher as an important case study in organisational failure and recovery.
The RAN’s Collins class Subsafe program derived its strategic principles in large measure from the USN’s original program. The successful recovery of HMAS Dechaineux from a flooding incident at depth illustrates the success of both the RAN’s Subsafe program and its antecedent.
Here’s a copy of the presentation that I gave at ASSC 2015 on how to use MIL-STD-882C to demonstrate compliance with the WHS Act 2011. The Model Australian Workplace Health and Safety (WHS) Act places new and quite onerous requirements upon manufacturers, suppliers and end-user organisations. These include the requirement to demonstrate due diligence in the discharge of individual and corporate responsibilities. Traditionally contracts have steered clear of invoking WHS legislation in anything other than the most abstract form. Unfortunately such traditional approaches provide little evidence with which to demonstrate compliance with the WHS Act.
The presentation describes an approach to establishing compliance with the WHS Act (2011) using the combination of a contracted MIL-STD-882C system safety program and a compliance finding methodology. The advantages and effectiveness of this approach, in terms of establishing compliance with the Act and the effective discharge of the responsibilities of both supplier and acquirer, are illustrated using a case study of a major aircraft modification program. Limitations of the approach are then discussed, given the significant difference between the decision-making criteria of classic system safety and the so far as is reasonably practicable principle.
The law of unintended consequences
There are some significant consequences to the principle of reasonable practicability enshrined within the Australian WHS Act. The Act is particularly problematic for risk-based software assurance standards, where risk is used to determine the degree of effort that should be applied. In part one of this three-part post I’ll be discussing the implications of the Act for the process industries’ functional safety standard IEC 61508. In the second part I’ll look at aerospace and its software assurance standard DO-178C, then finally I’ll try and piece together a software assurance strategy that is compliant with the Act. Continue Reading…
Here’s a link to version 1.3 of System Safety Fundamentals, part of the course I teach at UNSW. I’ll be putting the rest of the course material up over the next couple of months. Enjoy :)
It is a common requirement to either load or update applications over the air after a distributed system has been deployed. For embedded systems that are mass market this is in fact a fundamental necessity. Of course once you do have an ability to load remotely there’s a back door that you have to be concerned about, and if the software is part of a vehicle’s control system or an insulin pump controller the consequences of leaving that door unsecured can be dire. To do this securely requires us to tackle the insecurities of the communications protocol head on.
One strategy is to insert a protocol ‘security layer’ between the stack and the application. The security layer then mediates between the application and the stack to enforce the system’s overall security policy. For example the layer could confirm:
- that the software update originated from an authenticated source,
- that the update had not been modified,
- that the update itself had been authorised, and
- that the resources required by the downloaded software conform to any onboard safety or security policy.
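As a rough illustration of the four checks above, here is a minimal Python sketch of what such a layer might do before handing an update to the application. Everything in it is an assumption made for the example rather than a real protocol: the `UpdatePackage` fields, the `check_update` function, the approval list and the RAM limit are all hypothetical, and a shared-key HMAC stands in for whatever signature scheme a fielded system would actually use (most likely an asymmetric one).

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass
class UpdatePackage:
    """Hypothetical update container, invented for this sketch."""
    image: bytes       # the software image being loaded
    digest: str        # SHA-256 hex digest claimed for the image
    tag: bytes         # HMAC-SHA256 of the image under a shared key
    approval_id: str   # identifier of the release approval
    ram_kb: int        # RAM the new software declares it needs


def check_update(pkg, shared_key, approved_ids, max_ram_kb):
    """Return the list of failed checks; an empty list means 'load it'."""
    failures = []

    # 1. Authenticated source: only a holder of the key can produce the tag.
    expected_tag = hmac.new(shared_key, pkg.image, hashlib.sha256).digest()
    if not hmac.compare_digest(expected_tag, pkg.tag):
        failures.append("source not authenticated")

    # 2. Not modified: the image must still hash to the claimed digest.
    if hashlib.sha256(pkg.image).hexdigest() != pkg.digest:
        failures.append("image modified")

    # 3. Authorised: the approval must be on the onboard list.
    if pkg.approval_id not in approved_ids:
        failures.append("update not authorised")

    # 4. Resource policy: declared needs must fit the onboard limit.
    if pkg.ram_kb > max_ram_kb:
        failures.append("resource policy violated")

    return failures
```

The point of the design is that the application never sees the raw stack, it only ever receives packages for which `check_update` returned an empty list. Note that the HMAC check already implies integrity, so the separate digest is kept here only to mirror the list item by item.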
There are also obvious economy of mechanism advantages when dealing with protocols like the TCP/IP monster. Who after all wants to mess around with the entirety of the TCP/IP stack, given that Richard Stevens took three volumes to describe the damn thing? Similarly who wants to go through the entire process again when going from IPv4 to IPv6? :)
There’s an interesting documentary on SBS about the Germanwings tragedy. If you want a deeper insight, see my post on the dirty little secret of civilian aviation. By the way, the two person rule only works if both those people are alive.
Ladies and gentlemen you need to leave, like leave your luggage!
This has been another moment of aircraft evacuation Zen.
Or how I learned to stop worrying about trifles and love the Act
One of the Achilles heels of the current Australian WH&S legislation is that it provides no clear point at which you should stop caring about potential harm. While there are reasons for this, it does mean that we can end up with some theatre of the absurd moments where someone seriously proposes paper cuts as a risk of concern.
The traditional response to such claims of risk is to point out that actually the law rarely concerns itself with such trifles. Or more pragmatically, as you are highly unlikely to be prosecuted over a paper cut it’s not worth worrying about. Continue Reading…
The bond between a man and his profession is similar to that which ties him to his country; it is just as complex, often ambivalent, and in general it is understood completely only when it is broken: by exile or emigration in the case of one’s country, by retirement in the case of a trade or profession.
Defence in depth
One of the oft stated mantras of both system safety and cyber-security is that a defence in depth is required if you’re really serious about either topic. But what does that even mean? How deep? And depth of what exactly? Jello? Cacti? While such a statement has a reassuring gravitas, in practice it’s void of meaning unless you can point to an exemplar design and say there, that is what a defence in depth looks like. Continue Reading…
Paying down the debt
A great term that I’ve just come across, technical debt is a metaphor coined by Ward Cunningham to reflect on how a decision to act expediently for an immediate reason may have longer term consequences. This is a classic problem during design and development where we have to balance various ‘quality’ factors against cost and schedule. The point of the metaphor is that this debt doesn’t go away, the interest on that sloppy or expedient design solution keeps on getting paid every time you make a change and find that it’s harder than it should be. Turning around and ‘fixing’ the design in effect pays back the principal that you originally incurred. Failing to pay off the principal? Well such tales can end darkly. Continue Reading…
We don’t know what we don’t know
The Tacoma Narrows bridge stands, or rather falls, as a classic example of what happens when we run up against the limits of our knowledge. The failure of the bridge due to a then unknown torsional aeroelastic flutter mode, to which the bridge with its high span to width ratio was particularly vulnerable, is almost a textbook example of ontological risk. Continue Reading…
An uneasy truth about the Challenger disaster
The story of Challenger in the public imagination could be summed up as “‘heroic’ engineers versus ‘wicked’ managers”, which is a powerful myth but unfortunately just a myth. In reality? Well the reality is more complex, and the causes of the decision to launch rest in part upon the failure of the engineers participating in the launch decision to clearly communicate the risks involved. Yes that’s right, the engineers screwed up in the first instance. Continue Reading…
Risk managers are the historians of futures that may never be.
I’ve rewritten my post on epistemic, aleatory and ontological risk pretty much completely, enjoy.
Qui enim teneat causas rerum futurarum, idem necesse est omnia teneat quae futura sint. Quod cum nem…
[Roughly, He who knows the causes will understand the future, except no-one but god possesses such faculty]
Why this bit of wreckage is unlikely to affect the outcome of the MH370 search
If this really is a flaperon from MH370 then it’s good news in a way, because we could use wind and current data for the Indian Ocean to determine where it might have gone into the water. That in turn could be used to update a probability map of where we think that MH370 went down, by adjusting our priors in the Bayesian search strategy, thereby ensuring that all the information we have is fruitfully integrated into our search strategy.
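To make the idea of adjusting the priors concrete, here is a toy Python sketch of the update step. The four-cell map and the drift likelihoods are invented purely for illustration; in a real search the likelihoods would come out of a drift model driven by the wind and current data, and the map would have many thousands of cells.

```python
def bayes_update(prior, likelihood):
    """Posterior probability per cell, given the prior and
    P(evidence | wreck is in that cell) for each cell."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("evidence is impossible under every cell")
    return [u / total for u in unnormalised]


# Hypothetical four-cell search map, all cells equally likely beforehand.
prior = [0.25, 0.25, 0.25, 0.25]

# Hypothetical chance that debris released in each cell would wash up
# where the flaperon was found (illustrative numbers only).
drift_likelihood = [0.1, 0.2, 0.4, 0.1]

posterior = bayes_update(prior, drift_likelihood)
```

With these made-up numbers the third cell ends up holding half the probability mass, which is all the update does: it shifts weight toward the places from which the debris could most plausibly have drifted.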
Well… perhaps it could, if the ATSB were actually applying a Bayesian search strategy, but apparently they’re not. So the ATSB is unlikely to get the most out of this piece of evidence and the only real upside that I see to this is that it should shut down most of the conspiracy nut jobs who reckoned MH370 had been spirited away to North Korea or some such. :)
We must contemplate some extremely unpleasant possibilities, just because we want to avoid them.
As quoted in ‘The New Nuclear Age’. The Economist, 6 March 2015
There are no facts, only interpretations…
More woes for OPM, and pause for thought for the proponents of centralized government data stores. If you build it they will come…and steal it.
Hannibal ante portas!
A recent article in Wired discloses how hospital drug pumps can be hacked and the firmware controlling them modified at will. Although in theory the comms module and motherboard should be separated by an air gap, in practice there’s a serial link cunningly installed to allow firmware to be updated via the interwebz.
As the Romans found, once you’ve built a road that a legion can march down it’s entirely possible for Hannibal and his elephants to march right up it. Thus proving once again, if proof be needed, that there’s nothing really new under the sun. In a similar vein we probably won’t see any real reform in this area until someone is actually killed or injured.
This has been another Internet of Things moment of zen.
A tale of another two reactors
There’s been much debate over the years as to whether various tolerance of risk approaches actually satisfy the legal principle of reasonable practicability. But there hasn’t to my mind been much consideration of the value of simply adopting the legalistic approach in situations when we have a high degree of uncertainty regarding the likelihood of adverse events. In such circumstances basing our decisions upon what can turn out to be very unreliable estimates of risk can have extremely unfortunate consequences. Continue Reading…
The current Workplace Health and Safety (WHS) legislation of Australia formalises the common law principle of reasonable practicability in regard to the elimination or minimisation of risks associated with industrial hazards. Having had the advantage of going through this with a couple of clients the above flowchart is my interpretation of what reasonable practicability looks like as a process, annotated with cross references to the legislation and guidance material. What’s most interesting is that the process is determinedly not about tolerance of risk but instead firmly focused on what can reasonably and practicably be done. Continue Reading…
Safety versus security
There is a certain school of thought that views safety and security as essentially synonymous, and therefore that the principles of safety engineering are directly applicable to that of security, and vice versa. You might caricature this belief as the management idea that all one needs to do to generate a security plan is to take an existing safety plan and replace ‘safety’ with ‘security’ or ‘hazard’ with ‘threat’. A caricature yes, but one that’s not that much removed from reality :)
If you’re interested in observation selection effects, Nick Bostrom’s classic on the subject is (I now find out) available online here. A classic example of this is Wald’s work on aircraft survivability in WWII. A naive observer would seek to protect those parts of the returning aircraft that were most damaged; however, Wald’s insight was that these were in fact the least critical areas of the aircraft, and that the areas not damaged should actually be the ones that were reinforced.
Just attended the Australian System Safety Conference; the venue was the Customs House right on the river. Lots of excellent speakers and interesting papers, I enjoyed Drew Rae’s on tribalism in system safety particularly. The keynotes on resilience by John Bergstrom and cyber-security by Chris Johnson were also very good. I gave a presentation on the use of MIL-STD-882 as a tool for demonstrating compliance with the WHS Act, a subject that only a mother could love. Favourite moment? Watching the attendees’ faces when I told them that 61508 didn’t comply with the law. :)
Thanks again to Kate Thomson and John Davies for reviewing the legal aspects of my paper. Much appreciated guys.
The best defence of a secure system is openness
Ensuring the security of high consequence systems rests fundamentally upon the organisation that sustains that system. Thus organisational dysfunction can and does manifest itself as an inability to deal with security in an effective fashion. To that end the ‘shoot the messenger’ approach of the NSW Electoral Commission to reports of security flaws in the iVote electronic voting system does not bode well for that organisation’s ability to deal with such challenges. Continue Reading…
The Electronic Frontier Foundation reports that a flaw in the iVote system developed by the NSW Electoral Commission meant that up to 66,000 online votes were vulnerable to online attack. University of Michigan computer science professor J. Alex Halderman and University of Melbourne research fellow Vanessa Teague, who had previously predicted problems, found a weakness that would have allowed an untraceable man-in-the-middle attack. The untraceable nature of that attack is important and we’ll get back to it. Continue Reading…
How to make rocket landings a bit
No one should underestimate how difficult landing a booster rocket is, let alone onto a robot barge that’s sitting in the ocean. The booster has to decelerate to a landing speed on a hatful of fuel, then maintain a fixed orientation to the deck while it descends, all the while counteracting the dynamic effects of a tall thin flexible airframe, fuel slosh, c of g changes, wind and finally landing gear bounce when you do hit. It’s enough to make an autopilot cry. Continue Reading…
Once again my hometown of Newcastle is being battered by fierce winds and storms in yet another ‘storm of the century’. The scene above is just around the corner from my place in the inner city suburb of Cooks Hill. We’re now into our second day of category two cyclonic winds and rain, with many parts of the city flooded and without power. Dungog, a small town to the north of us, is cut off and several houses have been swept off their piers there; three deaths are reported. My 8 minute walk to work this morning was an adventure to say the least.
A short tutorial on the architectural principles of integrity level partitioning. I wrote this a while ago, but the fundamentals remain the same. Partitioning is a very powerful design technique, but if you apply it you also need to be aware that it can interact with all sorts of other system design attributes, like scheduling and fault tolerance to name but two.
The material is drawn from many different sources, which unfortunately at the time I didn’t reference, so all I can do is offer a general acknowledgement here. You can also find a more permanent link to the tutorial on my publications page.
The GAO has released its latest audit report on the FAA’s NextGen Air Traffic Management system. The report updates the GAO’s original report and, when read in conjunction with the original, gives an excellent insight into how difficult cybersecurity can be across a national infrastructure program, like really, really difficult. At least they’re not trying to integrate military and civilian airspaces at the same time :)
My analogy is that on the cyber security front we’re effectively asking the FAA to hold a boulder over its head for the next five years or so without dropping it. And if security isn’t built into the DNA of NextGen? Well I leave it to you, dear reader, to ponder the implications of that in this ever more connected world of ours.
In celebration of upgrading the site to WP Premium here’s some gratuitous eye candy :)
A little more seriously, PIO is one of those problems that, contrary to what the name might imply, requires one to treat the aircraft and pilot as a single control system.
The problem of people
The Hal effect, named after the eponymous anti-hero of Stanley Kubrick and Arthur C. Clarke’s film 2001, is the tendency for designers to implicitly embed their cultural biases into automation. While such biases are undoubtedly a very uncertain guide it might also be worthwhile to look at the 2001 Odyssey mission from Hal’s perspective for a moment. Continue Reading…
Amidst all the online soul searching, and pontificating about how to deal with the problem of suicide by airliner, which is let’s face it still a very, very small risk, what you are unlikely to find is any real consideration of how we have arrived at this pass. There is as it turns out a simple one word answer, and that word is efficiency, the dirty little secret of the aviation industry. Continue Reading…
Quality must be considered as embracing all factors which contribute to reliable and safe operation. What is needed is an atmosphere, a subtle attitude, an uncompromising insistence on excellence, as well as a healthy pessimism in technical matters, a pessimism which offsets the normal human tendency to expect that everything will come out right and that no accident can be foreseen — and forestalled — before it happens
Bruce Schneier has a new book out on the battle underway for the soul of the surveillance society, why privacy is important and a few modest proposals on how to prevent us inadvertently selling our metadata birthright. You can find a description, reviews and more on the book’s website here. It’s currently sitting at number six on the NYT’s non-fiction book list. Recommended.
New Scientist has posted an online review of Bruce’s book here.
Or how to avoid the secret police reading your mail
Yaay! Our glorious government of Oceania has just passed the Data Retention Act 2015 with the support of the oh so loyal opposition. The dynamics of this is that both parties believe that ‘security’ is what’s called here in Oceania a ‘wedge’ issue, so they strive to outdo each other in pandering to the demands of our erstwhile secret police, lest the other side gain political capital from taking a tougher position. It’s the political example of an evolutionary arms race, with each cycle of legislation becoming more and more extreme.
As a result telcos here are required to keep your metadata for two years so that the secret police can paw through the electronic equivalent of your rubbish bin any time they choose. For those who go ‘metadata huh?’, metadata is all the add on information that goes with your communications via the interwebz, like where your email went, and where you were when you made a call at 1.33 am to your mother. So just like your rubbish bin it can tell the secret police an awful lot about you, especially when you knit it up with other information. Continue Reading…
The Germanwings A320 crash
At this stage there’s not much more that can be said about the particulars of this tragedy, which has claimed 150 lives in a mountainous corner of France. Disturbingly, again we have an A320 aircraft descending rapidly and apparently out of control, without the crew having any time to issue a distress call. Yet more disturbing is the thought that the crash might be due to the crew failing to carry out the workaround for two blocked AoA probes promulgated in this Emergency Airworthiness Directive (EAD) that was issued in December of last year. And, as the final and rather unpleasant icing on this particular cake, there is the followup question as to whether the problem covered by the directive might also have been a causal factor in the AirAsia flight 8501 crash. That, if it be the case, would be very, very nasty indeed.
Unfortunately at this stage the answer to all of the above questions is that no one knows, especially as the Indonesian investigators have declined to issue any further information on the causes of the AirAsia crash. However what we can be sure of is that, given the highly dependable nature of aircraft systems, the answer when it comes will comprise an apparently unlikely combination of events, actions and circumstance, because that is the nature of accidents that occur in high dependability systems. One thing that’s also for sure, there’ll be little sleep in Toulouse until the FDRs are recovered, and maybe not much after that….
If, having read the EAD, you’re left wondering why it directed that two ADRs be turned off, it’s simply that by doing so you push the aircraft out of what’s called Normal law, where Alpha protection is trying to drive the nose down, into Alternate law, where the (erroneous) Alpha protection is removed. Of course in order to do so you need to be able to recognise, diagnose and apply the correct action, which also generally requires training.
The more things change, the more they stay the same…
The Saturn second stage was built by North American Aviation at its plant at Seal Beach, California, shipped to NASA’s Marshall Space Flight Center, Huntsville, Alabama, and there tested to ensure that it met contract specifications. Problems developed on this piece of the Saturn effort and Wernher von Braun began intensive investigations. Essentially his engineers completely disassembled and examined every part of every stage delivered by North American to ensure no defects. This was an enormously expensive and time-consuming process, grinding the stage’s production schedule almost to a standstill and jeopardizing the Presidential timetable.
When this happened Webb told von Braun to desist, adding that “We’ve got to trust American industry.” The issue came to a showdown at a meeting where the Marshall rocket team was asked to explain its extreme measures. While doing so, one of the engineers produced a rag and told Webb that “this is what we find in this stuff.” The contractors, the Marshall engineers believed, required extensive oversight to ensure they produced the highest quality work.
And if Marshall hadn’t been so persnickety about quality? Well have a look at the post Apollo 1 fire accident investigation for the results of sacrificing quality (and safety) on the altar of schedule.
Apollo: A Retrospective Analysis, Roger D. Launius, July 1994, quoted in “This Is What We Find In This Stuff: A Designer Engineer’s View”, Presentation, Rich Katz, Grunt Engineer NASA Office of Logic Design, FY2005 Software/Complex Electronic Hardware Standardization Conference, Norfolk, Virginia July 26-28, 2005.
Bayes and the search for MH370
We are now approximately 60% of the way through searching the MH370 search area, and so far nothing. Which is unfortunate because as the search goes on the cost continues to go up for the taxpayer (and yes I am one of those). What’s more unfortunate, and not a little annoying, is that through all this the ATSB continues to stonily ignore the use of a powerful search technique that’s been used to find everything from lost nuclear submarines to the wreckage of passenger aircraft. Continue Reading…
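For readers who haven’t met the technique, the core of a Bayesian search is simply that every empty-handed sweep still tells you something. Below is a minimal Python sketch of that bookkeeping, under the usual textbook assumptions: the search area is cut into cells, each cell has a prior probability of holding the wreck, and a sweep of a cell has some probability `p_detect` of finding the wreck if it is there. The function names and the three-cell numbers are mine, invented for illustration, and have nothing to do with the actual MH370 search data.

```python
def update_after_empty_search(prior, cell, p_detect):
    """Revise the probability map after sweeping `cell` and finding nothing."""
    # Probability of coming up empty: 1 - P(wreck in cell) * P(detect).
    miss = 1.0 - prior[cell] * p_detect
    if miss <= 0.0:
        raise ValueError("an empty search was impossible under this prior")
    # Every unsearched cell gains weight...
    posterior = [p / miss for p in prior]
    # ...while the searched cell keeps only its 'we missed it' share.
    posterior[cell] = prior[cell] * (1.0 - p_detect) / miss
    return posterior


def best_cell_to_search(prior, p_detect):
    """Pick the cell with the highest chance of a find on the next sweep."""
    return max(range(len(prior)), key=lambda i: prior[i] * p_detect)


# Hypothetical three-cell map and sensor performance.
prior = [0.6, 0.3, 0.1]
first = best_cell_to_search(prior, 0.8)
posterior = update_after_empty_search(prior, first, 0.8)
second = best_cell_to_search(posterior, 0.8)
```

In this toy case the first sweep goes to the most likely cell, finds nothing, and the map then points at a different cell for the second sweep, while the first cell keeps a non-zero probability because the sensor could have missed.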
Here’s an interesting graph that compares Class A mishap rates for USN manned aviation (pretty much from float plane to Super-Hornet) against the USAF’s drone programs. Interesting that both programs steadily track down decade by decade, even in the absence of formal system safety programs for most of the time (1).
The USAF drone program started out at a rate of around 60 mishaps per 100,000 flight hours (equivalent to the USN transitioning to fast jets at the close of the 1940s) and maintains a steeper rate of decrease than the USN aviation program. As a result, while the USAF drone program is tail chasing the USN it still looks like it’ll hit parity with the USN sometime in the 2040s.
So why is the USAF drone program doing better in pulling down the accident rate, even when they don’t have a formal MIL-STD-882 safety program?
Well for one, a higher degree of automation does have comparative advantages. Although the USN’s carrier aircraft can do auto-land, they generally choose not to, as pilots need to keep their professional skills up, and human error during landing/takeoff inevitably drives the mishap rate up. Therefore a simple thing like implementing an auto-land function for drones (landing a drone is as it turns out not easy) has a comparatively greater bang for your safety buck. There are also inherently higher risks of loss of control and mid air collision when air combat manoeuvring, or of running into things when flying helicopters at low level, which are operational hazards that drones generally don’t have to worry about.
For another, the development cycle for drones tends to be quicker than for manned aviation, and drones have a ‘somewhat’ looser certification regime, so improvements from the next generation of drone design tend to roll into an expanding operational fleet more quickly. Having a higher cycle rate also helps retain and sustain the corporate memory of the design teams.
Finally there’s the lessons learned effect. With drones the hazards usually don’t need to be identified and then characterised. In contrast with the early days of jet age naval aviation, the hazards drones face are usually well understood, with well understood solutions, and whether these are addressed effectively has more to do with programmatic cost concerns than a lack of understanding. Conversely when it actually comes time to do something like put de-icing onto a drone, there’s a whole lot of experience that can be brought to bear with a very good chance of first time success.
A final question. Looking at the above do we think that the application of rigorous ‘FAA like’ processes or standards like ARP 4761, ARP 4754 and DO-178 would really improve matters?
Hmmm… maybe not a lot.
1. As a historical note while the F-14 program had the first USN aircraft system safety program (it was a small scale contractor in house effort) it was actually the F/A-18 which had the first customer mandated and funded system safety program per MIL-STD-882. USAF drone programs have not had formal system safety programs, as far as I’m aware.