Archives For Safety

The practice of safety engineering in various high consequence industries.

Spruiking zero harm or crusading safety ‘because you care’ raises as much suspicion as having a folder on your computer named ‘DEFINITELY NOT PORN’ – would you get on a plane that had “ZERO CRASH” emblazoned all over it?

David Collins

…it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.

Hans Moravec

Uber’s safety management woes

We shouldn’t be killing people in our haste to get to a safe future

Dr Phil Koopman (on driverless cars)

Here’s a view from inside Tesla by one of its former employees. Taking the report at face value, which is of course an arguable proposition, you can see how technical debt can build up to a point where it’s near impossible to pay it down. That in turn can have significant effects on the safety performance of the organisation; see the Toyota spaghetti code case as another example. The take-home is that for any software safety effort it’s a good idea to see whether the company/team is measuring technical debt in a meaningful fashion and actively retiring it, for example by alternating capability and maintenance updates.

Tesla and technical debt.

https://mobile.twitter.com/atomicthumbs/status/1032939617404645376

It is highly questionable whether total system safety is always enhanced by allocating functions to automatic devices rather than human operators, and there is some reason to believe that flight-deck automation may have already passed its optimum point.

Earl Wiener (1980)

If you want to know where Crew Resource Management as a discipline started, then you need to read NASA Technical Memorandum 78482 or “A Simulator Study of the Interaction of Pilot Workload With Errors, Vigilance, and Decisions” by H.P. Ruffel Smith, the British-born physician and pilot. Before this study it was hours in the seat and line seniority that mattered when things went to hell. After it the aviation industry started to realise that crews rose or fell on the basis of how well they worked together, and that a good captain got the best out of his team. Today whether crews get it right, as they did on QF72, or terribly wrong, as they did on AF447, the lens that we view their performance through has been irrevocably shaped by the work of Ruffel Smith. From little seeds great oaks grow indeed.

AI winter

Update to the MH-370 hidden lesson post just published, in which I go into a little more detail on what I think could be done to prevent another such tragedy.

Piece of wing found on La Réunion Island: could it be a flap from #MH370? (Image credit: Réunion 1ère)

The search for MH370 will end next Tuesday with the question of its fate no closer to resolution. There is perhaps one lesson that we can glean from this mystery, and that is that when we have a two man crew behind a terrorist-proof door there is a real possibility that disaster is check-riding the flight. As Kenedi et al. note in a 2016 study, five of the six recorded murder-suicide events by pilots of commercial airliners occurred after they were left alone in the cockpit; in the case of both Germanwings 9525 and LAM 470 this was enabled by one of the crew being able to lock the other out of the cockpit. So while we don’t know exactly what happened onboard MH370 we do know that the aircraft was flown deliberately to some point in the Indian Ocean, and on the balance of probabilities that was done by one of the crew with the other crew member unable to intervene, probably because they were dead.

As I’ve written before, the combination of small crew sizes to reduce costs and a secure cockpit to reduce hijacking risk increases the probability of one crew member being able to successfully disable the other and then do exactly whatever they like. Thus the increased hijacking security measures act as a perverse incentive for pilot murder-suicides, and may over the long run turn out to kill more people than the risk of terrorism (1). Or to put it more brutally, murder and suicide are much more likely to be successful with small crew sizes, so these scenarios, however dark they may be, need to be guarded against in an effective fashion (2).

One way to guard against such common mode failures of the human is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to our affect driven processing, with a sufficient grasp of the theory of mind and the subtleties of human psychology and group dynamics to be able to make usefully accurate predictions of what the crew will do next. With that insight goes the requirement for autonomy in vetoing illogical and patently hazardous crew actions, e.g. “I’m sorry Captain but I’m afraid I can’t let you reduce the cabin air pressure to hazardous levels”. The really difficult problem is of course building something sophisticated enough to understand ‘hinky’ behaviour and then intervene. There are however other scenarios where some form of lesser AI would be of use. The Helios Airways depressurisation is a good example of an incident where both flight crew were rendered incapacitated, so a system that does the equivalent of “Dave! Dave! We’re depressurising, unless you intervene in 5 seconds I’m descending!” would be useful. Then there’s the good old scenario of both pilots falling asleep, as likely happened at Minneapolis, so something like “Hello Dave, I can’t help but notice that your breathing indicates that you and Frank are both asleep, so WAKE UP!” would be helpful here. Oh, and someone to punch out a quick “Mayday” while the pilots are otherwise engaged would also help tremendously, as aircraft going down without a single squawk recurs again and again and again.

I guess I’ve slowly come to the conclusion that two man crews, while optimised for cost, are distinctly sub-optimal when it comes to dealing with a number of human factors issues, and likewise sub-optimal when it comes to dealing with major ‘left field’ emergencies that aren’t in the QRH. Fundamentally a dual redundant design pattern for people doesn’t really address the likelihood of what we might call common mode failures. While we probably can’t get another human crew member back in the cockpit, working to make the cockpit automation more collaborative and less ‘strong but silent’ would be a good start. And of course if the aviation industry wants to keep making improvements in aviation safety then these are the sort of issues they’re going to have to tackle. Where is a good AI, or even an uninterruptible autopilot, when you really need one?

Notes

1. Kenedi (2016) found that from 1999 to 2015 there were 18 cases of homicide-suicide involving 732 deaths.

2. No go alone rules are unfortunately only partially effective.

References

Kenedi, C., Friedman, S.H., Watson, D., Preitner, C., Suicide and Murder-Suicide Involving Aircraft, Aerospace Medicine and Human Performance, Aerospace Medical Association, 2016.

People must retain control of autonomous vehicles

So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?

In slightly longer form. If (for example) air data is so unreliable that your automation needs to automatically drop out of its primary mode, and your QRH procedure is then to manually fly pitch and thrust (1), then why not also automatically present a display page that only provides the data that pilots can trust and is needed to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’ where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software it’s over to you brave aviator” (3). This response is however something of a cop-out, as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:

  1. Redesign the attitude display with articulated pitch ladders, or a Malcolm’s horizon, to improve situational awareness.
  2. Provide a fallback AoA source using an AoA estimator (see the sketch after this list).
  3. Provide actual direct access to flight data parameters such as mach number and AoA to support troubleshooting (5).
  4. Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
  5. Use non-Aristotelian logic to better model the trustworthiness of air data.
  6. Provide the current master/slave hierarchy status amongst voting channels to aircrew.
  7. Provide an obvious and intuitive way to remove a faulted channel allowing flight under reversionary laws (7).
  8. Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).
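By way of illustration of item 2, here’s a minimal sketch of what a fallback AoA estimate might look like, assuming still air and small angles; the function name and the example numbers are mine, purely illustrative, and nothing like a certified estimator.

import math

# A minimal sketch (illustrative only, not a certified algorithm) of a
# fallback angle-of-attack estimate when the pitot-static data is suspect.
# In still air and at small angles, AoA is roughly pitch attitude minus the
# inertial flight path angle, which can be derived from inertial/GPS data
# rather than the unreliable air data.
def estimate_aoa_deg(pitch_deg, vertical_speed_mps, ground_speed_mps):
    flight_path_angle_deg = math.degrees(
        math.atan2(vertical_speed_mps, ground_speed_mps))
    return pitch_deg - flight_path_angle_deg

# Example: 5 degrees nose up, 2 m/s climb at 230 m/s ground speed.
print(f"AoA ~ {estimate_aoa_deg(5.0, 2.0, 230.0):.1f} deg")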

As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish, in fact it is likely to increase if the past history of automation is any guide to the future.

Notes

1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps as the aircraft was already in a safe state and waited for the icing induced condition to clear.

2. Although the Airbus Back Up Speed Display (BUSS) does use angle-of-attack data to provide a speed range and GPS height data to replace barometric altitude, it has problems at high altitude where mach number rather than speed becomes significant and the stall threshold changes with mach number (which it does not know). As a result its use is (as per Airbus manuals) restricted to below FL250.

3. What system designers do, in the abstract, is decompose and allocate system level behaviours to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except, apparently, if the component in question is a human and therefore considered to be ‘outside’ your system.

4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).

5. For example in the Airbus design although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second they are not directly available to aircrew.

6. Yet another way of looking at the problem is that the principles of ecological design need to be applied to the aircrew task of dealing with contingency situations.

7. For example in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel and deselect two ADRs… which ones, and the criteria for choosing them, are not however detailed by the manufacturer.

8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.

One of the perennial problems we face in a system safety program is how to come up with a convincing proof for the proposition that a system is safe. Because it’s hard to prove a negative (in this case the absence of future accidents) the usual approach is to pursue a proof by contradiction, that is, develop the negative proposition that the system is unsafe, then prove that this is not true, normally by showing that the set of identified specific propositions of ‘un-safety’ have been eliminated or controlled to an acceptable level. Enter the term ‘hazard’, which in this context is simply shorthand for a specific proposition about the unsafeness of a system. Now interestingly when we parse the set of definitions of hazard we find the recurring use of terms like ‘condition’, ‘state’, ‘situation’ and ‘events’ that, should they occur, will inevitably lead to an ‘accident’ or ‘mishap’. So broadly speaking a hazard is an explanation, based on a defined set of phenomena, that argues that if they are present, and given there exists some relevant domain source (1) of hazard, an accident will occur. All of which seems to indicate that hazards belong to a class of explanatory models called covering laws. As an explanatory class, covering law models were developed by the philosophers Hempel and Popper because of what they saw as problems with an over-reliance on inductive arguments as to causality.

As a covering law explanation of unsafeness a hazard posits phenomenological facts (system states, human errors, hardware/software failures and so on) that confer what’s called nomic expectability on the accident (the thing being explained). That is, the phenomenological facts, combined with some covering law (natural and logical), require the accident to happen, and this is what we call a hazard. We can see an archetypal example in the Source-Mechanism-Outcome model of Swallom, i.e. if we have both a source and a set of mechanisms in that model then we may expect an accident (Ericson 2005). While logical positivism had the last nails driven into its coffin by Kuhn and others in the 1960s, and it’s true, as Kuhn and others pointed out, that covering law explanations have their fair share of problems, so too do other methods (2). The one advantage that covering law models do possess over other explanatory models however is that they largely avoid the problems of causal arguments. Which may well be why they persist in engineering arguments about safety.

Notes

1. The source in this instance is the ‘covering law’.

2. Such as counterfactual, statistical relevance or causal explanations.

References

Ericson, C.A. Hazard Analysis Techniques for System Safety, page 93, John Wiley and Sons, Hoboken, New Jersey, 2005.

The Sydney Morning Herald published an article this morning that recounts the QF72 midair accident from the point of view of the crew and passengers; you can find the story at this link. I’ve previously covered the technical aspects of the accident here, the underlying integrative architecture program that brought us to this point here, and the consequences here. So it was interesting to reflect on the event from the human perspective. Karl Weick points out in his influential paper on the Mann Gulch fire disaster that small organisations, for example the crew of an airliner, are vulnerable to what he termed a cosmology episode, that is, a moment in which one abruptly and deeply feels that the universe is no longer a rational, orderly system. In the case of QF72 this was initiated by the simultaneous stall and overspeed warnings, followed by the abrupt pitch over of the aircraft as the flight protection laws engaged for no reason.

Weick further posits that what makes such an episode so shattering is that both the sense of what is occurring and the means to rebuild that sense collapse together. In the Mann Gulch blaze the fire team’s organisation attenuated and finally broke down as the situation eroded until at the end they could not comprehend the one action that would have saved their lives, to build an escape fire. In the case of air crew they implicitly rely on the aircraft’s systems to `make sense’ of the situation, a significant failure such as occurred on QF72 denies them both understanding of what is happening and the ability to rebuild that understanding. Weick also noted that in such crises organisations are important as they help people to provide order and meaning in ill defined and uncertain circumstances, which has interesting implications when we look at the automation in the cockpit as another member of the team.

“The plane is not communicating with me. It’s in meltdown. The systems are all vying for attention but they are not telling me anything…It’s high-risk and I don’t know what’s going to happen.”

Capt. Kevin Sullivan (QF72 flight)

From this Weickian viewpoint we see the aircraft’s automation as both part of the situation, ‘what is happening?’, and as a member of the crew, ‘why is it doing that, can I trust it?’ Thus the crew of QF72 were faced with both a vu jàdé moment and the allied disintegration of the human-machine partnership that could help them make sense of the situation. The challenge that the QF72 crew faced was not to form a decision based on clear data and well rehearsed procedures from the flight manual; instead they faced a much more unnerving loss of meaning as the situation outstripped their past experience.

“Damn it! We’re going to crash. It can’t be true!” (copilot #1)

“But, what’s happening?” (copilot #2)

AF447 CVR transcript (final words)

Nor was this an isolated incident. One study of other such ‘unreliable airspeed’ events found that errors in understanding were both far more likely to occur than other error types and, when they did occur, much more likely to end in a fatal accident. In fact the study found that all accidents with a fatal outcome were categorised as involving an error in detection or understanding, with the majority being errors of understanding. From Weick’s perspective then, the collapse of sensemaking is the knock-out blow in such scenarios, as the last words of the Air France AF447 crew so grimly illustrate. Luckily in the case of QF72 the aircrew were able to contain this collapse and rebuild their sense of the situation; in the case of other such failures, such as AF447, they were not.

 

With the NSW Rural Fire Service fighting more than 50 fires across the state and the unprecedented hellish conditions set to deteriorate even further with the arrival of strong winds the question of the day is, exactly how bad could this get? The answer is unfortunately, a whole lot worse. That’s because we have difficulty as human beings in thinking about and dealing with extreme events… To quote from this post written in the aftermath of the 2009 Victorian Black Saturday fires.

So how unthinkable could it get? The likelihood of a fire versus its severity can be credibly modelled as a power law, a particular type of heavy tailed distribution (Clauset et al. 2007). This means that extreme events in the tail of the distribution are far more likely than predicted by a gaussian (the classic bell curve) distribution. So while a mega fire ten times the size of the Black Saturday fires is far less likely, it is not nearly as improbable as our intuitive availability heuristic would indicate. In fact it’s much worse than we might think: in heavy tailed distributions you need to apply what’s called the mean excess heuristic, which really translates to ‘the next worst event is almost always going to be much worse’…
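To put some purely illustrative numbers on that, here’s a minimal sketch comparing tail probabilities under a gaussian and under a power law; the parameters are my own choices, not a calibrated fire model.

import math

# Exceedance probability P(X > x) for a normal distribution.
def gaussian_exceedance(x, mean=1.0, sd=1.0):
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))

# Exceedance probability P(X > x) for a Pareto-type power law, x >= x_min.
def power_law_exceedance(x, x_min=1.0, alpha=2.5):
    return (x / x_min) ** (1 - alpha)

# A 'mega fire' ten times the typical size is effectively impossible under
# the gaussian, but very much alive in the heavy tail of the power law.
for scale in (2, 5, 10):
    print(f"{scale:>2}x event: gaussian {gaussian_exceedance(scale):.1e}, "
          f"power law {power_law_exceedance(scale):.1e}")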

So how did we get to this?  Simply put the extreme weather we’ve been experiencing is a tangible, current day effect of climate change. Climate change is not something we can leave to our children to really worry about, it’s happening now. That half a degree rise in global temperature? Well it turns out it supercharges the occurrence rate of extremely dry conditions and the heavy tail of bushfire severity. Yes we’ve been twisting the dragon’s tail and now it’s woken up…

2019 Postscript: Monday 11 November 2019 – NSW

And here we are in 2019 two years down the track from the fires of 2017 and tomorrow looks like being a beyond catastrophic fire day. Firestorms are predicted.

To err is inhuman

10/11/2016

Screwtape (Image source: end time info)

More infernal statistics

Well, here we are again. Given recent developments in the infernal region it seems like a good time for another post. Have you ever, dear reader, been faced with the problem of how to achieve an unachievable safety target? Well worry no longer! Herewith is Screwtape’s patented man based mitigation medicine.

The first thing we do is introduce the concept of ‘mitigation’, ah what a beautiful word that is. You see it’s saying that it’s OK that your system doesn’t meet its safety target, because you can claim credit for the action of an external mitigator in the environment. Probability wise, if the probability of an accident is P_a, then P_a equals the product of your system’s failure probability P_s and the probability that some external mitigation also fails P_m, or P_a = P_s × P_m.

So let’s use operator intervention as our mitigator, lovely and vague. But how to come up with a low enough P_m? Easy, we just look at the accident rate that has occurred for this or a like system and assume that these were due to operator mitigation being unsuccessful. Voila, we get our really small numbers. 

Now, an alert reader might point out that this is totally bogus and that P_m is actually the likelihood of operator failure when the system fails. Operators failing, as those pestilential authors of the WASH1400 study have pointed out, is actually quite likely. But I say, if your customer is so observant and on the ball then clearly you are not doing your job right. Try harder or I may eat your soul, yum yum. 

Yours hungrily, 

Screwtape.
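Stepping outside Screwtape’s letter for a moment, here’s a minimal sketch of his arithmetic with entirely made-up numbers; the point is how suspiciously small the back-solved P_m is compared with any realistic conditional probability of the operator failing once the system already has.

# Illustrative numbers only: the point is the shape of the argument.
p_s = 1e-3           # assumed probability the system fails (per demand)
p_a_observed = 1e-6  # observed accident rate for 'this or a like system'

# Screwtape's move: credit the whole gap to operator 'mitigation'.
p_m_claimed = p_a_observed / p_s
print(f"Claimed P_m   = {p_m_claimed:.0e}")   # 1e-03, suspiciously small

# The alert reader's point: P_m is the probability the operator fails
# *given* the system has already failed, i.e. under stress and surprise.
# Human reliability estimates in that situation are orders of magnitude
# higher (again, an illustrative value only).
p_m_realistic = 0.1
print(f"Realistic P_a = {p_s * p_m_realistic:.0e}")  # 1e-04, not 1e-06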

About time I hear you say! 🙂

Yes I’ve just rewritten a post on functional failure taxonomies to include how to use them to gauge the completeness of your analysis. This came out of a question I was asked in a workshop that went something like, ‘Ok mr big-shot consultant tell us, exactly how do we validate that our analysis is complete?’. That’s actually a fair question, standards like EUROCONTROL’s SAM Handbook and ARP 4761 tell you you ought to, but are not that helpful in the how to do it department. Hence this post.

Using a taxonomy to determine the coverage of the analysis is one approach to determining completeness. The other is to perform at least two analyses using different techniques and then compare the overlap of hazards using a capture/recapture technique. If there’s a high degree of overlap you can be confident there’s only a small hidden population of hazards as yet unidentified. If there’s a very low overlap, you may have a problem.
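For the curious, here’s a minimal sketch of the capture/recapture arithmetic (the Lincoln-Petersen estimator) using hypothetical hazard counts; the analysis names and numbers are invented for illustration.

# Lincoln-Petersen estimate of the total hazard population from two
# independent analyses. Hypothetical counts, for illustration only.
def estimate_total_hazards(n1, n2, overlap):
    if overlap == 0:
        raise ValueError("No overlap: the estimate is unbounded, be worried.")
    return (n1 * n2) / overlap

# Say a HAZOP finds 40 hazards, a functional failure analysis finds 35,
# and 30 hazards appear in both lists.
n1, n2, overlap = 40, 35, 30
n_total = estimate_total_hazards(n1, n2, overlap)
hidden = n_total - (n1 + n2 - overlap)
print(f"Estimated total ~ {n_total:.0f} hazards, ~{hidden:.0f} still hidden")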

The 15 commandments

04/09/2016


The 15 commandments of the god of the machine

Herewith, are the 15 commandments for thine safety critical software as spoken by the machine god unto his prophet Kopetz.

  1. Thou shalt regard the system safety case as thy tabernacle of safety and derive thine critical software failure modes and requirements from it.
  2. Thou shalt adopt a fundamentally safe architecture and define thy fault tolerance hypothesis as part of this. Even unto the definition of fault containment regions, their modes of failure and likelihood.
  3. Thine fault tolerance shall include start-up, operating and shutdown states.
  4. Thine system shall be partitioned to ‘divide and conquer’ the design. Yea, such partitioning shall include the precise specification of component interfaces by time and value such that all manner of men shall comprehend them.
  5. Thine project team shall develop a consistent model of time and state for even unto the concept of states and fault recovery by voting is the definition of time important.
  6. Yea even though thou hast selected a safety architecture pleasing to the lord, yet it is but a house built upon the sand, if no ‘programming in the small’ error detection and fault recovery is provided.
  7. Thou shall ensure that errors are contained and do not propagate through the system, for an error idly propagated to a service interface is displeasing to the lord god of safety and invalidates your righteous claims of independence.
  8. Thou shall ensure independent channels and components do not have common mode failures, for it is said that homogeneous redundant channels protect only from random hardware failures, neither from the common external cause such as EMI or power loss, nor from the common software design fault.
  9. Thine voting software shall follow the self-confidence principle for it is said that if the self-confidence principle is observed then a correct FCR will always make the correct decision under the assumption of a single faulty FCR, and only a faulty FCR will make false decisions.
  10. Thou shall hide and separate thy fault-tolerance mechanisms so that they do not introduce fear, doubt and further design errors unto the developers of the application code.
  11. Thou shall design your system for diagnosis, for it is said that even a righteously designed fault tolerant system may hide such faults from view, whereas thy system’s maintainers must replace the affected LRU.
  12. Thine interfaces shall be helpful and forgive the operator his errors, neither shall thine system dump the problem in the operator’s lap without prior warning of impending doom.
  13. Thine software shall record every single anomaly for your lord god requires that every anomaly observed during operation must be investigated until a root cause is defined
  14. Thou shalt mitigate further hazards introduced by your design decisions, for better it is that you not program in C++, yet still is it righteous to prevent the dangling of thine pointers and memory leaks.
  15. Thou shalt develop a consistent fault recovery strategy such that even in the face of violations of your fault hypothesis thine system shall restart and never give up.

MH370 underwater search area map (Image source: Australian Govt)

After millions of dollars and years of effort the ATSB has suspended its search for the wreck of MH370. There’s some bureaucratic weasel words, but we are done people. Of course had the ATSB applied Bayesian search techniques, as the USN did in the successful search for its missing USS Scorpion, we might actually know where it is.


A short article on (you guessed it) risk, uncertainty and unpleasant surprises for the 25th Anniversary issue of the UK SCS Club’s Newsletter, in which I introduce a unified theory of risk management that brings together aleatory, epistemic and ontological risk management and formalises the Rumsfeld four quadrant risk model which I’ve used for a while as a teaching aid.

My thanks once again to Felix Redmill for the opportunity to contribute.  🙂

Joshua Brown screen grab

Keep your eyes on the road, and your hands upon the wheel…

The first fatality involving the use of Tesla’s autopilot* occurred last May. The Guardian reported that the autopilot sensors on the Model S failed to distinguish a white tractor-trailer crossing the highway against a bright sky and promptly tried to drive under the trailer, with decapitating results. What’s emerged is that the driver had a history of driving at speed and also of using the automation beyond the maker’s intent, e.g. operating the vehicle hands off rather than hands on, as the screen grab above indicates. Indeed recent reports indicate that immediately prior to the accident he was travelling fast (maybe too fast) whilst watching a Harry Potter DVD. There also appears to be a community of like minded spirits out there who are intent on seeing how far they can push the automation… sigh.  Continue Reading…

System Safety Fundamentals Concept Cloud

Have just updated the safety case module for the system safety course I teach at UNSW. Have revised it to include John Rushby’s approach to determining the soundness and strength of a safety argument (I like his simplification and separation of concerns strategy). Enjoy!


Why writing a safety case might (actually) be a good idea

Frequent readers of my blog would probably realise that I’m a little sceptical of safety cases; as Scrooge remarked to Marley’s ghost, “There’s more of gravy than of grave about you, whatever you are!” So too for safety cases, oft more gravy than gravitas about them in my opinion, regardless of what their proponents might think.

Continue Reading…

Here’s a short presentation I gave on the ramifications of the Australian model WHS Act (2011) for engineers when they engage in, or oversee, design. The act is a complex beast, and the ramifications of it have not yet fully sunk into the engineering community.

One particularly contentious area is the application of the act to plant and materials that are imported. While the guidance material for the act gives the example of a supplier performing additional testing of the goods to demonstrate they meet Australian Standards, the reality is, well, a little different.


Safety cases and that room full of monkeys

Back in 1943, the French mathematician Émile Borel published a book titled Les probabilités et la vie, in which he stated what has come to be called Borel’s law which can be paraphrased as, “Events with a sufficiently small probability never occur.” Continue Reading…

Just updated the course notes for safety cases and argument to include more on how to represent safety cases if you are not graphically inclined. All in preparation for the next system safety course in July 2016 at ADFA places still open folk! A tip o’ the hat to Chris Holloway whose work prompted the additional material. 🙂


A very late draft

I originally gave a presentation at the 2015 ASSC Conference, but never got around to finishing the companion paper. Not my most stellar title either. The paper is basically how to leverage off the very close similarities between the objectives of the WHS Act (2011) and those of MIL-STD-882C, yes that standard. You might almost think the drafters of the legislation had a system safety specialist advising them… Recently I’ve had the necessity to apply this approach on another project (a ground system this time) so I took the opportunity to update the draft paper as an aide memoire, and here it is!

🙂


One of the problems that we face in estimating risk is that as our uncertainty increases, our ability to express it in a precise fashion (e.g. numerically) weakens, to the point where for deep uncertainty (1) we definitionally cannot make a direct estimate of risk in the classical sense. Continue Reading…

System Safety Fundamentals Concept Cloud

System safety course, now with more case studies and software safety!

Have just added a couple of case studies and some course notes on software hazards and integrity partitioning, because hey, I know you guys love that sort of stuff 🙂

Safety course notes

03/11/2015

System Safety Fundamentals Concept Cloud

I have finally got around to putting my safety course notes up, enjoy. You can also find them off the main menu.

Feel free to read and use under the terms of the associated creative commons license. I’d note that these are course notes so I use a large amount of example material from other sources (because hey, a good example is a good example right?) and where I have a source these are acknowledged in the notes. If you think I’ve missed a citation or made an error, then let me know.

To err is human, but to really screw it up takes a team of humans and computers…

How did a state of the art cruiser operated by one of the world’s superpowers end up shooting down an innocent passenger aircraft? To answer that question (at least in part) here’s a case study, part of the system safety course I teach, that looks at some of the causal factors in the incident.

In the immediate aftermath of this disaster there was a lot of reflection, and work done, on how humans and complex systems interact. However one question that has so far gone unasked is simply this. What if the crew of the USS Vincennes had just used the combat system as it was intended? What would have happened if they’d implemented a doctrinal ruleset that reflected the rules of engagement that they were operating under and simply let the system do its job? After all it was not the software that confused altitude with range on the display, or misused the IFF system, or was confused by track IDs being recycled… no, that was the crew.

Consider the effect that the choice of a single word can have upon the success or failure of a standard. The standard is DO-278A, and the word is ‘approve’. DO-278 is the ground world’s version of the aviation community’s DO-178 software assurance standard, intended to bring the same level of rigour to the software used for navigation and air traffic management. There’s just one tiny difference: while DO-178 uses the word ‘certify’, DO-278 uses the word ‘approve’, and in that one word lies a vast difference in the effectiveness of these two standards.

DO-178C has traditionally been applied in the context of an independent certifier (such as the FAA or JAA) who does just that, certifies that the standard has been applied appropriately and that the design produced meets the standard. Certification is independent of the supplier/customer relationship, which has a number of clear advantages. First the certifying body is indifferent as to whether the applicant meets or does not meet the requirements of DO-178C so has greater credibility when certifying as they are clearly much less likely to suffer from any conflict of interest. Second, because there is one certifying agency there is consistent interpretation of the standard and the fostering and dissemination of corporate knowledge across the industry through advice from the regulator.

Turning to DO-278A we find that the term ‘approve’ has mysteriously (1) replaced the term ‘certify’. So who, you may ask, can approve? In fact what does approve mean? Well, long answer short, anyone can approve and it means whatever you make of it. What usually results is the standard being invoked as part of a contract between supplier and customer, with the customer then acting as the ‘approver’ of the standard’s application. This has obvious and significant implications for the degree of trust that we can place in the approval given by the customer organisation. Unlike an independent certifying agency the customer clearly has a corporate interest in acquiring the system, which may well conflict with the objective of fully complying with the requirements of the standard. Given that ‘approval’ is given on a contract basis between two organisations and often cloaked in non-disclosure agreements, there is also little to no opportunity for the dissemination of useful learnings as to how to meet the standard. Finally, when dealing with previously developed software the question becomes not just ‘did you apply the standard?’, but also ‘who was it that actually approved your application?’ and ‘how did they actually interpret the standard?’.

So what to do about it? To my mind the unstated success factor for the original DO-178 standard was in fact the regulatory environment in which it was used. If you want DO-278A to be more than just a paper tiger then you should also put in place a mechanism for independent certification. In these days of smaller government this is unlikely to involve a government regulator, but there’s no reason why (for example) the independent safety assessor concept embodied in IEC 61508 could not be applied with appropriate checks and balances (2). Until that happens though, don’t set too much store by pronouncements of compliance to DO-278.

Final thought, I’m currently renovating our house and have had to employ an independent certifier to sign off on critical parts of the works. Now if I have to do that for a home renovation, I don’t see why some national ANSP shouldn’t have to do it for their bright and shiny toys.

Notes

1. Perhaps Screwtape consultants were advising the committee. 🙂

2. One of the problems with how 61508 implements the ISA is that they’re still paid by the customer, which leads in turn to the agency problem. A better scheme would be an industry fund into which all players contribute and from which the ISA agent is paid.


…for my boat is so small and the ocean so huge

For a small close-knit community like the submarine service the loss of a boat and its crew can strike doubly hard. The USN’s response to this disaster was both effective and long lasting. Doubly impressive given it was implemented at the height of the Cold War. As part of the course that I teach on system safety I use the Thresher as an important case study in organisational failure, and recovery.

Postscript

The RAN’s Collins class Subsafe program derived its strategic principles in large measure from the USN’s original program. The successful recovery of HMAS Dechaineux from a flooding incident at depth illustrates the success of both the RAN’s Subsafe program and also its antecedent.

Interesting documentary on SBS about the Germanwings tragedy, if you want a deeper insight see my post on the dirty little secret of civilian aviation. By the way, the two person rule only works if both those people are alive.

What burns in Vegas…

Ladies and gentlemen you need to leave, like leave your luggage!

This has been another moment of aircraft evacuation Zen.

Lady Justice (Image source: Jongleur CC-BY-SA-3.0)

Or how I learned to stop worrying about trifles and love the Act

One of the Achilles heels of the current Australian WH&S legislation is that it provides no clear point at which you should stop caring about potential harm. While there are reasons for this, it does mean that we can end up with some theatre of the absurd moments where someone seriously proposes paper cuts as a risk of concern.

The traditional response to such claims of risk is to point out that actually the law rarely concerns itself with such trifles. Or more pragmatically, as you are highly unlikely to be prosecuted over a paper cut it’s not worth worrying about. Continue Reading…


Piece of wing found on La Réunion Island (Image source: reunion 1ere)


Why this bit of wreckage is unlikely to affect the outcome of the MH370 search

If this really is a flaperon from MH370 then it’s good news in a way, because we could use wind and current data for the Indian Ocean to determine where it might have gone into the water. That in turn could be used to update a probability map of where we think that MH370 went down, by adjusting our priors in the Bayesian search strategy. Thereby ensuring that all the information we have is fruitfully integrated into our search strategy.

Well… perhaps it could, if the ATSB were actually applying a Bayesian search strategy, but apparently they’re not. So the ATSB is unlikely to get the most out of this piece of evidence, and the only real upside that I see to this is that it should shut down most of the conspiracy nut jobs who reckoned MH370 had been spirited away to North Korea or some such. 🙂
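As a crude illustration of what such an update would look like (and emphatically not the ATSB’s, or anyone else’s, actual method), here’s a minimal sketch of a Bayesian update over a notional search grid; the prior and the reverse-drift likelihood values are entirely hypothetical.

import numpy as np

# Prior probability of the wreck lying in each of five notional search
# cells, e.g. derived from the satellite handshake analysis (made-up values).
prior = np.array([0.05, 0.20, 0.40, 0.25, 0.10])

# Likelihood of debris reaching La Reunion given a crash in each cell,
# from a reverse-drift model (again, purely illustrative values).
likelihood = np.array([0.30, 0.25, 0.10, 0.05, 0.02])

# Bayes' rule: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

for i, (p, q) in enumerate(zip(prior, posterior)):
    print(f"cell {i}: prior {p:.2f} -> posterior {q:.2f}")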

The offending PCA serial cable linking the comms module to the motherboard (Image source: Billy Rios)

Hannibal ante portas!

A recent article in Wired discloses how hospital drug pumps can be hacked and the firmware controlling them modified at will. Although in theory the comms module and motherboard should be separated by an air gap, in practice there’s a serial link cunningly installed to allow firmware to be updated via the interwebz.

As the Romans found, once you’ve built a road that a legion can march down it’s entirely possible for Hannibal and his elephants to march right up it. Thus proving once again, if proof be needed, that there’s nothing really new under the sun. In a similar vein we probably won’t see any real reform in this area until someone is actually killed or injured.

This has been another Internet of Things moment of zen.

Rocket landing attempt (Image source: SpaceX)

How to make rocket landings a bit easier

No one should underestimate how difficult landing a booster rocket is, let alone onto a robot barge that’s sitting in the ocean. The booster has to decelerate to a landing speed on a hatful of fuel, then maintain a fixed orientation to the deck while it descends, all the while counteracting the dynamic effects of a tall thin flexible airframe, fuel slosh, c of g changes, wind and finally landing gear bounce when you do hit. It’s enough to make an autopilot cry. Continue Reading…

Here’s a companion tutorial to the one on integrity level partitioning. This addresses more general software hazards and how to deal with them. Again you can find a more permanent link on my publications page. Enjoy 🙂

In celebration of upgrading the site to WP Premium here’s some gratuitous eye candy 🙂

A little more seriously, PIO is one of those problems that, contrary to what the name might imply, requires one to treat the aircraft and pilot as a single control system.

The problem with people

The HAL effect, named after the eponymous anti-hero of Stanley Kubrick and Arthur C. Clarke’s film 2001, is the tendency for designers to implicitly embed their cultural biases into automation. While such biases are undoubtedly a very uncertain guide it might also be worthwhile to look at the 2001 Odyssey mission from HAL’s perspective for a moment. Here we have the classic long duration space mission with a minimalist two man complement for the cruise phase. The crew and the ship are on their own. In fact they’re about as isolated as it’s possible to be as human beings, and help is a very, very long way away. Now from HAL’s perspective humans are messy, fallible creatures prone to making basic errors in even the most routine of tasks. Not to mention that they annoyingly use emotion to inform even the most basic of decisions. Then there’s the added complication that they’re social creatures apt in even the most well matched of groups to act in ways that a dispassionate external observer could only consider as confusing and often dysfunctional. Finally they break, sometimes in ways that can actively endanger others and the mission itself.

So from a mission assurance perspective would it really be appropriate to rely on a two man crew in the vastness of space? The answer is clearly no; even the most well adjusted of cosmonauts can exhibit psychological problems when isolated in the vastness of space. While a two man crew may be attractive from a cost perspective it’s still vulnerable to a single point of human failure. Or to put it more brutally, murder and suicide are much more likely to be successful in small crews. Such scenarios, however dark they may be, need to be guarded against if we intend to use a small crew. But how to do it? If we add more crew to the cruise phase complement then we also add all the logistics tail that goes along with it, and our mission may become unviable. Even if cost were not a consideration, small groups isolated for long periods are prone to yet other forms of psychological dysfunction (1). Humans it seems exhibit a set of common mode failures that are difficult to deal with, so what to do?

Well, one way to guard against common mode failures is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to human affect driven processing. Of course to be effective we’re talking a high end AI, with a sufficient grasp of the theory of mind and the subtleties of human psychology and group dynamics to be able to make usefully accurate predictions of what the crew will do next (2). With that insight goes the requirement for autonomy in vetoing illogical and patently hazardous crew actions, e.g. “I’m sorry Dave but I’m afraid I can’t let you override the safety interlocks on the reactor fuel feed…“. From that perspective we might have some sympathy for HAL’s reaction to his other crewmates plotting his cybernetic demise.

Which may all seem a little far-fetched; after all an AI of that sophistication is another twenty to thirty years away, and long duration deep space missions are probably that far away as well. On the other hand there’s currently a quiet conversation going on in the aviation industry about the next step for automation in the cockpit, e.g. one pilot in the cockpit of large airliners. After all, so the argument goes, pilots are expensive beasts and with the degree of automated support available today, surely we don’t need two men in the cockpit? Well, if we’re thinking purely about economics then sure, one could make that argument, but on the other hand as the awful reality of the Germanwings tragedy sinks in we also need to understand that people are simply not perfect, and that sometimes (very rarely (3)) they can fail catastrophically. Given that we know that reducing crew levels down to two increases the risk of successful suicide by airliner, one could ask what happens to the risk if we go to single pilot operations? I think we all know what the answer to that would be.

Where is a competent AI (HAL 9000) when you need one? 🙂

Notes

1. From polar exploration we know that small exploratory teams of three persons are socially unstable and should be avoided. Which then drives the team size to four.

2. As an aside, the inability of HAL to understand the basics of human motivation always struck me as a false note in Kubrick’s 2001 movie. An AI as smart as Hal apparently was, and yet lacking even an undergraduate understanding of human psychology, maybe not.

3. Remember that we are in the tail of the aviation safety program where we are trying to mitigate hazards whose likelihoods are very, very rare. However given that they aren’t mitigated they dominate the residual statistic.

Comet (Image source: Public domain)

Amidst all the soul searching, and pontificating, about how to deal with the problem of pilots ‘suiciding by airliner’, you are unlikely to find any real consideration of how we have arrived at this pass. There is, as it turns out, a simple one word answer, and that word is efficiency. Back when airliners first started to fly they needed a big aircrew; for example on the Comet airliner you’d find a pilot, copilot, navigator and flight engineer. Now while that’s a lot of manpower to pay for, it did possess one hidden advantage, and that was that with a crew size of three or more it’s very, very difficult (OK, effectively impossible) for any one member of the flight crew to attempt to commit suicide. If you think I exaggerate then go see if there has ever been a successful suicide by airliner where there were three or more aircrew in the cockpit. Nope, none. But the aviation industry is one driven by cost. Each new generation of aircraft needs to be cheaper to operate, which means that the airlines and airline manufacturers are locked in a ruthless evolutionary arms race to do more with less. One of the easiest ways to reduce operating costs is to reduce the number of aircrew needed to fly the big jets. Fewer aircrew, greater automation is an equation that delivers more efficient operations. And before you the traveller get too judgemental about all this, just remember that the demand for cost reduction is in turn driven by our expectation as consumers that airlines can provide cheap mass airfare for the common man.

So we’ve seen the number of aircrew slowly reduce over the years, first the navigator went and then the flight engineer, until we finally arrived at our current standard two man flight crew. There’s just one small problem with this: if one of those pilots wants to dispose of the other there’s not a whole lot that can be done to prevent it. In our relentless pursuit of efficiency we have inadvertently eliminated a safety margin that we didn’t even realise was there. So what can we really do about it? Well the simple ‘we know it works’ answer is to go back to three crew in the cockpit, which effectively eliminates the hazard; of course that’s also a solution that’s unlikely to be taken up. In the absence of going back to three man crews, well, we get what we’re currently getting: aspirational statements about better management of stress and depression in aircrew, or the use of cabin crew to enforce no go alone rules. But when that cockpit door is closed it’s still one on one, and all such measures do in the final analysis is reduce the likelihood of the hazard by some hard to quantify amount; they don’t eliminate it. As long as we fly two man crews behind armoured doors the possibility, and therefore the hazard, unfortunately remains.

Happy flying 🙂


Another A320 crash

25/03/2015

Germanwings crash (Image source: AFP)

The Germanwings A320 crash

At this stage there’s not much more that can be said about the particulars of this tragedy that has claimed 150 lives in a mountainous corner of France. Disturbingly, again we have an A320 aircraft descending rapidly and apparently out of control, without the crew having any time to issue a distress call. Yet more disturbing is the thought that the crash might be due to the crew failing to carry out the workaround for two blocked AoA probes promulgated in this Emergency Airworthiness Directive (EAD) that was issued in December of last year. And, as the final and rather unpleasant icing on this particular cake, there is the followup question as to whether the problem covered by the directive might also have been a causal factor in the AirAsia flight 8501 crash. That, if it be the case, would be very, very nasty indeed.

Unfortunately at this stage the answer to all of the above questions is that no one knows, especially as the Indonesian investigators have declined to issue any further information on the causes of the AirAsia crash. However what we can be sure of is that, given the highly dependable nature of aircraft systems, the answer when it comes will comprise an apparently unlikely combination of events, actions and circumstance, because that is the nature of accidents that occur in high dependability systems. One thing that’s also for sure, there’ll be little sleep in Toulouse until the FDRs are recovered, and maybe not much after that….

Postscript

If, having read the EAD, you’re left wondering why it directed that two ADRs be turned off, it’s simply that by doing so you push the aircraft out of what’s called Normal law, where Alpha protection is trying to drive the nose down, into Alternate law, where the (erroneous) Alpha protection is removed. Of course in order to do so you need to be able to recognise, diagnose and apply the correct action, which also generally requires training.

MH370 underwater search area map (Image source: Australian Govt)

Bayes and the search for MH370

We are now approximately 60% of the way through searching the MH370 search area, and so far nothing. Which is unfortunate because as the search goes on the cost continues to go up for the taxpayer (and yes, I am one of those). What’s more unfortunate, and not a little annoying, is that through all this the ATSB continues to stonily ignore the use of a powerful search technique that’s been used to find everything from lost nuclear submarines to the wreckage of passenger aircraft. Continue Reading…

Here’s an interesting graph that compares Class A mishap rates for USN manned aviation (pretty much from float plane to Super-Hornet) against the USAF’s drone programs. Interesting that both programs steadily track down decade by decade, even in the absence of formal system safety programs for most of the time (1).

USN Manned Aviation vs USAF Drones

The USAF drone program starts out at around the 60 mishaps per 100,000 flight hour rate (equivalent to the USN transitioning to fast jets at the close of the 1940s) and maintains a steeper rate of decrease than the USN aviation program. As a result, while the USAF drone program is tail chasing the USN it still looks like it’ll hit parity with the USN sometime in the 2040s.

So why is the USAF drone program doing better in pulling down the accident rate, even when they don’t have a formal MIL-STD-882 safety program?

Well for one, a higher degree of automation does have comparative advantages. Although the USN’s carrier aircraft can do auto-land, they generally choose not to, as pilots need to keep their professional skills up, and human error during landing/takeoff inevitably drives the mishap rate up. Therefore a simple thing like implementing an auto-land function for drones (landing a drone is, as it turns out, not easy) has a comparatively greater bang for your safety buck. There are also inherently higher risks of loss of control and mid-air collision when air combat manoeuvring, or of running into things when flying helicopters at low level, which are operational hazards that drones generally don’t have to worry about.

For another, the development cycle for drones tends to be quicker than for manned aviation, and drones have a ‘somewhat’ looser certification regime, so improvements from the next generation of drone design tend to roll into an expanding operational fleet more quickly. Having a higher cycle rate also helps retain and sustain the corporate memory of the design teams.

Finally there’s the lessons learned effect. With drones the hazards usually don’t need to be identified and then characterised; in contrast with the early days of jet age naval aviation, the hazards drones face are usually well understood, with well understood solutions, and whether these are addressed effectively has more to do with programmatic cost concerns than a lack of understanding. Conversely when it actually comes time to do something like put de-icing onto a drone, there’s a whole lot of experience that can be brought to bear with a very good chance of first time success.

A final question. Looking at the above do we think that the application of rigorous ‘FAA like’ processes or standards like ARP 4761, ARP 4754 and DO-178 would really improve matters?

Hmmm… maybe not a lot.

Notes

1. As a historical note while the F-14 program had the first USN aircraft system safety program (it was a small scale contractor in house effort) it was actually the F/A-18 which had the first customer mandated and funded system safety program per MIL-STD-882. USAF drone programs have not had formal system safety programs, as far as I’m aware.
Continue Reading…

SR-71 flight instruments (Image source: triddle)

How an invention that flew on the SR-71 could help commercial aviation today

In a previous post on unusual attitude I talked about the use of pitch ladders as a means of providing greater attensity to aircraft attitude, as well as a better indication of what the aircraft is doing having entered into it. There are, of course, still disadvantages to this because such data in a commercial aircraft is usually presented ‘eyes down’, and in high stress, high workload situations it can be difficult to maintain an instrument scan pattern. There is however an alternative, and one that has a number of allied advantages. Continue Reading…

Unreliable airspeed events pose a significant challenge (and safety risk) because such situations throw onto aircrew the most difficult (and error prone) of human cognitive tasks, that of ‘understanding’ a novel situation. This results in a double whammy for unreliable airspeed incidents. That is the likelihood of an error in ‘understanding’ is far greater than any other error type, and having made that sort of error it’s highly likely that it’s going to be a fatal one. Continue Reading…

15 Minutes

11/02/2015

What the future of high assurance may look like: DARPA’s HACMS, open source and formal from the ground up.

A Critical Systems Blog

Some of the work I lead at Galois was highlighted in the initial story on 60 Minutes last night, a spot interviewing Dan Kaufman at DARPA. I’m Galois’ principal investigator for the HACMS program, focused on building more reliable software for automobiles and aircraft and other embedded systems. The piece provides a nice overview for the general public on why software security matters and what DARPA is doing about it; HACMS is one piece of that story.

I was busy getting married when filming was scheduled, but two of my colleagues (Dylan McNamee and Pat Hickey) appear in brief cameos in the segment (don’t blink!). Good work, folks! I’m proud of my team and the work we’ve accomplished so far.

You can see more details about how we have been building better programming languages for embedded systems and using them to build unpiloted air vehicle software here.

View original post