Update to the MH-370 hidden lesson post just published, in which I go into a little more detail on what I think could be done to prevent another such tragedy.
Archives For Aerospace Safety
The search for MH370 will end next tuesday with the question of it’s fate no closer to resolution. There is perhaps one lesson that we can glean from this mystery, and that is that when we have a two man crew behind a terrorist proof door there is a real possibility that disaster is check-riding the flight. As Kenedi et al. note in a 2016 study five of the six recorded murder-suicide events by pilots of commercial airliners occurred after they were left alone in the cockpit, in the case of both the Germanwings 9525 or LAM 470 this was enabled by one of the crew being able to lock the other out of the cockpit. So while we don’t know exactly what happened onboard MH370 we do know that the aircraft was flown deliberately to some point in the Indian ocean, and on the balance of the probabilities that was done by one of the crew with the other crew member unable to intervene, probably because they were dead.
As I’ve written before the combination of small crew sizes to reduce costs, and a secure cockpit to reduce hijacking risk increases the probability of one crew member being able to successfully disable the other and then doing exactly whatever they like. Thus the increased hijacking security measured act as a perverse incentive for pilot murder-suicides may over the long run turn out to kill more people than the risk of terrorism (1). Or to put it more brutally murder and suicide are much more likely to be successful with small crew sizes so these scenarios, however dark they may be, need to be guarded against in an effective fashion (2).
One way to guard against such common mode failures of the human is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to our affect driven processing, with a sufficient grasp of the theory of mind and the subtleties of human psychology and group dynamics to be able to make usefully accurate predictions of what the crew will do next. With that insight goes the requirement for autonomy in vetoing of illogical and patently hazardous crew actions, e.g ”I’m sorry Captain but I’m afraid I can’t let you reduce the cabin air pressure to hazardous levels”. The really difficult problem is of course building something sophisticated enough to understand ‘hinky’ behaviour and then intervene. There are however other scenario’s where some form of lesser AI would be of use. The Helios Airways depressurisation is a good example of an incident where both flight crew were rendered incapacitated, so a system that does the equivalent of “Dave! Dave! We’re depressurising, unless you intervene in 5 seconds I’m descending!” would be useful. Then there’s the good old scenario of both the pilots falling asleep, as likely happened at Minneapolis, so something like “Hello Dave, I can’t help but notice that your breathing indicates that you and Frank are both asleep, so WAKE UP!” would be helpful here. Oh, and someone to punch out a quick “May Day” while the pilot’s are otherwise engaged would also help tremendously as aircraft going down without a single squawk recurs again and again and again.
I guess I’ve slowly come to the conclusion that two man crews while optimised for cost are distinctly sub-optimal when it comes to dealing with a number of human factors issues and likewise sub-optimal when it comes to dealing with major ‘left field’ emergencies that aren’t in the QRM. Fundamentally a dual redundant design pattern for people doesn’t really address the likelihood of what we might call common mode failures. While we probably can’t get another human crew member back in the cockpit, working to make the cockpit automation more collaborative and less ‘strong but silent’ would be a good start. And of course if the aviation industry wants to keep making improvements in aviation safety then these are the sort of issues they’re going to have to tackle. Where is a good AI, or even an un-interuptable autopilot when you really need one?
1. Kenedi (2016) found from 1999 to 2015 that there had been 18 cases of homicide-suicide involving 732 deaths.
2. No go alone rules are unfortunately only partially effective.
Kenedi, C., Friedman, S.H.,Watson, D., Preitner, C., Suicide and Murder-Suicide Involving Aircraft, Aerospace Medicine and Human Performance, Aerospace Medical Association, 2016.
So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?
In slightly longer form. If (for example) air data is so unreliable that your automation needs to automatically drop out of it’s primary mode, and your QRH procedure is then to manually fly pitch and thrust (1) then why not also automatically present a display page that only provides the data that pilots can trust and is needed to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’ where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software it’s over to you brave aviator” (3). This response is however something of a cop out as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:
- Redesign the attitude display with articulated pitch ladders, or a Malcom’s horizon to improve situational awareness.
- Provide a fallback AoA source using an AoA estimator.
- Provide actual direct access to flight data parameters such as mach number and AoA to support troubleshooting (5).
- Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
- Use non-aristotlean logic to better model the trustworthiness of air data.
- Provide the current master/slave hierarchy status amongst voting channels to aircrew.
- Provide an obvious and intuitive way to to remove a faulted channel allowing flight under reversionary laws (7).
- Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).
As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish, in fact it is likely to increase if the past history of automation is any guide to the future.
1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps as the aircraft was already in a safe state and waited for the icing induced condition to clear.
2. Although the Airbus Back Up Speed Display (BUSS) does use angle-of-attack data to provide a speed range and GPS height data to replace barometric altitude it has problems at high altitude where mach number rather than speed becomes significant and the stall threshold changes with mach number (which it doesn’t not know). As a result it’s use is 9as per Airbus manuals) below 250 FL.
3. What system designers do, in the abstract, is decompose and allocate system level behaviors to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except ‘apparently’ if the component in question is a human and therefore considered to be outside’ your system.
4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).
5. For example in the Airbus design although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second they are not directly available to aircrew.
6. Yet another way of looking at the problem is that the principles of ecological design needs to be applied to the aircrew task of dealing with contingency situations.
7. For example in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel, and deselect two ADRs…which ones and the criterion to choose which ones are not however detailed by the manufacturer.
8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.
The Sydney Morning Herald published an article this morning that recounts the QF72 midair accident from the point of view of the crew and passengers, you can find the story at this link. I’ve previously covered the technical aspects of the accident here, the underlying integrative architecture program that brought us to this point here and the consequences here. So it was interesting to reflect on the event from the human perspective. Karl Weick points out in his influential paper on the Mann Gulch fire disaster that small organisations, for example the crew of an airliner, are vulnerable to what he termed a cosmology episode, that is an abruptly one feels deeply that the universe is no longer a rational, orderly system. In the case of QF72 this was initiated by the simultaneous stall and overspeed warnings, followed by the abrupt pitch over of the aircraft as the flight protection laws engaged for no reason.
Weick further posits that what makes such an episode so shattering is that both the sense of what is occurring and the means to rebuild that sense collapse together. In the Mann Gulch blaze the fire team’s organisation attenuated and finally broke down as the situation eroded until at the end they could not comprehend the one action that would have saved their lives, to build an escape fire. In the case of air crew they implicitly rely on the aircraft’s systems to `make sense’ of the situation, a significant failure such as occurred on QF72 denies them both understanding of what is happening and the ability to rebuild that understanding. Weick also noted that in such crises organisations are important as they help people to provide order and meaning in ill defined and uncertain circumstances, which has interesting implications when we look at the automation in the cockpit as another member of the team.
“The plane is not communicating with me. It’s in meltdown. The systems are all vying for attention but they are not telling me anything…It’s high-risk and I don’t know what’s going to happen.”
Capt. Kevin Sullivan (QF72 flight)
From this Weickian viewpoint we see the aircraft’s automation as both part of the situation `what is happening?’ and as a member of the crew, `why is it doing that, can I trust it?’ Thus the crew of QF72 were faced with both a vu jàdé moment and the allied disintegration of the human-machine partnership that could help them make sense of the situation. The challenge that the QF72 crew faced was not to form a decision based on clear data and well rehearsed procedures from the flight manual, but instead they faced much more unnerving loss of meaning as the situation outstripped their past experience.
“Damn-it! We’re going to crash. It can’t be true! (copilot #1)
“But, what’s happening? copilot #2)
AF447 CVR transcript (final words)
Nor was this an isolated incident, one study of other such `unreliable airspeed’ events, found errors in understanding were both far more likely to occur than other error types and when they did much more likely to end in a fatal accident. In fact they found that all accidents with a fatal outcome were categorised as involving an error in detection or understanding with the majority being errors of understanding. From Weick’s perspective then the collapse of sensemaking is the knock out blow in such scenarios, as the last words of the Air France AF447 crew so grimly illustrate. Luckily in the case of QF72 the aircrew were able to contain this collapse, and rebuild their sense of the situation, in the case of other such failures, such as AF447, they were not.
Consider the effect that the choice of a single word can have upon the success or failure of a standard.The standard is DO-278A, and the word is, ‘approve’. DO-278 is the ground worlds version of the aviation communities DO-178 software assurance standard, intended to bring the same level of rigour to the software used for navigation and air traffic management. There’s just one tiny difference, while DO-178 use the word ‘certify’, DO-278 uses the word ‘approve’, and in that one word lies a vast difference in the effectiveness of these two standards.
DO-178C has traditionally been applied in the context of an independent certifier (such as the FAA or JAA) who does just that, certifies that the standard has been applied appropriately and that the design produced meets the standard. Certification is independent of the supplier/customer relationship, which has a number of clear advantages. First the certifying body is indifferent as to whether the applicant meets or does not meet the requirements of DO-178C so has greater credibility when certifying as they are clearly much less likely to suffer from any conflict of interest. Second, because there is one certifying agency there is consistent interpretation of the standard and the fostering and dissemination of corporate knowledge across the industry through advice from the regulator.
Turning to DO-278A we find that the term ‘approver’ has mysteriously (1) replaced the term ‘certify’. So who, you may ask, can approve? In fact what does approve mean? Well the long answer short is anyone can approve and it means whatever you make of it. What usually results in is the standard being invoked as part of a contract between supplier and customer, with the customer then acting as the ‘approver’ of the standards application. This has obvious and significant implications for the degree of trust that we can place in the approval given by the customer organisation. Unlike an independent certifying agency the customer clearly has a corporate interest in acquiring the system which may well conflict with the object of fully complying with the requirements of the standard. Give that ‘approval’ is given on a contract basis between two organisations and often cloaked in non-disclosure agreements there is also little to no opportunity for the dissemination of useful learnings as to how to meet the standard. Finally when dealing with previously developed software the question becomes not just ‘did you apply the standard?’, but also ‘who was it that actually approved your application?’ and ‘How did they actually interpret the standard?’.
So what to do about it? To my mind the unstated success factor for the original DO-178 standard was in fact the regulatory environment in which it was used. If you want DO-278A to be more than just a paper tiger then you should also put in place mechanism for independent certification. In these days of smaller government this is unlikely to involve a government regulator, but there’s no reason why (for example) the independent safety assessor concept embodied in IEC 61508 could not be applied with appropriate checks and balances (1). Until that happens though, don’t set too much store by pronouncements of compliance to DO-278.
Final thought, I’m currently renovating our house and have had to employ an independent certifier to sign off on critical parts of the works. Now if I have to do that for a home renovation, I don’t see why some national ANSP shouldn’t have to do it for their bright and shiny toys.
1. Perhaps Screwtape consultants were advising the committee. 🙂
2. One of the problems of how 61508 implement the ISA is that they’re still paid by the customer, which leads in turn to the agency problem. A better scheme would be an industry fund into which all players contribute and from which the ISA agent is paid.
Interesting documentary on SBS about the Germanwings tragedy, if you want a deeper insight see my post on the dirty little secret of civilian aviation. By the way, the two person rule only works if both those people are alive.