It is highly questionable whether total system safety is always enhanced by allocating functions to automatic devices rather than human operators, and there is some reason to believe that flight-deck automation may have already passed its optimum point.
Archives For Aerospace Safety
If you want to know where Crew Resource Management as a discipline started, then you need to read NASA Technical Memorandum 78482, “A Simulator Study of the Interaction of Pilot Workload With Errors, Vigilance, and Decisions”, by H.P. Ruffell Smith, the British-born physician and pilot. Before this study it was hours in the seat and line seniority that mattered when things went to hell. After it the aviation industry started to realise that crews rose or fell on the basis of how well they worked together, and that a good captain got the best out of his team. Today whether crews get it right, as they did on QF72, or terribly wrong, as they did on AF447, the lens that we view their performance through has been irrevocably shaped by the work of Ruffell Smith. From little seeds great oaks grow indeed.
Update to the MH-370 hidden lesson post just published, in which I go into a little more detail on what I think could be done to prevent another such tragedy.
The search for MH370 will end next Tuesday with the question of its fate no closer to resolution. There is perhaps one lesson that we can glean from this mystery, and that is that when we have a two man crew behind a terrorist proof door there is a real possibility that disaster is check-riding the flight. As Kenedi et al. note in a 2016 study, five of the six recorded murder-suicide events by pilots of commercial airliners occurred after they were left alone in the cockpit, and in the cases of both Germanwings 9525 and LAM 470 this was enabled by one of the crew being able to lock the other out of the cockpit. So while we don’t know exactly what happened onboard MH370 we do know that the aircraft was flown deliberately to some point in the Indian Ocean, and on the balance of probabilities that was done by one of the crew with the other crew member unable to intervene, probably because they were dead.
As I’ve written before, the combination of small crew sizes to reduce costs and a secure cockpit to reduce hijacking risk increases the probability of one crew member being able to successfully disable the other and then do exactly whatever they like. Thus the increased hijacking security measures act as a perverse incentive for pilot murder-suicides, and may over the long run turn out to kill more people than the risk of terrorism (1). Or to put it more brutally, murder and suicide are much more likely to be successful with small crew sizes, so these scenarios, however dark they may be, need to be guarded against in an effective fashion (2).
One way to guard against such common mode failures of the human is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to our affect driven processing, with a sufficient grasp of the theory of mind and the subtleties of human psychology and group dynamics to be able to make usefully accurate predictions of what the crew will do next. With that insight goes the requirement for autonomy in vetoing illogical and patently hazardous crew actions, e.g. “I’m sorry Captain but I’m afraid I can’t let you reduce the cabin air pressure to hazardous levels”. The really difficult problem is of course building something sophisticated enough to understand ‘hinky’ behaviour and then intervene. There are however other scenarios where some form of lesser AI would be of use. The Helios Airways depressurisation is a good example of an incident where both flight crew were incapacitated, so a system that does the equivalent of “Dave! Dave! We’re depressurising, unless you intervene in 5 seconds I’m descending!” would be useful. Then there’s the good old scenario of both the pilots falling asleep, as likely happened at Minneapolis, so something like “Hello Dave, I can’t help but notice that your breathing indicates that you and Frank are both asleep, so WAKE UP!” would be helpful here. Oh, and someone to punch out a quick “Mayday” while the pilots are otherwise engaged would also help tremendously, as aircraft going down without a single squawk recurs again and again and again.
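To make the “unless you intervene in 5 seconds I’m descending” idea a little more concrete, here is a minimal sketch of such a watchdog. Everything in it is an assumption for illustration: the class name, the cabin altitude threshold and the three action labels are hypothetical, and a real system would need sensor voting, latching and a great deal more besides.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative assumptions only, not certified values.
CABIN_ALT_LIMIT_FT = 10_000.0   # hypothetical cabin altitude alert threshold
CREW_RESPONSE_WINDOW_S = 5.0    # the "5 seconds" quoted in the text


@dataclass
class DepressurisationWatchdog:
    """Toy watchdog: alert the crew, then descend if nobody responds in time."""
    alert_started_at: Optional[float] = None

    def step(self, now_s: float, cabin_alt_ft: float, crew_acknowledged: bool) -> str:
        """Return 'monitor', 'alert' or 'descend' for this time step."""
        # Cabin is fine, or the crew has taken charge: stand down and reset.
        if cabin_alt_ft <= CABIN_ALT_LIMIT_FT or crew_acknowledged:
            self.alert_started_at = None
            return "monitor"
        # First time step above the limit: start the response clock.
        if self.alert_started_at is None:
            self.alert_started_at = now_s
        # No crew response within the window: the automation descends.
        if now_s - self.alert_started_at >= CREW_RESPONSE_WINDOW_S:
            return "descend"
        return "alert"
```

The design choice worth noting is that the automation only acts once the crew has had a stated chance to respond, which keeps it collaborative rather than ‘strong but silent’.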
I guess I’ve slowly come to the conclusion that two man crews, while optimised for cost, are distinctly sub-optimal when it comes to dealing with a number of human factors issues and likewise sub-optimal when it comes to dealing with major ‘left field’ emergencies that aren’t in the QRH. Fundamentally a dual redundant design pattern for people doesn’t really address the likelihood of what we might call common mode failures. While we probably can’t get another human crew member back in the cockpit, working to make the cockpit automation more collaborative and less ‘strong but silent’ would be a good start. And of course if the aviation industry wants to keep making improvements in aviation safety then these are the sort of issues they’re going to have to tackle. Where is a good AI, or even an uninterruptible autopilot, when you really need one?
1. Kenedi (2016) found from 1999 to 2015 that there had been 18 cases of homicide-suicide involving 732 deaths.
2. No go alone rules are unfortunately only partially effective.
Kenedi, C., Friedman, S.H., Watson, D., Preitner, C., Suicide and Murder-Suicide Involving Aircraft, Aerospace Medicine and Human Performance, Aerospace Medical Association, 2016.
So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?
In slightly longer form. If (for example) air data is so unreliable that your automation needs to automatically drop out of its primary mode, and your QRH procedure is then to manually fly pitch and thrust (1), then why not also automatically present a display page that only provides the data that pilots can trust and is needed to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’ where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software it’s over to you brave aviator” (3). This response is however something of a cop out, as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:
- Redesign the attitude display with articulated pitch ladders, or a Malcolm horizon, to improve situational awareness.
- Provide a fallback AoA source using an AoA estimator.
- Provide actual direct access to flight data parameters such as mach number and AoA to support troubleshooting (5).
- Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
- Use non-Aristotelian logic to better model the trustworthiness of air data.
- Provide the current master/slave hierarchy status amongst voting channels to aircrew.
- Provide an obvious and intuitive way to remove a faulted channel allowing flight under reversionary laws (7).
- Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).
As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish, in fact it is likely to increase if the past history of automation is any guide to the future.
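As one illustration of the ‘fallback AoA source’ suggestion above, here is a minimal sketch of a purely kinematic estimator. It rests on the textbook relation that, wings level and in still air, pitch attitude equals angle of attack plus flight path angle. The function name and its inputs (inertial pitch, inertial or GPS vertical speed and ground speed) are my assumptions, not a description of any Airbus function, and it ignores wind, bank and sideslip.

```python
import math


def estimate_aoa_deg(pitch_deg: float,
                     vertical_speed_mps: float,
                     ground_speed_mps: float) -> float:
    """Rough AoA estimate: pitch attitude minus flight path angle.

    Assumes wings level flight in still air, and uses no air data at all,
    only inertial/GPS quantities (hypothetical inputs for illustration).
    """
    if ground_speed_mps <= 0.0:
        raise ValueError("ground speed must be positive")
    # Flight path angle from the vertical and horizontal velocity components.
    flight_path_deg = math.degrees(math.atan2(vertical_speed_mps, ground_speed_mps))
    return pitch_deg - flight_path_deg


# Level cruise: nose 3 degrees up, no climb or descent, so AoA is about 3 degrees.
cruise_aoa = estimate_aoa_deg(3.0, 0.0, 230.0)

# Nose up but descending steeply: the estimate is far larger than the pitch angle,
# which is exactly the cue that a pitch-only display would hide.
mush_aoa = estimate_aoa_deg(15.0, -50.0, 50.0)
```

Crude as it is, an estimate like this is independent of the pitot-static and AoA probes, which is the whole point of a diverse fallback.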
1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps as the aircraft was already in a safe state and waited for the icing induced condition to clear.
2. Although the Airbus Back Up Speed Scale (BUSS) does use angle-of-attack data to provide a speed range and GPS height data to replace barometric altitude, it has problems at high altitude where Mach number rather than speed becomes significant and the stall threshold changes with Mach number (which it does not know). As a result its use is (as per Airbus manuals) restricted to below FL250.
3. What system designers do, in the abstract, is decompose and allocate system level behaviors to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except ‘apparently’ if the component in question is a human and therefore considered to be ‘outside’ your system.
4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).
5. For example in the Airbus design although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second they are not directly available to aircrew.
6. Yet another way of looking at the problem is that the principles of ecological design need to be applied to the aircrew task of dealing with contingency situations.
7. For example in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel, and deselect two ADRs…which ones and the criterion to choose which ones are not however detailed by the manufacturer.
8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.
The Sydney Morning Herald published an article this morning that recounts the QF72 midair accident from the point of view of the crew and passengers; you can find the story at this link. I’ve previously covered the technical aspects of the accident here, the underlying integrative architecture program that brought us to this point here and the consequences here. So it was interesting to reflect on the event from the human perspective. Karl Weick points out in his influential paper on the Mann Gulch fire disaster that small organisations, for example the crew of an airliner, are vulnerable to what he termed a cosmology episode, that is, an episode in which one abruptly and deeply feels that the universe is no longer a rational, orderly system. In the case of QF72 this was initiated by the simultaneous stall and overspeed warnings, followed by the abrupt pitch over of the aircraft as the flight protection laws engaged for no reason.
Weick further posits that what makes such an episode so shattering is that both the sense of what is occurring and the means to rebuild that sense collapse together. In the Mann Gulch blaze the fire team’s organisation attenuated and finally broke down as the situation eroded until at the end they could not comprehend the one action that would have saved their lives, to build an escape fire. In the case of air crew they implicitly rely on the aircraft’s systems to `make sense’ of the situation, a significant failure such as occurred on QF72 denies them both understanding of what is happening and the ability to rebuild that understanding. Weick also noted that in such crises organisations are important as they help people to provide order and meaning in ill defined and uncertain circumstances, which has interesting implications when we look at the automation in the cockpit as another member of the team.
“The plane is not communicating with me. It’s in meltdown. The systems are all vying for attention but they are not telling me anything…It’s high-risk and I don’t know what’s going to happen.”
Capt. Kevin Sullivan (QF72 flight)
From this Weickian viewpoint we see the aircraft’s automation as both part of the situation `what is happening?’ and as a member of the crew, `why is it doing that, can I trust it?’ Thus the crew of QF72 were faced with both a vu jàdé moment and the allied disintegration of the human-machine partnership that could help them make sense of the situation. The challenge that the QF72 crew faced was not to form a decision based on clear data and well rehearsed procedures from the flight manual; instead they faced a much more unnerving loss of meaning as the situation outstripped their past experience.
“Damn-it! We’re going to crash. It can’t be true!” (copilot #1)
“But, what’s happening?” (copilot #2)
AF447 CVR transcript (final words)
Nor was this an isolated incident. One study of other such `unreliable airspeed’ events found that errors in understanding were both far more likely to occur than other error types and, when they did, much more likely to end in a fatal accident. In fact they found that all accidents with a fatal outcome were categorised as involving an error in detection or understanding, with the majority being errors of understanding. From Weick’s perspective then the collapse of sensemaking is the knock out blow in such scenarios, as the last words of the Air France AF447 crew so grimly illustrate. Luckily in the case of QF72 the aircrew were able to contain this collapse and rebuild their sense of the situation; in the case of other such failures, such as AF447, they were not.
Consider the effect that the choice of a single word can have upon the success or failure of a standard. The standard is DO-278A, and the word is ‘approve’. DO-278 is the ground world’s version of the aviation community’s DO-178 software assurance standard, intended to bring the same level of rigour to the software used for navigation and air traffic management. There’s just one tiny difference: while DO-178 uses the word ‘certify’, DO-278 uses the word ‘approve’, and in that one word lies a vast difference in the effectiveness of these two standards.
DO-178C has traditionally been applied in the context of an independent certifier (such as the FAA or JAA) who does just that, certifies that the standard has been applied appropriately and that the design produced meets the standard. Certification is independent of the supplier/customer relationship, which has a number of clear advantages. First the certifying body is indifferent as to whether the applicant meets or does not meet the requirements of DO-178C so has greater credibility when certifying as they are clearly much less likely to suffer from any conflict of interest. Second, because there is one certifying agency there is consistent interpretation of the standard and the fostering and dissemination of corporate knowledge across the industry through advice from the regulator.
Turning to DO-278A we find that the term ‘approve’ has mysteriously (1) replaced the term ‘certify’. So who, you may ask, can approve? In fact what does approve mean? Well, to cut a long answer short, anyone can approve and it means whatever you make of it. What usually results is the standard being invoked as part of a contract between supplier and customer, with the customer then acting as the ‘approver’ of the standard’s application. This has obvious and significant implications for the degree of trust that we can place in the approval given by the customer organisation. Unlike an independent certifying agency the customer clearly has a corporate interest in acquiring the system, which may well conflict with the object of fully complying with the requirements of the standard. Given that ‘approval’ is given on a contract basis between two organisations and often cloaked in non-disclosure agreements, there is also little to no opportunity for the dissemination of useful learnings as to how to meet the standard. Finally when dealing with previously developed software the question becomes not just ‘did you apply the standard?’, but also ‘who was it that actually approved your application?’ and ‘How did they actually interpret the standard?’.
So what to do about it? To my mind the unstated success factor for the original DO-178 standard was in fact the regulatory environment in which it was used. If you want DO-278A to be more than just a paper tiger then you should also put in place a mechanism for independent certification. In these days of smaller government this is unlikely to involve a government regulator, but there’s no reason why (for example) the independent safety assessor concept embodied in IEC 61508 could not be applied with appropriate checks and balances (2). Until that happens though, don’t set too much store by pronouncements of compliance to DO-278.
Final thought, I’m currently renovating our house and have had to employ an independent certifier to sign off on critical parts of the works. Now if I have to do that for a home renovation, I don’t see why some national ANSP shouldn’t have to do it for their bright and shiny toys.
1. Perhaps Screwtape consultants were advising the committee. 🙂
2. One of the problems of how 61508 implements the ISA is that they’re still paid by the customer, which leads in turn to the agency problem. A better scheme would be an industry fund into which all players contribute and from which the ISA agent is paid.
Interesting documentary on SBS about the Germanwings tragedy, if you want a deeper insight see my post on the dirty little secret of civilian aviation. By the way, the two person rule only works if both those people are alive.
Ladies and gentlemen you need to leave, like leave your luggage!
This has been another moment of aircraft evacuation Zen.
Why this bit of wreckage is unlikely to affect the outcome of the MH370 search
If this really is a flaperon from MH370 then it’s good news in a way because we could use wind and current data for the Indian ocean to determine where it might have gone into the water. That in turn could be used to update a probability map of where we think that MH370 went down, by adjusting our priors in the Bayesian search strategy. Thereby ensuring that all the information we have is fruitfully integrated into our search strategy.
Well… perhaps it could, if the ATSB were actually applying a Bayesian search strategy, but apparently they’re not. So the ATSB is unlikely to get the most out of this piece of evidence, and the only real upside that I see to this is that it should shut down most of the conspiracy nut jobs who reckoned MH370 had been spirited away to North Korea or some such. 🙂
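For anyone wondering what ‘adjusting our priors’ means in practice, here is a minimal sketch of the two updates a Bayesian search relies on: folding in a new piece of evidence (such as a drift analysis) and discounting a cell that has been searched without success. The function names, the three-cell map and all of the numbers are hypothetical, chosen only to show the mechanics.

```python
from typing import List


def update_with_evidence(prior: List[float], likelihood: List[float]) -> List[float]:
    """Bayes' rule over search cells: posterior is prior times likelihood, renormalised."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    if total <= 0.0:
        raise ValueError("evidence is incompatible with every cell")
    return [u / total for u in unnormalised]


def update_after_failed_search(prior: List[float], cell: int, p_detect: float) -> List[float]:
    """Searching a cell and finding nothing lowers that cell and raises the rest.

    p_detect is the (assumed) probability of spotting the wreckage if it is there.
    """
    likelihood = [1.0] * len(prior)
    likelihood[cell] = 1.0 - p_detect   # chance of a miss given the wreck is in this cell
    return update_with_evidence(prior, likelihood)


# Hypothetical three-cell probability map.
prior = [0.5, 0.3, 0.2]

# Cell 0 searched with an assumed 80% detection probability, nothing found.
after_search = update_after_failed_search(prior, cell=0, p_detect=0.8)

# Hypothetical drift-model likelihood of the debris find for each cell.
after_drift = update_with_evidence(after_search, [0.1, 0.6, 0.3])
```

Note that an unsuccessful search never drives a cell to zero unless detection is certain, which is why the method keeps every bit of evidence, including the failures, in play.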
In celebration of upgrading the site to WP Premium here’s some gratuitous eye candy 🙂
A little more seriously, PIO is one of those problems that, contrary to what the name might imply, requires one to treat the aircraft and pilot as a single control system.
The problem with people
The HAL effect, named after the eponymous anti-hero of Stanley Kubrick and Arthur C. Clarke’s film 2001, is the tendency for designers to implicitly embed their cultural biases into automation. While such biases are undoubtedly a very uncertain guide it might also be worthwhile to look at the 2001 Odyssey mission from HAL’s perspective for a moment. Here we have the classic long duration space mission with a minimalist two man complement for the cruise phase. The crew and the ship are on their own. In fact they’re about as isolated as it’s possible to be as human beings, and help is a very, very long way away. Now from HAL’s perspective humans are messy, fallible creatures prone to making basic errors in even the most routine of tasks. Not to mention that they annoyingly use emotion to inform even the most basic of decisions. Then there’s the added complication that they’re social creatures apt in even the most well matched of groups to act in ways that a dispassionate external observer could only consider as confusing and often dysfunctional. Finally they break, sometimes in ways that can actively endanger others and the mission itself.
So from a mission assurance perspective would it really be appropriate to rely on a two man crew in the vastness of space? The answer is clearly no, even the most well adjusted of cosmonauts can exhibit psychological problems when isolated in the vastness of space. While a two man crew may be attractive from a cost perspective it’s still vulnerable to a single point of human failure. Or to put it more brutally, murder and suicide are much more likely to be successful in small crews. Such scenarios, however dark they may be, need to be guarded against if we intend to use a small crew. But how to do it? If we add more crew to the cruise phase complement then we also add all the logistics tail that goes along with it, and our mission may become unviable. Even if cost were not a consideration, small groups isolated for long periods are prone to yet other forms of psychological dysfunction (1). Humans it seems exhibit a set of common mode failures that are difficult to deal with, so what to do?
Well, one way to guard against common mode failures is to implement diverse redundancy in the form of a cognitive agent whose intelligence is based on vastly different principles to human affect driven processing. Of course to be effective we’re talking a high end AI, with a sufficient grasp of the theory of mind and the subtleties of human psychology and group dynamics to be able to make usefully accurate predictions of what the crew will do next (2). With that insight goes the requirement for autonomy in vetoing illogical and patently hazardous crew actions, e.g. “I’m sorry Dave but I’m afraid I can’t let you override the safety interlocks on the reactor fuel feed…“. From that perspective we might have some sympathy for HAL’s reaction to his other crew mates plotting his cybernetic demise.
Which may all seem a little far fetched; after all, an AI of that sophistication is another twenty to thirty years away, and long duration deep space missions are probably that far away as well. On the other hand there’s currently a quiet conversation going on in the aviation industry about the next step for automation in the cockpit, e.g. one pilot in the cockpit of large airliners. After all, so the argument goes, pilots are expensive beasts and with the degree of automated support available today, surely we don’t need two men in the cockpit? Well, if we’re thinking purely about economics then sure one could make that argument, but on the other hand as the awful reality of the Germanwings tragedy sinks in we also need to understand that people are simply not perfect, and that sometimes (very rarely (3)) they can fail catastrophically. Given that we know that reducing crew levels down to two increases the risk of successful suicide by airliner one could ask what happens to the risk if we go to single pilot operations? I think we all know what the answer to that would be.
Where is a competent AI (HAL 9000) when you need one? 🙂
1. From polar exploration we know that small exploratory teams of three persons are socially unstable and should be avoided. Which then drives the team size to four.
2. As an aside, the inability of HAL to understand the basics of human motivation always struck me as a false note in Kubrick’s 2001 movie. An AI as smart as HAL apparently was, and yet lacking even an undergraduate understanding of human psychology? Maybe not.
3. Remember that we are in the tail of the aviation safety program where we are trying to mitigate hazards whose likelihoods are very, very rare. However given that they aren’t mitigated they dominate the residual statistic.
Amidst all the soul searching, and pontificating about how to deal with the problem of pilots ‘suiciding by airliner’, you are unlikely to find any real consideration of how we have arrived at this pass. There is as it turns out a simple one word answer, and that word is efficiency. Back when airliners first started to fly they needed a big aircrew, for example on the Comet airliner you’d find a pilot, copilot, navigator and flight engineer. Now while that’s a lot of manpower to pay for it did possess one hidden advantage, and that was that with a crew size of three or more it’s very, very difficult (OK effectively impossible) for any one member of the flight crew to attempt to commit suicide. If you think I exaggerate then go see if there has ever been a successful suicide by airliner where there were three or more aircrew in the cockpit. Nope, none. But the aviation industry is one driven by cost. Each new generation of aircraft needs to be cheaper to operate, which means that the airlines and airline manufacturers are locked in a ruthless evolutionary arms race to do more with less. One of the easiest ways to reduce operating costs is to reduce the number of aircrew needed to fly the big jets. Fewer aircrew, greater automation is an equation that delivers more efficient operations. And before you the traveller get too judgemental about all this just remember that the demand for cost reduction is in turn driven by our expectation as consumers that airlines can provide cheap mass airfare for the common man.
So we’ve seen the number of aircrew slowly reduce over the years, first the navigator went and then the flight engineer until we finally arrived at our current standard two man flight crew. There’s just one small problem with this, if one of those pilots wants to dispose of the other there’s not a whole lot that can be done to prevent it. In our relentless pursuit of efficiency we have inadvertently eliminated a safety margin that we didn’t even realise was there. So what can we really do about it? Well the simple ‘we know it works’ answer is to go back to three crew in the cockpit, which effectively eliminates the hazard, of course that’s also a solution that’s unlikely to be taken up. In the absence of going back to three man crews well, we get what we’re currently getting, aspirational statements about better management of stress and depression in aircrew, or the use of cabin crew to enforce no go alone rules. But when that cockpit door is closed it’s still one on one, and all such measures do in the final analysis is reduce the likelihood of the hazard, by some hard to quantify amount, they don’t eliminate it. As long as we fly two man crews behind armoured doors unfortunately the possibility and therefore the hazard remains.
Happy flying 🙂
The Germanwings A320 crash
At this stage there’s not much more that can be said about the particulars of this tragedy that has claimed 150 lives in a mountainous corner of France. Disturbingly again we have an A320 aircraft descending rapidly and apparently out of control, without the crew having any time to issue a distress call. Yet more disturbing is the thought that the crash might be due to the crew failing to carry out the workaround for two blocked AoA probes promulgated in this Emergency Airworthiness Directive (EAD) that was issued in December of last year. And, as the final and rather unpleasant icing on this particular cake, there is the followup question as to whether the problem covered by the directive might also have been a causal factor in the AirAsia flight 8501 crash. That, if it be the case, would be very, very nasty indeed.
Unfortunately at this stage the answer to all of the above questions is that no one knows, especially as the Indonesian investigators have declined to issue any further information on the causes of the AirAsia crash. However what we can be sure of is that given the highly dependable nature of aircraft systems the answer when it comes will comprise an apparently unlikely combination of events, actions and circumstance, because that is the nature of accidents that occur in high dependability systems. One thing that’s also for sure, there’ll be little sleep in Toulouse until the FDRs are recovered, and maybe not much after that….
If having read the EAD you’re left wondering why it directed that two ADRs be turned off, it’s simply that by doing so you push the aircraft out of what’s called Normal law, where Alpha protection is trying to drive the nose down, into Alternate law, where the (erroneous) Alpha protection is removed. Of course in order to do so you need to be able to recognise, diagnose and apply the correct action, which also generally requires training.
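The logic of that workaround can be caricatured in a few lines. To be clear, this is a toy model built only from the description above, not the actual Airbus reconfiguration logic: the threshold of two available ADRs, the enum and the function names are all assumptions made for illustration.

```python
from enum import Enum
from typing import Sequence


class ControlLaw(Enum):
    NORMAL = "normal"
    ALTERNATE = "alternate"


# Assumption drawn from the workaround described above: with two of the
# three ADRs switched off the aircraft drops out of Normal law.
MIN_ADRS_FOR_NORMAL_LAW = 2


def active_law(adr_selected_on: Sequence[bool]) -> ControlLaw:
    """Toy model: Normal law needs at least two ADRs available."""
    available = sum(1 for on in adr_selected_on if on)
    if available >= MIN_ADRS_FOR_NORMAL_LAW:
        return ControlLaw.NORMAL
    return ControlLaw.ALTERNATE


def alpha_protection_active(adr_selected_on: Sequence[bool]) -> bool:
    """In this toy model Alpha protection exists only in Normal law."""
    return active_law(adr_selected_on) is ControlLaw.NORMAL


# All three ADRs on: Normal law, so an erroneous Alpha protection can still act.
before = alpha_protection_active([True, True, True])

# Crew deselects two ADRs: Alternate law, the protection is removed.
after = alpha_protection_active([True, False, False])
```

What the sketch cannot capture is the hard part, namely the crew recognising that the protection itself is the problem in the first place.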
Bayes and the search for MH370
We are now approximately 60% of the way through searching the MH370 search area, and so far nothing. Which is unfortunate because as the search goes on the cost continues to go up for the taxpayer (and yes I am one of those). What’s more unfortunate, and not a little annoying, is that through all this the ATSB continues to stonily ignore the use of a powerful search technique that’s been used to find everything from lost nuclear submarines to the wreckage of passenger aircraft.
Here’s an interesting graph that compares Class A mishap rates for USN manned aviation (pretty much from float plane to Super-Hornet) against the USAF’s drone programs. Interesting that both programs steadily track down decade by decade, even in the absence of formal system safety programs for most of the time (1).
The USAF drone program starts out at around 60 mishaps per 100,000 flight hours (equivalent to the USN transitioning to fast jets at the close of the 1940s) and maintains a steeper rate of decrease than the USN aviation program. As a result, while the USAF drone program is tail chasing the USN, it still looks like it’ll hit parity with the USN sometime in the 2040s.
So why is the USAF drone program doing better in pulling down the accident rate, even when they don’t have a formal MIL-STD-882 safety program?
Well for one, a higher degree of automation does have comparative advantages. Although the USN’s carrier aircraft can do auto-land, they generally choose not to, as pilots need to keep their professional skills up, and human error during landing/takeoff inevitably drives the mishap rate up. Therefore a simple thing like implementing an auto-land function for drones (landing a drone is as it turns out not easy) has a comparatively greater bang for your safety buck. There are also inherently higher risks of loss of control and mid air collision when air combat manoeuvring, or of running into things when flying helicopters at low level, which are operational hazards that drones generally don’t have to worry about.
For another, the development cycle for drones tends to be quicker than manned aviation, and drones have a ‘somewhat’ looser certification regime, so improvements from the next generation of drone design tend to roll into an expanding operational fleet more quickly. Having a higher cycle rate also helps retain and sustain the corporate memory of the design teams.
Finally there’s the lessons learned effect. With drones the hazards usually don’t need to be identified and then characterised. In contrast with the early days of jet age naval aviation the hazards drones face are usually well understood with well understood solutions, and whether these are addressed effectively has more to do with programmatic cost concerns than a lack of understanding. Conversely when it actually comes time to do something like put de-icing onto a drone, there’s a whole lot of experience that can be brought to bear with a very good chance of first time success.
A final question. Looking at the above do we think that the application of rigorous ‘FAA like’ processes or standards like ARP 4761, ARP 4754 and DO-178 would really improve matters?
Hmmm… maybe not a lot.
1. As a historical note while the F-14 program had the first USN aircraft system safety program (it was a small scale contractor in house effort) it was actually the F/A-18 which had the first customer mandated and funded system safety program per MIL-STD-882. USAF drone programs have not had formal system safety programs, as far as I’m aware.
How an invention that flew on the SR-71 could help commercial aviation today
In a previous post on unusual attitude I talked about the use of pitch ladders as a means of providing greater attensity to aircraft attitude as well as a better indication of what the aircraft is doing, having entered into it. There are, of course, still disadvantages to this because such data in a commercial aircraft is usually presented ‘eyes down’, and in high stress, high workload situations it can be difficult to maintain an instrument scan pattern. There is however an alternative, and one that has a number of allied advantages.
Unreliable airspeed events pose a significant challenge (and safety risk) because such situations throw onto aircrew the most difficult (and error prone) of human cognitive tasks, that of ‘understanding’ a novel situation. This results in a double whammy for unreliable airspeed incidents. That is the likelihood of an error in ‘understanding’ is far greater than any other error type, and having made that sort of error it’s highly likely that it’s going to be a fatal one. Continue Reading…
Stall warning and Alternate law
This post is part of the Airbus aircraft family and system safety thread.
According to an investigator from Indonesia’s National Transportation Safety Committee (NTSC) several alarms, including the stall warning, could be heard going off on the Cockpit Voice Recorder’s tape.
Now why is that so significant?
Aviation is in itself not inherently dangerous. But to an even greater degree than the sea, it is terribly unforgiving of any carelessness, incapacity or neglect.
This post is part of the Airbus aircraft family and system safety thread.
While there’s often a lot of discussion about short term response of aircraft to control inputs, in practice it’s often the long term response of the aircraft state vector at constant thrust and neutral control inputs that’s just as important to flight control system designers. In the case of Airbus the selection by the designers of a modified C* feedback loop (1) for primary pitch axis control law (Airbus 1998) in flight has led to what you’d call interesting consequences.
So what did happen?
This post is part of the Airbus aircraft family and system safety thread.
While the media ‘knows’ that the aircraft climbed steeply before rapidly descending, we should remember that this supposition relies on the self reported altitude and speed of the aircraft. So we should be cautious about presuming that what we see on a radar screen is actually what happened to the aircraft. There are of course also disturbing similarities to the circumstances in which Air France AF447 was lost, yet at this moment similarities are all they are. One thing’s for sure though, there’ll be little sleep in Toulouse until the FDRs are recovered.
The Dreamliner and the Network
Big complicated technologies are rarely (perhaps never) developed by one organisation. Instead they’re a patchwork quilt of individual systems which are developed by domain experts, with the whole being stitched together by a single authority/agency. This practice is nothing new. It’s been around since the earliest days of the cybernetic era, and it’s a classic tool that organisations and engineers use to deal with industrial scale design tasks (1). But what is different is that we no longer design systems, and systems of systems, as loose federations of entities. We now think of and design our systems as networks, and thus our systems of systems have become ‘networks of networks’ that exhibit much greater degrees of interdependence.
The NTSB have released their final report on the Boeing 787 Dreamliner Li-Ion battery fires. The report makes interesting reading, but for me the most telling point is summarised in conclusion seven, which I quote below.
Conclusion 7. Boeing’s electrical power system safety assessment did not consider the most severe effects of a cell internal short circuit and include requirements to mitigate related risks, and the review of the assessment by Boeing authorized representatives and Federal Aviation Administration certification engineers did not reveal this deficiency.
NTSB/AIR-14/01 (p78)
In other words Boeing got themselves into a position with their safety assessment where their ‘assumed worst case’ was much less severe than the reality. This failure to imagine the worst ensured that when they aggressively weight optimised the battery design instead of thermally optimising it, the risks they were unwittingly running were very much higher than they believed.
The first principle is that you must not fool yourself, and that you are the easiest person to fool
Richard P. Feynman
I’m also thinking that the behaviour of Boeing is consistent with what McDermid et al. call probative blindness. That is, the safety activities that were conducted were intended to comply with regulatory requirements rather than actually determine what hazards existed and their risk.
… there is a high level of corporate confidence in the safety of the [Nimrod aircraft]. However, the lack of structured evidence to support this confidence clearly requires rectifying, in order to meet forthcoming legislation and to achieve compliance.
Nimrod Safety Management Plan 2002 (1)
As the quote from the Nimrod program deftly illustrates, often (2) safety analyses are conducted simply to confirm what we already ‘know’: that the system is safe. They are non-probative, if you will. In these circumstances the objective is compliance with the regulations rather than generating evidence that our system is unsafe. Doing more or better safety analysis is then unlikely to prevent an accident, because the evidence will not cause beliefs to change. Belief, it seems, is a powerful thing.
The Boeing battery saga also illustrates how much regulators like the FAA actually rely on the technical competence of those being regulated, and how fragile that regulatory relationship is when it comes to dealing with the safety of emerging technologies.
1. As quoted in Probative Blindness: How Safety Activity can fail to Update Beliefs about Safety, A J Rae*, J A McDermid, R D Alexander, M Nicholson (IET SSCS Conference 2014).
2. Actually in aerospace I’d assert that it’s normal practice to carry out hazard analyses simply to comply with a regulatory requirement. As far as the organisation commissioning them is concerned the results are going to tell them what they know already, that the system is safe.
Here’s a short tutorial I put together (in a bit of a rush) about the ‘mechanics’ of producing compliance findings as part of the ADF’s Airworthiness Regime. Hopefully this will be of assistance to anyone faced with the task of making compliance findings, managing the compliance finding process or dealing with the ADF airworthiness certification ‘beast’.
The tutorial is a mix of how to think about and judge evidence, drawing upon legal principles, and how to use practical argumentation models to structure the finding. No Dempster Shafer logic yet, perhaps in the next tutorial.
Anyway, hope you enjoy it. 🙂
When is an interlock not an interlock?
I was working on an interface problem the other day. The problem related to how to judge when a payload (attached to a carrier bus) had left the parent (much like the Huygens lander leaving the Cassini spacecraft above). Now I could use what’s called the ‘interlock interface’, which is a discrete ‘loop back’ that runs through the bus to payload connector then turns around and heads back into the bus again. The interlock interface is there to provide a means for the carrier’s avionics to determine if the payload is electrically mated to the bus. So should I use this as an indication that the payload has left the carrier bus as well? Well maybe, maybe not.
TCAS, emergent properties and risk trade-offs
There’s been some comment from various regulators regarding the use of Traffic Collision Avoidance System (TCAS) on the ground, as experience shows that TCAS is sometimes turned on and off at the same time as the Mode S transponder. Eurocontrol doesn’t like it and is quite explicit about their dislike, ‘do not use it while taxiing’ they say, and likewise the FAA states that you should ‘minimise use on ground’. There are legitimate reasons for this dislike: having too many TCAS transponders operating within a specific area can degrade system performance as well as potentially interfere with airport ground radars. And as the FAA point out, operating with the ADS-B transponder on will also ensure that the aircraft is visible to ATC and other ADS-B (in) equipped aircraft (1). Which leaves us with the question, why are aircrew using TCAS on the ground? Is it because it’s just easy enough to turn on at the push back? Or is there another reason?
When good voting algorithms go bad
Thinking about the QF72 incident, it struck me that average value based voting methods are based on the calculation of a population statistic. Now population statistics work well when the population is normally distributed, or otherwise clustered around some value. But if the distribution has heavy tails, we can expect that extreme values will occur fairly regularly and therefore the ‘average’ value means much less. In fact for some distributions we may not be able to put a cap on the upper value that an ‘average’ could take, e.g. it could be infinite, and the idea of an average is therefore meaningless.
Just saw a sound bite of our Prime Minister reiterating that we’ll spare no expense to find MH370. Throwing money is one thing, but I’m kind of hoping that the ATSB will pull its finger out of its bureaucratic ass and actually apply the best search methods to the search. Unkind? Perhaps, but then maybe the families of the lost deserve the best that we can do…
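To make the heavy tails point concrete, here is a minimal sketch comparing the sample mean and sample median of a normal population with those of a Cauchy population, a standard textbook example of a distribution with no defined mean. The function names, the seed and the sample sizes are my own choices for illustration and have nothing to do with any actual voting algorithm.

```python
import math
import random
import statistics


def cauchy_sample(rng):
    """Draw one standard Cauchy variate using the inverse CDF method."""
    return math.tan(math.pi * (rng.random() - 0.5))


def summarise(samples):
    """Return (mean, median) for a list of samples."""
    return statistics.fmean(samples), statistics.median(samples)


def compare(n, seed=1):
    """Summarise n normal samples and n Cauchy samples from one seeded generator."""
    rng = random.Random(seed)
    normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
    heavy = [cauchy_sample(rng) for _ in range(n)]
    return summarise(normal), summarise(heavy)


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        (n_mean, n_med), (c_mean, c_med) = compare(n)
        print(f"n={n:>7}  normal mean={n_mean:+.3f} median={n_med:+.3f}  "
              f"cauchy mean={c_mean:+.3f} median={c_med:+.3f}")
```

With the normal samples both statistics should settle towards zero as n grows. With the Cauchy samples the median should settle as well, but the mean can be dragged anywhere by a single extreme draw, however many samples are taken.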
Finding MH370 is going to be a bitch
The aircraft has gone down in an area which is the undersea equivalent of the eastern slopes of the Rockies, well before anyone mapped them. Add to that a search area of thousands of square kilometres in about as isolated a spot as you can imagine, and a search zone interpolated from satellite pings, and you can see that it’s going to be tough.
MH370 and privileging hypotheses
The further away we’ve moved from whatever event initiated the disappearance of MH370, the less entanglement there is between circumstances and the event, and thus the more difficult it is to make legitimate inferences about what happened. In essence the signal-to-noise ratio decreases exponentially as the causal distance from the event increases. Thus the best evidence is that which is intimately entwined with what was going on onboard MH370, and of lesser importance is evidence obtained at greater distances in time or space.
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
If anything teaches us that the modern media is for the most part bat-shit crazy, the continuing whirlwind of speculation does so. Even the usually staid Wall Street Journal has got into the act with speculative reports that MH370 may have flown on for hours under the control of person or persons unknown… sigh.
After the disappearance of MH370 without trace, I’d point out, again, that just as in the case of the AF447 disaster, had either floating black boxes or even just a cheap and cheerful locator buoy been fitted we would at least have something to work with (1). But apparently this is simply not a priority with the FAA or JAA. I’d note that ships have traditionally been fitted with barometrically released beacon transmitters, thereby ensuring their release from a sinking ship.
Undoubtedly we’ll go through the same regulatory minuet of looking at design concepts provided by one or more of the major equipment suppliers whose designs will, no surprise, also be complex, expensive and painful to retrofit thereby giving the regulator the perfect out to shelve the issue. At least until the next aircraft disappears. Let’s chalk it up as another great example of regulatory blindness, which I’m afraid is cold comfort to the relatives of those onboard MH370.
1. Depending on the jurisdiction, modern airliners do carry different types and numbers of Emergency Locator Transmitter (ELT) beacons. These are either fixed to the airframe or need to be deployed by the crew, meaning that in anything other than a perfect crash landing at sea they end up on the bottom with the aircraft. Sonar pingers attached to the ‘black box’ flight data and cockpit voice recorders can provide an underwater signal, but their range is limited to about a thousand metres slant range or so.
Occasional readers of this blog might have noticed my preoccupation with unreliable airspeed and the human factors and system design issues that attend it. So it was with some interest that I read the recent paper by Sathy Silva of MIT and Roger Nicholson of Boeing on aviation accidents involving unreliable airspeed.
No, not the alternative name for this blog. 🙂
I’ve just given the post Pitch ladders and unusual attitude a solid rewrite adding some new material and looking a little more deeply at some of the underlying safety myths.
But, we tested it? Didn’t we?
Earlier reports of the Boeing 787 lithium battery initial development indicated that Boeing engineers had conducted tests to confirm that a single cell failure would not lead to a cascading thermal runaway amongst the remaining batteries. According to these reports their tests were successful, so what went wrong?
Boeing’s Dreamliner program runs into trouble with lithium ion batteries
Lithium batteries’ performance in providing lightweight, low volume power storage has made them a ubiquitous part of modern consumer life. And high power density also makes them attractive in applications, such as aerospace, where weight and space are at a premium. Unfortunately lithium batteries are also very unforgiving if operated outside their safe operating envelope and can fail in a spectacularly energetic fashion called a thermal runaway (1), as occurred in the recent JAL and ANA 787 incidents.
Reading Capt. Richard de Crespigny’s account of the QF32 emergency I noted with interest his surprise in the final approach when the aircraft stall warnings sounded, although the same alarms had been silent when the landing had been ‘dry run’ at 4000 feet (p261 of QF32).
Just finished updating my post on Lessons from QF 32 with more information from Capt. Richard De Crespigny’s account of the event (which I recommend). His account of the failures experienced provides a system level perspective of the loss of aircraft functions, that augments the preceding component and ECAM data.
This post is part of the Airbus aircraft family and system safety thread.
The following is an extract from Kevin Driscoll’s Murphy Was an Optimist presentation at SAFECOMP 2010. Here Kevin does the maths to show how a lack of exposure to failures over a small sample size of operating hours leads to a normalcy bias amongst designers and a rejection of proposed failure modes as ‘not credible’. The reason I find it of especial interest is that it gives, at least in part, an empirical argument as to why designers find it difficult to anticipate the system accidents of Charles Perrow’s Normal Accident Theory. Kevin’s argument also supports John Downer’s (2010) concept of Epistemic accidents. John defines epistemic accidents as those that occur because of an erroneous technological assumption, even though there were good reasons to hold that assumption before the accident. Kevin’s argument illustrates that engineers as technological actors must make decisions in which their knowledge is inherently limited and so their design choices will exhibit bounded rationality.
In effect the higher the dependability of a system, the greater the mismatch between designer experience and system operational hours, and therefore the tighter the bounds on the rationality of design choices and their underpinning assumptions. The tighter the bounds, the greater the effect that cognitive biases will have, such as falling prey to the Normalcy Bias. Of course there are other reasons for such bounded rationality, see Logic, Mathematics and Science are Not Enough for a discussion of these.
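The mismatch between designer experience and fleet exposure can be put into rough numbers. What follows is my own back-of-the-envelope sketch rather than Kevin’s actual working: it assumes a constant failure rate (an exponential model), an illustrative rate of one failure per billion hours, and invented figures for how many system operating hours a designer might personally witness compared with a whole fleet over its life.

```python
import math


def prob_at_least_one_failure(rate_per_hour, exposure_hours):
    """P(at least one failure) over an exposure, assuming a constant failure rate."""
    # 1 - exp(-rate * hours), computed with expm1 to stay accurate for tiny values.
    return -math.expm1(-rate_per_hour * exposure_hours)


# All three figures below are assumptions chosen purely for illustration.
ASSUMED_RATE = 1e-9            # failures per operating hour
ASSUMED_DESIGNER_HOURS = 1e5   # operating hours one designer might observe in a career
ASSUMED_FLEET_HOURS = 1e9      # operating hours accumulated by a large fleet

if __name__ == "__main__":
    designer = prob_at_least_one_failure(ASSUMED_RATE, ASSUMED_DESIGNER_HOURS)
    fleet = prob_at_least_one_failure(ASSUMED_RATE, ASSUMED_FLEET_HOURS)
    print(f"Chance the designer ever sees the failure: {designer:.4%}")
    print(f"Chance the fleet sees the failure:         {fleet:.1%}")
```

Under these assumed numbers the designer has roughly a one in ten thousand chance of ever seeing the failure, while the fleet is more likely than not to experience it. That gap is what makes ‘not credible’ such an easy judgement to reach.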
One of the questions that we should ask whenever an accident occurs is whether we could have identified the causes during design? And if we didn’t, is there a flaw in our safety process?
This post is part of the Airbus aircraft family and system safety thread.
I’m currently reading Richard de Crespigny’s book on flight QF 32. In it he writes that he felt at one point that he was being overwhelmed by the number and complexity of ECAM messages. At that moment he recalled a quote from Gene Kranz, NASA’s flight director, of Apollo 13 fame, “Hold it Gentlemen, Hold it! I don’t care about what went wrong. I need to know what is still working on that space craft.”.
The crew of QF32 are not alone in experiencing the overwhelming flood of data that a modern control system can produce in a crisis situation. Their experience is similar to that of the operators of the Three Mile Island nuclear plant who faced a daunting 100+ near simultaneous alarms, or more recently the experiences of QF 72.
The take home point for designers is that, if you’ve carefully constructed a fault monitoring and management system, you also need to consider the situation where the damage to the system is so severe that the needs of the operator invert and they need to know ‘what they’ve still got’, rather than what they don’t have.
The term ‘never give up design strategy’ is bandied around in the fault tolerance community, and the above lesson should form at least a part of any such strategy.
In an earlier post I commented that in the QF72 incident the use of a geometric mean (1) instead of the arithmetic mean when calculating the aircraft’s angle of attack would have reduced the severity of the subsequent pitch over. Which leads into the more general subject of what to do when the real world departs from our assumption about the statistical ‘well formedness’ of data. The problem, in the case of measuring angle of attack on commercial aircraft, is that the left and right alpha sensors are not truly independent measures of the same parameter (2). With sideslip we cannot directly obtain a true angle of attack (AoA) from any single sensor (3), so we need to take the average (mean) of the measured AoA on either side of the fuselage (Gracey 1958) to determine the true AoA. Because of this variance between left and right we cannot use a median voting approach, as we can expect the two sensors on the right side to differ from the one sensor on the left. As a result we end up having to use the mean of two sensor values (one from each side) as an estimate of the resultant central tendency.
I’ve recently been reading John Downer on what he terms the Myth of Mechanical Objectivity. To summarise John’s argument, he points out that once the risk of an extreme event has been ‘formally’ assessed as being so low as to be acceptable, it becomes very hard for society and its institutions to justify preparing for it (Downer 2011).
Airbus’s side stick improves crew comfort and control, but is there a hidden cost?
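As a rough illustration of why the choice of mean matters, the sketch below averages one plausible AoA reading with one spurious spike using both methods. The sensor values are made up for the example and are not the recorded QF72 data, and the function name is my own.

```python
import statistics


def fuse_aoa(left_deg, right_deg):
    """Return (arithmetic mean, geometric mean) of two AoA readings in degrees.

    The geometric mean is only defined for positive values, so this sketch
    assumes both readings are greater than zero. A real implementation would
    need to deal with zero and negative angles of attack.
    """
    readings = [left_deg, right_deg]
    return statistics.fmean(readings), statistics.geometric_mean(readings)


if __name__ == "__main__":
    healthy, spike = 2.0, 50.0  # illustrative values only
    arithmetic, geometric = fuse_aoa(healthy, spike)
    print(f"arithmetic mean: {arithmetic:.1f} deg")
    print(f"geometric mean:  {geometric:.1f} deg")
```

With these made-up readings the arithmetic mean lands at 26 degrees while the geometric mean is about 10 degrees. The spike still corrupts the estimate, but it is pulled far less towards the outlier.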
This post is part of the Airbus aircraft family and system safety thread.
The Airbus FBW side stick flight control has vastly improved the comfort of aircrew flying the Airbus fleet, much as the original Airbus designers predicted (Corps 1988). But the implementation also expresses the Airbus approach to flight control laws and that company’s implicit assumptions about the way in which humans interact with automation and each other. Here the record is more problematic.