Why the loss of Air France AF 447 is a wake-up call that our technological reach is again exceeding our grasp
So far as we know, flight AF 447 fell out of the sky with its systems performing as their designers had specified, if not exactly as they expected, right up to the point that it impacted the surface of the ocean.
So how is it possible that incorrect air data could simultaneously cause upsets in aircraft functions as disparate as engine auto-thrust management, flight control laws and traffic collision avoidance in a modern aircraft built to accepted international safety standards and containing multiple redundant systems?
Author's Note – 26 Jul 2011: I've revised this post to improve its general readability. My conclusion remains that this is fundamentally a system accident.
Author's Note – 1 Aug 2012: In this and other posts I've been somewhat loosely using the term 'epistemic' to denote both epistemic and ontological uncertainty. So I've revised this post to more clearly identify ontological as well as epistemic uncertainty.
In the beginning was single point of failure
A fundamental goal and regulatory requirement of aviation safety is to prevent the loss of an aircraft in the event of a single point of failure (FAA AC1309-1A) (1).
Single-failure defence goes back at least to the 1950s, and perhaps earlier, when aircraft systems were built as stand-alone or loosely federated architectures and our primary concern was the physical failure of individual components.
But the world has moved on: today's passenger aircraft are an order of magnitude more complex than those of the 1950s, even if they perform the same mission. This complexity is driven by the commercial imperatives of the airline industry. That is, to do more with less and do it cheaper.
In turn this has driven greater integration of systems and a transfer to digital, software-intensive architectures to achieve more robust, efficient and flexible services. So the architecture of the airliner has evolved to become an integrated suite of systems.
But unfortunately the fundamental safety paradigm is still the single point of failure, as developed for a 1950s architecture, and the cracks are starting to show.
Looking past the obvious
This increasing integration also brings with it the hazard of unintended interactions, and that's why it's important to look past the immediate issue of the icing of the Thales pitot tubes to the underlying vulnerabilities of the design (2).
These vulnerabilities exist on many levels but can be summarised as highly optimised tolerance. Modern aircraft are built to be robust to single points of failure (3) or specified environmental regimes but remain vulnerable to unanticipated (rare) events. This local optimisation can actually work against safety when a rare or unanticipated failure occurs, because it is normally achieved through greater integration of system components. Unfortunately a higher level of integration also increases the potential for single failures to cascade through the system.
Coupling to the operational environment
One of the fundamental operational imperatives of the airline industry is to minimise cost, and one way to do this is to fly at high altitude to reduce drag. But operating in this region reduces the margin between overspeed and stall. Here small variations in angle of attack or speed can cause an aircraft to depart from normal flight. To meet this imperative safely the Airbus flight control laws are implemented to provide complete protection of the flight envelope. This minimises the workload on aircrew and enhances safety, important when routinely flying at high altitude near the Q or 'coffin corner' (4).
An allied operational constraint derived from this imperative is to minimise the distance taken to detour around storms. But flying closer also means an increased likelihood of flying into areas prone to combined icing and turbulence (5).
So there is constant operational pressure to operate the system closer to the edge of theoretical performance with smaller margins, and to achieve this safely a complex automated system is required.
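The narrowing of that high-altitude margin can be illustrated with a back-of-the-envelope calculation. The stall speed (an assumed 180 kt equivalent airspeed) and Mach limit (0.86) below are illustrative figures, not A330 data, and a standard ISA troposphere is assumed:

```python
# Back-of-the-envelope sketch, ISA troposphere only: how the gap between
# stall speed and the Mach limit narrows with altitude. Stall EAS (180 kt)
# and the Mach limit (0.86) are illustrative values, not A330 data.
import math

T0, L, G, R = 288.15, 0.0065, 9.80665, 287.053   # ISA constants (SI units)
A0_KT = 661.5                                     # sea-level speed of sound, knots

def speed_margin_kt(alt_ft, stall_eas_kt=180.0, mach_limit=0.86):
    """True-airspeed margin (knots) between the Mach limit and the stall speed."""
    alt_m = alt_ft * 0.3048
    t_ratio = (T0 - L * alt_m) / T0                      # temperature ratio
    sigma = t_ratio ** (G / (L * R) - 1.0)               # density ratio
    stall_tas = stall_eas_kt / math.sqrt(sigma)          # stall TAS rises with altitude
    mach_tas = mach_limit * A0_KT * math.sqrt(t_ratio)   # Mach-limit TAS falls
    return mach_tas - stall_tas

for alt in (0, 20000, 35000):
    print(f"FL{alt // 100:03d}: margin ≈ {speed_margin_kt(alt):.0f} kt")
```

With these assumed figures the margin shrinks from roughly 390 kt at sea level to under 180 kt at FL350: the same perturbation that is trivial low down consumes a much larger fraction of the available envelope at cruise altitude.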
Median voting vulnerability
In the case of AF 447, ECAM messages and now FDR data indicate that a series of cascading failures was triggered by the common failure of the three pitot tube sensors when exposed to icing conditions exceeding those against which the sensor was certified as fit for use (6).
Now the basic fault tolerance of the pitot sensing system implemented by Airbus on the A330 fleet is a three-channel homogeneous median voting scheme. This scheme is quite capable of dealing with a single-channel sensor failure, i.e. one where a failed sensor exceeds a defined threshold distance from the median value and is 'voted out' as a consequence.
If only a single sensor had failed the median voting logic would have been able to deal with the failure; however, three failed sensors exceeded the fault hypothesis. Further, with two or more failed sensors the median voting logic can also generate extreme air data values.
Naturally, as dealing with an N>1 fault hypothesis did not form part of the certification basis for the aircraft, it was not considered in the fault tolerance design, nor would the median voting algorithm's sensitivity to N>1 failures have been considered. Instead the approach adopted by Airbus to handle this violation of the fault hypothesis was to declare an unreliable air data event, transition to alternate law and pass responsibility back to the aircrew to diagnose the problem and respond accordingly.
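A minimal sketch of such a voter (illustrative only, not the Airbus implementation) shows both behaviours: a single failed channel is masked and voted out, while a common-mode failure of two channels drags the median itself onto a failed value.

```python
# Illustrative three-channel median voter with a simple disagreement
# threshold. Shows how one failed sensor is masked, but a common-mode
# failure of two sensors corrupts the voted output itself.

def median_vote(channels, threshold):
    """Return the median of three readings and the channels voted out."""
    median = sorted(channels)[1]  # middle value of three
    voted_out = [c for c in channels if abs(c - median) > threshold]
    return median, voted_out

# Nominal: all three pitot-derived speeds agree (values in knots).
print(median_vote([272.0, 271.5, 272.5], threshold=10.0))  # → (272.0, [])

# Single failure: one iced probe reads low; the median masks it and the
# bad channel is voted out.
print(median_vote([272.0, 150.0, 272.5], threshold=10.0))  # → (272.0, [150.0])

# Common-mode failure: two iced probes read low. The median is now a
# failed value, and the one *good* channel is the one voted out.
print(median_vote([272.0, 150.0, 148.0], threshold=10.0))  # → (150.0, [272.0])
```

The last case is the crux: the voter has no way of knowing that the two agreeing channels are the failed ones, which is why homogeneous redundancy offers no protection against a common-mode fault.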
Optimising data = coupled vulnerability
Within the air data system feedback loops had also been introduced to provide optimal mach corrections to other air parameters (such as altitude). But this also meant that false pitot data caused the propagation of unreliable air data into these parameters. Outside the air data system the air data errors were propagated to other systems ranging from the Traffic Collision Avoidance System (TCAS) to flight controls and thrust management systems.
One of the most concerning consequential faults is the potential reduction in engine N1 speed which, when combined with a reversion from normal to alternate flight control law, could lead to the engines entering a sustained low thrust state. Thus pitot icing translates into a series of across-the-board anomalies in aircraft systems, making it much more difficult for the crew to evaluate the situation.
The human factor
To further add to the crew's workload, crew displays are not designed to cope with the rapid occurrence of multiple failures (see the QF 72 accident as an example of event overload), or with the diagnosis of system anomalies such as unreliable airspeed. Again this is because the crew interface is optimised for routine operations and expected (single) failures, rather than multiple-failure events or ambiguous situations.
In response to a declared unreliable airspeed event, the published Airbus procedures required the aircrew, having identified an unreliable airspeed/ADR disagreement, to disconnect the autopilot and auto-thrust, then fly pitch and thrust based initially on memory items (if the aircraft's immediate safety was at risk) and subsequently using QRH lookup tables if the two remaining ADRs disagreed or gave erroneous values.
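The drill just described can be sketched as a simple decision flow. This is a paraphrase of the text above only; the actual QRH procedure contains more branches, conditions and values.

```python
# A simplified rendering of the published unreliable-airspeed drill as a
# decision flow. Steps and ordering are paraphrased from the description
# above; the real QRH procedure has more branches and specific values.

def unreliable_airspeed_drill(immediate_safety_at_risk, remaining_adrs_disagree):
    """Return the ordered crew actions for a declared unreliable airspeed event."""
    actions = ["disconnect autopilot", "disconnect auto-thrust"]
    if immediate_safety_at_risk:
        actions.append("fly memorised pitch and thrust settings")
    if remaining_adrs_disagree:
        actions.append("set pitch and thrust from QRH lookup tables")
    return actions

# Worst case: immediate safety at risk and the remaining ADRs still disagree.
print(unreliable_airspeed_drill(True, True))
```

Even in this stripped-down form the drill presupposes a correct initial diagnosis, which, as the incident record below shows, is exactly the step that crews found hardest.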
In the real world the BEA (BEA 2009) found that these pitot icing initiated events were accompanied by autopilot/flight director disconnects and reconnects, transition to alternate law, auto-thrust disconnect and activation of thrust lock, stall warnings, fluctuations in engine N1 RPM, fluctuating speed anomalies (intermittent falls, or a fast drop-off then plateau) and changes in other air data (air temperature and altitude).
The BEA also found that the pilot training scenario used in the simulator was one in which the aircraft remained in normal law and no alarms were triggered. There is clearly a significant difference between the complex multi-system, multi-factor unreliable airspeed event of the real world and the simplified 'single anomaly/failure' scenario that was trained for.
Human behaviour in high-stress, unfamiliar, uncertain and confusing environments such as this is rarely straightforward, as the BEA's second interim report into AF 447 showed. In a study of 13 unreliable airspeed incidents the BEA found that in none of the instances was there evidence of pitch/thrust memory items being applied, and that in four of the incidents the crews did not even identify an unreliable airspeed situation.
The manufacturer and operator rely on a quick, procedurally compliant response from the aircrew to a simple problem, whereas in practice what is presented is a confusing and unfamiliar situation, to which the majority response of aircrew could be characterised as sitting tight, ignoring the QRH procedure while trying to understand the situation and avoiding doing anything precipitate (7).
So a pitot icing event at high altitude sufficient to cause a reversion of flight control laws also presents the aircrew with a control problem where the margin for error is inherently smaller, alternate law warnings are less effective (8) and independent signs of incipient stall, such as airframe buffet and noise, can be made more ambiguous by spurious stall warnings and the outside environment, e.g. turbulence.
Our reach exceeded
This interaction of the operational imperatives of modern aviation, a highly integrated architecture optimised to deal with single failures, and operator training reflecting this 'single failure paradigm' has resulted in a system that is vulnerable to unexpected and undesigned-for variations in its environment.
My conclusion is that the AF 447 disaster will prove to be fundamentally a system accident. That is, one driven by the unanticipated interaction of system components, the environment and operators. As such it is an example of how the safety risk of complex systems is dominated by epistemic and ontological uncertainty rather than aleatory uncertainty (random failures or events).
As the complexity of aircraft and the flight environment increases this form of accident will undoubtedly come to dominate aviation safety, if it has not already. A spectre may be haunting Airbus yes, but it’s one that the rest of the industry cannot ignore.
This post is part of the Airbus aircraft family and system safety thread.
BEA, Interim report no. 2, on the accident on 1st June 2009, to the Airbus A330-203, registered F-GZCP, operated by Air France flight AF 447 Rio de Janeiro – Paris, Report Number f-cp090601ae2, November 2009.
FAA, Advisory Circular AC 25.1309-1A, System Design and Analysis, 21 June 1988.
1. As embodied in US FAR 25.1309(b).
2. This is of course not just an Airbus problem given the common regulatory environment.
3. The current FAR 25.1309 requirements are now expressed probabilistically, i.e. catastrophic events must be shown to be extremely improbable. However the extant FAA Advisory Circular AC 25.1309-1A still requires that special attention be paid to the use of fail-safe design concepts protecting against single failures. This mixed deterministic and probabilistic approach has been followed in modern programs, such as the A380.
4. Given the persistent trope in the media, I'd clarify that at FL350 AF 447 was not, as far as I am concerned, flying close to or 'in' the coffin corner. However the term itself is a qualitative one. To quote FAA Advisory Circular AC 61-107:
Q-Corner or Coffin Corner is a term used to describe operations at high altitudes where low indicated airspeeds yield high true airspeeds (MACH number) at high angles of attack.
The high angle of attack results in flow separation which causes buffet. Turning maneuvers at these altitudes increase the angle of attack and result in stability deterioration with a decrease in control effectiveness.
The relationship of stall speed to MACH crit narrows to a point where sudden increases in angle of attack, roll rates, and/or disturbances (e.g., clear air turbulence) cause the limits of the airspeed envelope to be exceeded. Coffin Corner exists in the upper portion (a qualitative term) of the maneuvering envelope for a given gross weight and G-force.
The key point being that as flight altitude increases the margin between stall speed and maximum Mach steadily decreases, making crew and automation responses (and environmental perturbations) more critical to continued safe flight.
5. As the BEA noted in its second interim report, all of the 13 unreliable airspeed events studied were associated with turbulence (BEA 2009).
6. The core air data system comprises three homogeneous redundant channels of pitot sensor -> dedicated sensor air data module -> air data inertial reference unit. Being homogeneous channels of course means that they are inherently vulnerable to common mode failures.
7. For more on what the crew of AF 447 were faced with and how they might have behaved, see my post here.
8. Finally, the effectiveness of the alternate law stall warning is also adversely affected by a reduction in measured airspeed, because as the airspeed reduces the angle of attack at which a stall warning is generated increases. Thus in circumstances where unreliable air data has reduced the measured airspeed, the stall warning protection is also reduced.
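This last point can be sketched numerically. The threshold schedule below is invented purely for illustration (transport-category stall-warning thresholds do fall as Mach rises, but these numbers are not Airbus data):

```python
# Illustrative numbers only: the stall-warning angle-of-attack threshold
# on transport aircraft typically falls as Mach rises, so a falsely LOW
# measured airspeed (hence low computed Mach) RAISES the threshold. The
# schedule below is invented for illustration and is not Airbus data.

def warning_aoa_deg(measured_mach):
    """Hypothetical linear threshold: lower Mach -> higher warning AoA."""
    return 10.0 - 6.0 * measured_mach

true_aoa = 6.0  # the aircraft's actual angle of attack, degrees

# With correct air data (M0.80) the threshold is ~5.2 deg: warning fires.
print(true_aoa > warning_aoa_deg(0.80))  # True
# With iced probes reading M0.30 the threshold rises to ~8.2 deg: silence.
print(true_aoa > warning_aoa_deg(0.30))  # False
```

The same angle of attack that would trigger a warning with valid air data passes unannounced once the measured airspeed is corrupted, which is precisely the degraded protection the footnote describes.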