Airbus voting logic and QF72

20/06/2009

QF 72 (Image Source: Terence Ong)

This post is part of the Airbus aircraft family and system safety thread.

The QF72 incident

Introduction

On 7 October 2008, a Qantas Airbus A330-303, operating as flight QF72, departed Singapore on a scheduled service to Perth, Australia. While en route the aircraft suffered a series of uncommanded pitch-down manoeuvres that injured passengers and crew as they were thrown into the cabin roof and luggage bins. This post looks at how a series of assumptions and architectural decisions led to the uncommanded upset.

The accident

Whilst en route the following sequence of events unfolded (times are local, UTC+8):

  1. At 1240:28, while the aircraft was cruising at 37,000 ft, the autopilot abruptly disconnected accompanied by various onboard system failure indications.
  2. At 1242:27 the aircraft abruptly pitched nose-down. The aircraft reached a maximum pitch angle of about 8.4 degrees nose-down, and descended 650 ft during the event.
  3. After returning the aircraft to 37,000 ft, the crew commenced actions to deal with multiple failure messages.
  4. At 1245:08, the aircraft commenced a second uncommanded pitch-down event. The aircraft reached a maximum pitch angle of about 3.5 degrees nose-down, and descended about 400 ft during this second event.

Whilst the passengers and crew were lucky that these events did not happen during landing or takeoff, numerous injuries did occur as unrestrained passengers and cabin crew were thrown into the cabin roof and luggage bins.

Causal factors

The ADIRU failure. The aircraft’s Flight Data Recorder (FDR) recorded spikes in various sensed and calculated flight parameters from Air Data Inertial Reference Unit (ADIRU) 1, including angle of attack (AOA), pressure altitude, computed airspeed, Mach number, static air temperature, pitch angle, roll angle, wind speed and wind direction. The spikes appeared to be random in nature and occurred for different parameters at different times. The first AOA 1 spike occurred at 0440:34 UTC, when AOA 1 values changed from +2.1 degrees to +50.6 degrees and back to +2.1 degrees over three successive samples. In total, 42 AOA 1 spikes were recorded before the aircraft touched down at Learmonth. One of the recorded AOA spikes occurred at 0442:26 UTC, immediately prior to the first pitch-down (0442:27). Another occurred at 0445:08 UTC, immediately prior to the second pitch-down (0445:09). Both of those spikes had a magnitude of +50.6 degrees.

Numerous master caution, stall and overspeed warnings were recorded, as well as a manual, crew-initiated changeover of the captain’s Primary Flight Display (PFD) data source from ADIRU 1 to ADIRU 3, possibly due to erratic flight parameter indications. A timeline of events derived from the ATSB report is provided below.

A post-flight download of maintenance data from the Central Maintenance Computer (CMC) found that, between 0440 and 0442, several aircraft systems had also detected an ADIRU 2 fault. Despite numerous related fault messages being recorded, including faults relating to navigation systems, pitot sensor heating and the Flight Control Primary Computer (PRIM), subsequent bench tests of equipment found anomalous behaviour only in ADIRU 1.

Architecture & fault tolerance. The architecture of the A330 flight controls can generally be described as a triple redundant non-homogeneous voting architecture. In this architecture a vote is taken on the input or output of a channel and majority rule is applied to determine the validity of data (1).

For most of the ADIRU parameters, the PRIMs obtain three different values of the same parameter. Each value comes from a different sensor processed by a different ADIRU. The PRIMs compare the value of the parameter coming from each ADIRU. If the value of any parameter differs from the median (middle) value (2) by more than a threshold amount for more than a set period of time, then the relevant part (that is, ADR or IR) of the associated ADIRU is no longer used by the PRIMs. In addition, for all ADIRU parameters except AoA, when all three values are valid, the median value is used for calculating the flight control commands. The OEM (Airbus) considers the median to be robust against an error from any one data source, because the median statistic (3) is less sensitive to outlier values than the arithmetic mean (4). That matters if we expect a sensor to fail to an extreme value, such as the limit of its measured scale.
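The difference between the two statistics is easy to see with three redundant readings, one of which has failed high. The following is only a minimal sketch: the readings are illustrative (borrowed from the spike magnitudes quoted earlier), and `THRESHOLD_DEG` and the `deviates` helper are hypothetical, since the actual threshold is not given in the source.

```python
from statistics import mean, median

# Three redundant readings of one parameter; the third channel has
# failed to an extreme value (illustrative numbers only).
readings = [2.1, 2.1, 50.6]

voted_median = median(readings)  # middle value, unaffected by the outlier
voted_mean = mean(readings)      # dragged towards the outlier

THRESHOLD_DEG = 5.0  # hypothetical; the real threshold is not in the source


def deviates(value, reference, threshold):
    """True if a channel differs from the reference by more than threshold."""
    return abs(value - reference) > threshold


# Compare each channel against the median, as the monitoring described above does.
flags = [deviates(r, voted_median, THRESHOLD_DEG) for r in readings]
print(voted_median, round(voted_mean, 2), flags)
```

With these numbers the median stays at 2.1 while the mean is pulled up to roughly 18.3, and only the third channel is flagged as deviating.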

However, angle of attack data is processed slightly differently, because the AoA 1 and AoA 2 sensors can give different values from AoA 3 (for example during sideslip): AoA 1 and 2 are mounted on one side of the aircraft while AoA 3 is mounted on the other. For angle of attack data, the PRIMs compare the median value from all three ADIRUs with the value from each ADIRU. If the difference is greater than a set value for more than 1 second continuously, then the PRIM flags the ADR part of the associated ADIRU as faulty and ignores its data for the remainder of the flight.
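A persistence monitor of this kind can be sketched as follows. This is an illustration of the idea rather than the Airbus implementation: the class name, the 5 degree threshold and the 100 ms sample period are all assumptions, and only the 1 second persistence time comes from the text.

```python
from statistics import median


class PersistenceMonitor:
    """Latch a channel as faulty once it has deviated from the median
    continuously for longer than persist_ms milliseconds."""

    def __init__(self, threshold, persist_ms):
        self.threshold = threshold
        self.persist_ms = persist_ms
        self.deviating_since = None  # start time of the current excursion
        self.faulty = False          # latched for the rest of the flight

    def update(self, t_ms, value, all_values):
        if self.faulty:
            return True
        if abs(value - median(all_values)) > self.threshold:
            if self.deviating_since is None:
                self.deviating_since = t_ms
            if t_ms - self.deviating_since > self.persist_ms:
                self.faulty = True
        else:
            self.deviating_since = None  # excursion ended, timer resets
        return self.faulty


def run(aoa1_samples, dt_ms=100):
    """Feed AoA 1 samples to the monitor; AoA 2 and AoA 3 are held at 2.1."""
    mon = PersistenceMonitor(threshold=5.0, persist_ms=1000)  # threshold is hypothetical
    for i, aoa1 in enumerate(aoa1_samples):
        mon.update(i * dt_ms, aoa1, [aoa1, 2.1, 2.1])
    return mon.faulty


short_spike = [2.1] * 5 + [50.6] * 3 + [2.1] * 20  # 0.3 s excursion
stuck_high = [2.1] * 5 + [50.6] * 20               # 2.0 s excursion
print(run(short_spike), run(stuck_high))
```

In this sketch the sustained failure is latched, but the short spike resets the timer and is never flagged, however large it is.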

To calculate the value of AOA to use for calculating flight control commands, the PRIMs use the average (arithmetic mean) (4) value of AOA 1 and AOA 2. This value is passed through a rate limiter to prevent rapid changes in the value of the data due to short-duration anomalies, e.g. due to turbulence. If the difference between AoA 1 (or AoA 2) and the median value from all three ADIRUs is higher than a set value, the PRIMs use a memorised last valid average value for a period of 1.2 seconds. After 1.2 seconds, the current average value is then used.

Such a ‘coast through’ strategy is often used in flight control systems: by discarding a small set of bad inputs the system can ‘coast’ for a short period until better inputs are received. However, because the angle of attack voting gate is set to one second, a spike lasting less than one second cannot be detected and voted out. Therefore, if either AoA 1 or AoA 2 generates the spike, that value enters the angle of attack calculation in the PRIM. The spike would then cause the PRIM to discard the current average and use the last valid average angle of attack value instead. If the first spike is followed by a second spike that is present 1.2 seconds after the first detection of an unacceptable angle of attack, the second spike is used to calculate the current average value, which is then processed as valid data. Because the PRIM flight control logic treats the spiked angle of attack as valid, the PRIM commands an automated pitch manoeuvre through the high angle of attack protection (called alpha prot) or anti pitch-up compensation flight control functions.
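The sequence described above can be reproduced with a deliberately simplified model. This is a sketch built only from the description in this post, not the actual PRIM code: the function names, the 5 degree threshold and the 100 ms sample period are assumptions, and the rate limiter is left out for brevity. Only the 1.2 second memorisation period is taken from the text.

```python
from statistics import median

HOLD_MS = 1200   # memorisation period, from the text
THRESHOLD = 5.0  # hypothetical deviation threshold, degrees


def aoa_for_control(samples, dt_ms=100):
    """samples: list of (aoa1, aoa2, aoa3) tuples.
    Returns the AOA value handed to the control laws at each step."""
    out = []
    # Assumption: the first sample is healthy and seeds the memorised value.
    last_valid = (samples[0][0] + samples[0][1]) / 2
    hold_until = None
    for i, (a1, a2, a3) in enumerate(samples):
        t = i * dt_ms
        avg = (a1 + a2) / 2
        med = median([a1, a2, a3])
        suspect = max(abs(a1 - med), abs(a2 - med)) > THRESHOLD
        if hold_until is not None:
            if t < hold_until:
                out.append(last_valid)  # coasting on the memorised value
            else:
                hold_until = None
                out.append(avg)         # hold expired: current average used unchecked
            continue
        if suspect:
            hold_until = t + HOLD_MS    # start coasting
            out.append(last_valid)
        else:
            last_valid = avg
            out.append(avg)
    return out


def scenario(spike_steps, n=30):
    """AOA 1 spikes to 50.6 at the given step indices; AOA 2 and 3 stay at 2.1."""
    return [(50.6 if i in spike_steps else 2.1, 2.1, 2.1) for i in range(n)]


single = aoa_for_control(scenario({5}))
double = aoa_for_control(scenario({5, 17}))  # second spike 1.2 s after the first
print(max(single), round(max(double), 2))
```

In this model a single spike is coasted through and never reaches the output. When a second spike coincides with the end of the memorisation period, the averaged value of about 26 degrees (the mean of 50.6 and 2.1) passes through as if it were valid.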

Root causes

Underlying the vulnerability outlined above is an assumption about the probability distribution of the time between successive noise events in the system. If we assume that noise is completely random, then the fault tolerance scheme above is reasonable. If, however, noise is ‘bursty’ or impulsive in nature, so that the occurrence of one spike is not independent of the next, then this fault tolerance scheme may be less than ideal. Using the arithmetic mean (rather than the geometric mean) also makes the angle of attack algorithm more vulnerable to an extreme value in one sensor channel (6). The independence assumption is an important but unstated, and unchallenged, part of the fault hypothesis for the flight system. A follow-on question is whether there are any other possible sources of such ‘bursty’ noise in the system.
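To see why the independence assumption matters, consider the chance that a second spike lands in a short vulnerable window once a first spike has occurred. The sketch below assumes Poisson (independent) arrivals and uses entirely hypothetical numbers: the window width and both spike rates are invented for illustration and do not come from the ATSB report.

```python
import math


def p_spike_within(rate_per_s, window_s):
    """P(at least one spike starts within window_s), assuming independent
    (Poisson) arrivals at the given average rate."""
    return 1.0 - math.exp(-rate_per_s * window_s)


WINDOW_S = 0.2              # hypothetical width of the vulnerable window
independent_rate = 1 / 600  # hypothetical: one spike per 10 minutes on average
burst_rate = 1 / 2          # hypothetical: one spike per 2 s while a burst lasts

# Under independence, a first spike tells us nothing about the next one,
# so the long-run average rate applies.
p_independent = p_spike_within(independent_rate, WINDOW_S)

# Under bursty noise, a first spike suggests a burst is in progress,
# so the much higher in-burst rate applies.
p_bursty = p_spike_within(burst_rate, WINDOW_S)

print(f"{p_independent:.5f} {p_bursty:.5f}")
```

Whatever the real figures are, the structure of the argument is the same: a design that is comfortable under the average rate can be exposed by orders of magnitude more often if spikes cluster.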

The aircraft manufacturer, Airbus, advised the ATSB that angle of attack spikes may occur on many flights but that, in its experience, there were usually only a very small number of spikes on any particular flight. Airbus was not aware of any previous event where AOA spikes had met the above conditions and resulted in an in-flight upset. The problem with relying upon the reported rarity of such incidents is that, for a system intended to have extremely high reliability, it is very unlikely that such events would already have been experienced in the field, and even less likely that designers would know of them. Unfortunately, for high integrity systems a lack of evidence cannot be used as evidence of lack.

The A330 voting logic checks for reasonableness by comparison amongst a set of sensed values. But this is not the only way reasonableness checks can be performed. Another check would be whether the magnitude of the change is physically possible within the time available: for example, is it even possible for the aircraft angle of attack to go from +2.1 degrees to +50.6 degrees (5) in two successive sample periods? If not, then a simple check on the magnitude of the change could be used in addition to the channel voting logic to further reduce system vulnerability (6).
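A check of this kind might look like the following sketch. The function name, the 20 degrees per second limit and the 100 ms sample period are hypothetical; a real limit would have to be derived from the aircraft's flight dynamics.

```python
def filter_samples(samples, dt_s, max_rate):
    """Reject samples that imply a physically implausible rate of change.

    A rejected sample is replaced by the last accepted value. The allowed
    change grows with the time elapsed since that value was accepted, so a
    genuine change is eventually let through."""
    last_good = samples[0]  # assumes the first sample is good
    elapsed = 0.0
    accepted = [last_good]
    for value in samples[1:]:
        elapsed += dt_s
        if abs(value - last_good) <= max_rate * elapsed:
            last_good, elapsed = value, 0.0
        accepted.append(last_good)
    return accepted


MAX_RATE = 20.0  # deg/s, hypothetical bound on how fast AOA can really change
DT = 0.1         # s, hypothetical sample period

print(filter_samples([2.1, 50.6, 2.1, 2.3], DT, MAX_RATE))
# prints [2.1, 2.1, 2.1, 2.3]
```

The jump to 50.6 degrees implies a rate of 485 degrees per second and is rejected on its own evidence, with no reference to the other channels. That independence from the voting logic is what makes it a dissimilar barrier.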

Lessons learned?

While the events on board QF72 and the causal factors culminating in the accident may seem highly unlikely, a review of other accidents and incidents involving flight control systems shows that such system vulnerabilities are not as uncommon as we might think. As part of the post-incident investigation Airbus searched its online maintenance database for reports containing a similar pattern of fault messages; four matching sets were identified. This result highlights both the value of such data-mining techniques in testing the assumptions made during the development process and their necessity. The undetected presence of such anomalies in recorded data also indicates a need to actively search for anomalies as part of an ongoing safety program.

The USN’s F/A-18 aircraft have also experienced a number of accidents due to failure of the dual redundant angle of attack probe air data system. In one accident the AoA probe failed to an extreme position during the catapult stroke but prior to the ‘weight off wheels’ state. The flight control system recognised the failure and reverted to an assumed safe AoA probe value sampled prior to weight off wheels, leading to an unrecoverable pitch transient and the loss of aircraft and crew.

Conclusions

An obvious conclusion from the QF72 accident is that assumptions on which critical performance is based should always be explicitly stated and subject to challenge as soon as evidence becomes available.

From an architectural perspective the interdependence of the AOA 1, 2 and 3 signals subverts the triple redundancy of the A330 architecture. This is a subtle example of a common cause failure: because the AOA signals are not independent, a modified voting strategy must be adopted, and that strategy introduces a single point of failure when assumptions about the randomness of noise in the channel break down. Perhaps truly triplex redundant left and right AOA sensor positions would be more appropriate?

The inclusion of simple reasonableness checks based on the known flight performance of the vehicle could have provided a redundant and dissimilar barrier to fault propagation that would have reduced vulnerability to this specific failure mode.

As Trevor Kletz once remarked of safety in the process industry, ‘once an explosive mixture has formed experience shows that sooner or later a source of ignition turns up’. We might similarly observe that once a software fault exists, sooner or later a trigger for it will turn up. Eliminating one potential trigger (a faulty ADIRU) does not mean that we have eliminated all the potential triggering sources, such as other flight control components or the external environment.

Notes

1. One of the vulnerabilities of such an architecture is asynchronicity of process. For example, if sensor values are sampled at slightly different rates, or the sensors monitor the controlled process at slightly different positions, then the data in each channel becomes asynchronous. This can be further exacerbated by processing time differences in the feed forward loop, leading to ‘slightly out of sequence/specification’ modes of failure. Resistance to asynchronicity can be obtained by widening the voting gates, but at the expense of increased failed sensor ‘average drag’, which then leads to extreme ‘bumps and thumps’ when a channel is cut in or out of the control logic.

2. A median value is defined as the sensor value separating the higher half of the set of sensor values from the lower half. For three sensors, using the median value means that one specific sensor is selected as the ‘truthful’ sensor for the purpose of calculating flight control commands.

3. The use of the median statistic implies a decision theory based on an absolute-difference loss function.

4. The use of the arithmetic mean statistic implies a decision theory based on a Taguchi or squared error loss function.

5. As per the ATSB report for the A330, during all phases of flight, the typical operational range of angle of attack is +1 degree to +10 degrees. In cruise, a typical angle of attack is +2 degrees.

6. See also my post Averages, Voting and System Robustness.
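Notes 3 and 4 can be stated compactly using two standard results. For sensor readings x_1, ..., x_n, the median is a value that minimises the total absolute-difference loss, while the arithmetic mean minimises the total squared error loss (the notation here is mine, not the ATSB's):

```latex
\tilde{x} = \arg\min_{c} \sum_{i=1}^{n} \lvert x_i - c \rvert
\qquad \text{(absolute-difference loss, minimised by the median)}

\bar{x} = \arg\min_{c} \sum_{i=1}^{n} (x_i - c)^2 = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad \text{(squared error loss, minimised by the arithmetic mean)}
```

Because the squared loss penalises large deviations disproportionately, a single extreme reading shifts the mean by 1/n of its error, whereas the median does not move at all so long as the faulty channel remains in the minority.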

References

1. ATSB Transport Safety Report, Aviation Occurrence Investigation, AO-2008-070 Interim Factual Report on in-flight upset 154 km west of Learmonth, WA 7 October 2008, VH-QPA Airbus A330-303.

3 responses to Airbus voting logic and QF72

  1. 
    imran akhtar 24/09/2011 at 4:00 am

    Interesting analysis. Are you aware of the Malaysian Airlines B777 in flight upset in the same region due to a faulty accelerometer? Link to the report below:

    http://www.atsb.gov.au/publications/investigation_reports/2005/aair/aair200503722.aspx
