Glass cockpits, event buckets and implicit design assumptions
Reading the ATSB interim report on the QF72 in-flight accident, one could easily overlook the statement, “…the crew reported that the (ECAM (1)) messages were constantly scrolling, and they could not effectively interact with the ECAM to action and/or clear the messages.” (ATSB report AO-2008-070). So during the QF72 event the crew’s primary display (interface) with the aircraft was significantly degraded. The question is, of course, why?
A quick background to the QF72 accident is that a series of ‘spike’ AoA errors, generated by an intermittent ADIRU fault, propagated through the flight management system, causing a cascading series of equipment faults and aircraft flight warnings, and triggering the aircraft’s high alpha protection function. Obviously the failure of a primary display during such an event adds to crew workload and stress levels, not a desirable trait in the human machine interface of any aircraft. It’s also a safe bet that this behaviour was not intended by the Airbus engineers. For a fuller description see my previous post here or the ATSB report referenced below.
To understand why this happened we first need to take a slight detour into the realms of cognitive engineering. Responding to a rapid sequence of events such as occurred during the QF72 accident requires the flight management system to interact asynchronously with the crew, especially for high priority events such as warnings and cautions. However, humans are inherently limited in their ability to respond to multiple tasks. In fact, as work load increases we reach a point at which our performance starts to fall away rapidly as we become situationally overloaded.
To deal with the problem of situational overload a queue or list of event data is created (sometimes termed an ‘event bucket’) with new events pended into the queue to await crew disposition. Because flight displays are limited in physical size, such a list can also easily exceed the available display area (2). To address this physical display constraint an event list is normally managed in one of two ways. The first (apparently adopted by Airbus) is to append events to the list and automatically scroll the list to maintain a view of the last (or priority) set of events. The second is to maintain a static view of the messages and provide a crew-initiated paging function.
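The difference between the two approaches can be illustrated with a minimal sketch. This is not the ECAM implementation; the `EventList` class, its method names and the three-row display are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EventList:
    """Hypothetical event bucket with a fixed-size display window."""
    rows: int                                   # messages the display can show (assumed > 0)
    events: list = field(default_factory=list)  # every event received so far
    page: int = 0                               # only used by the paged view

    def append(self, message: str) -> None:
        self.events.append(message)

    def scrolling_view(self) -> list:
        # Auto-scroll: always show the most recent `rows` events,
        # so older events vanish without any crew action.
        return self.events[-self.rows:]

    def paged_view(self) -> list:
        # Static view: the window only moves when the crew pages.
        start = self.page * self.rows
        return self.events[start:start + self.rows]

    def next_page(self) -> None:
        # Crew-initiated paging, clamped to the last available page.
        last_page = max(0, (len(self.events) - 1) // self.rows)
        self.page = min(self.page + 1, last_page)

    def unseen_count(self) -> int:
        # Events queued beyond the current page (could drive a "more" cue).
        return max(0, len(self.events) - (self.page + 1) * self.rows)


bucket = EventList(rows=3)
for i in range(1, 6):
    bucket.append(f"FAULT {i}")

print(bucket.scrolling_view())  # the last three events; FAULT 1 and 2 are gone
print(bucket.paged_view())      # still the first three events
print(bucket.unseen_count())    # number of events waiting on later pages
```

In the scrolling view the items the crew were working on are displaced by new arrivals, while in the paged view they stay put and the crew are only told that more events are waiting.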
The problem with a scrolling implementation is that if the crew are dividing their attention between competing activities it is disconcerting, to say the least, to look back and find that the events you were working on have disappeared from the display. This sort of scenario is normally greeted by comments such as, ‘where the $#@! has X gone…’. Where the number of new events in the queue is increasing rapidly, as it was in this case, the update rate, if unbounded, can exceed the cognitive ability of the crew, leading to them missing events in the list.
In extreme circumstances the display can simply become unusable, as it did here, and crew will completely ignore it until after the incident is resolved and cockpit work load has dropped to a reasonable level. The problems introduced by scrolling event displays have been recognised for a number of years, and in fact led the US Nuclear Regulatory Commission to preclude the use of scrolling alarm lists in US reactor alarm systems (O’Hara et al. 1994).
As the ECAM display layout above illustrates, ECAM messages are displayed in the lower quarter of the E/WD display, further reducing the available display area. Airbus engineers also provided an operational checklist to the crew, which further reduces the number of events that can be displayed. The fairly obvious design assumption that emerges here is that multiple independent faults (3) were considered extremely unlikely and that therefore a display optimised towards a ‘single event –> single response’ has eventuated.
The A330 ECAM software also automatically prioritises ECAM messages and updates the display accordingly. Again, this is a reasonable ‘first things first’ approach to event management, but in a situation of rapidly arriving independent events it also meant that the display update rate was driven outside the ability of the crew to respond effectively.
Noting the implicit assumption embodied in the design of the ECAM interface, it is unlikely (and I’m guessing here) that any specification of either a maximum display update rate or a minimum hysteresis time was placed on the ECAM design. As a result the ECAM display was unable to provide meaningful data to the crew during the QF72 in-flight accident.
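To make the idea of a maximum display update rate concrete, here is a minimal sketch of a display that accepts events at any rate but changes what the crew sees no more often than a specified interval. The `RateLimitedDisplay` class and the two second interval used below are hypothetical; an actual value would have to come from human factors analysis, not from this example.

```python
class RateLimitedDisplay:
    """Hypothetical display that bounds how often its visible content changes."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s  # assumed minimum time between refreshes
        self.pending = []                     # events received but not yet shown
        self.visible = []                     # what the crew currently sees
        self.last_refresh_s = None            # time of the last visible change

    def post(self, message: str, now_s: float) -> None:
        # Events are never dropped, only held until a refresh is allowed.
        self.pending.append(message)
        self.refresh(now_s)

    def refresh(self, now_s: float) -> bool:
        # Change the visible list only if the minimum interval has elapsed.
        due = (self.last_refresh_s is None
               or now_s - self.last_refresh_s >= self.min_interval_s)
        if due and self.pending:
            self.visible.extend(self.pending)
            self.pending.clear()
            self.last_refresh_s = now_s
            return True
        return False


display = RateLimitedDisplay(min_interval_s=2.0)
display.post("A", now_s=0.0)   # shown immediately, nothing displayed before
display.post("B", now_s=0.5)   # held, too soon after the last change
display.post("C", now_s=1.0)   # held
display.refresh(now_s=2.0)     # B and C now appear together in one update
```

Timestamps are passed in explicitly so the behaviour is easy to reason about. A burst of events therefore produces one batched change to the display, not a stream of changes.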
For those of you wondering what hysteresis has to do with anything, when designing an interface you need to remember that humans require a finite amount of time to identify that a displayed value or list item has changed, i.e. we have an in-built hysteresis in our response to changing circumstance. For example, if the software updates the event queue and sets a new event as the priority the crew could actually be responding to the previous event because they haven’t recognised the change. If a command, such as clearing the list item, arrives within this hysteresis time it may be advisable to hold the action and query the crew as to their intent.
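The ‘hold and query’ behaviour described above could be sketched as follows. Everything here is an assumption for illustration: the `PriorityItem` class, the returned status strings and the 1.5 second recognition time are invented and are not taken from the ECAM design.

```python
class PriorityItem:
    """Hypothetical priority slot that guards clear commands with a hysteresis time."""

    def __init__(self, recognition_time_s: float):
        # Assumed time the crew need to notice that the priority item changed.
        self.recognition_time_s = recognition_time_s
        self.current = None
        self.changed_at_s = None

    def set_priority(self, message: str, now_s: float) -> None:
        # The software promotes a new event to the priority position.
        self.current = message
        self.changed_at_s = now_s

    def clear(self, now_s: float) -> str:
        if self.current is None:
            return "NOTHING_TO_CLEAR"
        if now_s - self.changed_at_s < self.recognition_time_s:
            # The command arrived inside the hysteresis window, so the crew
            # may be reacting to the previous item: hold and ask for intent.
            return "CONFIRM"
        self.current = None
        return "CLEARED"


slot = PriorityItem(recognition_time_s=1.5)
slot.set_priority("NEW PRIORITY EVENT", now_s=10.0)
print(slot.clear(now_s=10.5))  # inside the window, the item is kept
print(slot.clear(now_s=12.0))  # outside the window, the item is cleared
```

The point of the sketch is that the hysteresis time becomes an explicit, testable parameter of the interface and is no longer left implicit.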
There are two key conclusions that can be drawn from this failure. The first is that fundamental design assumptions, such as fault hypotheses, can work their way through the entirety of a system with unexpected consequences if they prove false. The second is that any display should be designed to not exceed the cognitive limitations of the crew and these limits need to be explicitly specified.
1. ECAM is the Electronic Centralised Aircraft Monitor that provides information on the status of the aircraft and its systems on two display units. The upper unit or engine/warning display (E/WD) presents information such as engine primary indications, fuel quantity information and slats/flap positions. It also presents warning or caution messages when a failure occurs, and memo messages when there are no failures. The lower or system display (SD) presents aircraft synoptic diagrams and status messages (Spitzer, 2006). The lower ECAM display can be used to display overflow status messages; however, incoming priority messages (Warnings, Cautions and Advisories relating to out of tolerance parameters) will automatically pre-empt the status display with system schematics associated with that priority message.
2. During the Three Mile Island incident the event queue buffer became so full of events that the control room printer ran two and a half hours behind what was actually occurring.
3. This is not the same as the situation of dependent system failures. In these circumstances, if a fault occurs that results in a cascade of other system faults, ECAM should identify the originating fault and automatically present the operational checklist to the crew.
ATSB Transport Safety Report, Aviation Occurrence Investigation, AO-2008-070 Interim Factual Report on in-flight upset 154 km west of Learmonth, WA 7 October 2008, VH-QPA Airbus A330-303.
O’Hara, J., Brown, W., Higgins, J., & Stubler, W., Human Factors Engineering Guidelines for the Review of Advanced Alarm Systems (NUREG/CR-6105). Washington, D.C., U.S. Nuclear Regulatory Commission, 1994.
Spitzer, C.A., Digital Avionics Handbook, Second Edition, CRC Press, 2006.