Deconstructing a tail strike incident
On August 1 last year, a Qantas 737-838 (VH-VZR) suffered a tail-strike while taking off from Sydney Airport, and this week the ATSB released its report on the incident. The ATSB narrative is essentially that when working out the plane's Takeoff Weight (TOW) on a notepad, the captain forgot to carry the '1', which resulted in an erroneous weight of 66,400kg rather than 76,400kg. Subsequently the co-pilot made a data entry error when carrying out the same calculation on the Qantas iPad resident on-board performance tool (OPT), in this case entering a '6' in place of the '7' in the TOW's leading digit, so likewise arriving at 66,400kg in the OPT. A cross check of the OPT calculated Vref40 speed value against that calculated by the FMC (which uses the aircraft Zero Fuel Weight (ZFW) input rather than TOW to calculate Vref40) would have picked the error up, but the crew misinterpreted the check and so it was not performed correctly.
The ATSB found that the most significant contributing factor was that the Flight Crew Operating Manual (FCOM) procedure for crew comparison of the calculated Vref40 speed, which was designed to assist in identifying a data entry error, could be misinterpreted, thereby negating the effectiveness of the check. So, goes the ATSB's narrative, this is clearly a human error and procedural issue (1), and all we need to do is tighten up the procedural controls. I call this the 'try harder' approach to system safety.
An argument is defined by what it ignores and the perspectives it opposes (explicitly or implicitly)
According to the ATSB narrative this is a story all about human error and procedural violation, both implied to be inevitable. Unreflectively, we might accept this narrative, because it's a powerful and traditional one. But rather than doing so, let's deconstruct the ATSB's narrative a little and see where we end up. Applying Derrida's rules of deconstruction, we first set aside whether we think the ATSB's narrative is 'warranted', and look for a counter narrative that can stand in opposition to it. We don't have far to look for such an argument in this case. If we reject crew and procedural issues as being causally significant, then we're left with the design of the cybernetic system (2), and in particular its interfaces, as significant contributing factors.
Let's look at the first officer's transcription error (3) in that light. In this case the first officer calculated the correct TOW, then entered the most significant digit of the TOW as a '6' rather than a '7'. Transcription errors are a common outcome where the user is required to enter data as a string of digits. Considering this error as attributable to the interface, rather than solely to the operator, then opens up the possibility of doing something about it. For example we could substitute the use of a scroll wheel (as supported by the iPad) for the ordinary QWERTY keyboard. Using a scroll wheel would reduce the likelihood of a straight transcription error because it requires both the deliberate 'eyes down' selection of a value and two separate actions, select and commit, to enter it (4). This sort of interface change would ideally eliminate (5) data transcription errors at the point of entry into the system, rather than allowing an error to flow through the chain of information processing until it was detected by the Vref40 speed cross check (6).
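The select-then-commit pattern can be sketched as a small state machine. To be clear, the class, method names and weight range below are hypothetical illustrations of the design idea, not any real OPT interface.

```python
class StagedEntry:
    """Sketch of a select-then-commit input: a value is first staged
    (e.g. by scrolling to it) and only becomes the committed value
    after a second, deliberate action. A single keystroke slip can
    stage a wrong value, but cannot silently commit it, because the
    staged value is echoed back for an 'eyes down' check first."""

    def __init__(self, choices):
        self.choices = choices      # the scroll wheel's discrete values
        self.staged = None
        self.committed = None

    def select(self, value):
        if value not in self.choices:
            raise ValueError(f"{value} is not a selectable value")
        self.staged = value
        return self.staged          # echoed back to the user

    def commit(self):
        if self.staged is None:
            raise RuntimeError("nothing staged to commit")
        self.committed = self.staged
        return self.committed


# TOW selectable only in 100kg steps over a plausible weight range,
# so a free-typed digit string simply cannot be entered.
entry = StagedEntry(choices=range(40_000, 80_000, 100))
entry.select(76_400)   # deliberate 'eyes down' selection
entry.commit()         # separate action to accept the value
```

The point of the two-action shape is that the error-prone step (selection) and the irrevocable step (commitment) are decoupled, giving the user a natural review point between them.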
If we step back a little and consider the wider view of our system, another issue emerges. We are apparently asking the crew to act as a 'data entry' function between the hardcopy load sheet 'system' and the OPT 'system'. Apparently the crew have to hand calculate the TOW from the ZFW and fuel values on the load sheet, then enter it into the OPT. This is known technically as a 'kludge': that point where system designers get the human operators to act as the glue between two poorly integrated systems. Clearly a barcode on the load sheet, or electronic transfer of the data direct to the OPT, would obviate the need for data entry. Likewise, using the OPT to calculate TOW from the aircraft fuel state and the load sheet ZFW would obviate the need for an error prone 'handraulic' calculation.
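What that looks like is almost embarrassingly simple: the tool derives TOW itself from its two source values. A sketch only; the function name is mine, and the ZFW/fuel split is illustrative (any split summing to the actual 76,400kg TOW), not a figure from the report.

```python
def takeoff_weight(zfw_kg: int, fuel_kg: int) -> int:
    """Derive TOW inside the tool from the load sheet ZFW and the
    aircraft fuel state, removing both the error-prone hand addition
    and the re-keying of its result."""
    return zfw_kg + fuel_kg

# Illustrative split only: the point is that no human adds the two
# values or re-enters the total, so a carry slip or a substituted
# digit simply has nowhere to occur.
tow_kg = takeoff_weight(zfw_kg=62_400, fuel_kg=14_000)
```

That a one-line function removes two distinct error opportunities is rather the point: the 'kludge' exists only because the integration work was pushed onto the crew.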
Now let's look at our cybernetic system from an information flow perspective; here I draw on the work of Peter Ladkin and Bernd Sieker of the Faculty of Technology and the Cognitive Interaction Technology Centre of Excellence (CITEC) at the University of Bielefeld. In their paper on applying formal methods to procedures they examined a number of similar accidents and incidents. The pertinent point in this instance is the number of informationally independent values of TOW in the system (7), with only one point, the Vref40 cross-check, at which the consistency and correctness of TOW is confirmed. If we think about what's happening here, we are allowing hazardous information states to occur and propagate with only a single check standing between them and their use in flight, with potentially catastrophic consequences. The reliance on a single procedural check also seems to me to violate the no single point of failure principle of the FAA/JAA (8). Ladkin and Sieker provide a number of recommendations as to how robust, exception-proof procedures (systems) can be developed. I wouldn't characterise the current single checkpoint as meeting their standard.
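The informational point can be made concrete with a small consistency check over every independent copy of TOW in the system. This is a sketch of the idea in my own terms, not the Ladkin/Sieker formalism.

```python
def tow_consistent(tow_values_kg, tolerance_kg=100):
    """Return True only if every informationally independent value of
    TOW in the system (the captain's hand calculation, the first
    officer's OPT entry, the value implied by the FMC's ZFW plus fuel)
    agrees to within a tolerance. A single disagreement flags a
    hazardous information state before it propagates into the
    takeoff performance figures."""
    return max(tow_values_kg) - min(tow_values_kg) <= tolerance_kg

# The incident's values: both erroneous 66,400kg figures disagree
# with the true 76,400kg, so the state would be flagged immediately.
flagged = not tow_consistent([66_400, 66_400, 76_400])
```

Note that such a check compares all the independent values against each other, rather than relying on one procedural comparison at one point in the flow; any one value can be wrong without the error surviving.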
A further point emerges when we consider the risk posed by errors propagating through the system. If we have had, as the ATSB chronicles in their report, a series of accidents and incidents due to data entry errors, then this base error rate is what's driving the accident rate. Estimating the likelihood of human error is notoriously difficult, but in this case having a crack at it might allow us to judge the effectiveness of the error detecting mechanism (is it really water-tight, or quite leaky?). Another question on the risk front is how likely an accident is to occur if the erroneous data is actually used. If most of the time errors don't result in a hazardous scenario, then a low accident rate may not be demonstrating the effectiveness of the cross check at all. So what, we may ask, is the actual risk here? Are our assumptions as to the effectiveness of our safety measures valid?
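Under purely illustrative assumptions (none of these rates appear in the ATSB report, they're placeholders for the argument), the arithmetic might run as follows:

```python
# All figures below are assumptions for the sake of the argument,
# not estimates from any report or study.
entry_error_rate = 1e-3       # erroneous TOW entries per departure
check_leakage = 0.1           # fraction of errors the Vref40 check misses
p_hazard_given_error = 0.05   # chance an undetected error actually bites

p_accident = entry_error_rate * check_leakage * p_hazard_given_error
# A leaky check combined with errors that rarely bite can produce a
# reassuringly low accident rate while the base error rate, and so
# the real exposure, remains high.
```

The product structure is the point: a low observed accident rate could be explained by any of the three factors, so it tells us little about the cross check's effectiveness on its own.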
So how warranted is the ATSB's narrative? Well, my conclusion is that their focus is too narrow, and as the counter narrative shows there are broader system level issues that they fail to address. Consequently their conclusion as to the inadequacy of the FCOM procedure needs to be broadened to encompass the more general need for robust exception handling procedures. They also focus overly on error recovery, ignoring opportunities for error elimination at source through the design of the system's interfaces. Sometimes trying harder is not the required response; sometimes we actually have to change.
1. This conclusion is based on an examination of the identified contributing factors, issues and actions, all of which were focused upon human behaviour and procedures.
2. The system of people (crew and support staff in this instance), data (hard copy and electronic), hardware (iPads, FMC avionics), computational software (FMC and OPT) and, often overlooked, the interfaces between them.
3. The ATSB report terms this a transposition error, which is incorrect. Transposition is where two characters are swapped, which was not the case here; instead the character '6' was substituted for '7'. A transposition error would be to enter '67' instead of '76'.
4. I'm assuming here that the OPT utilised the standard iPad QWERTY-style keyboard, as this is consistent with the ATSB report.
5. More likely significantly reduce.
6. This follows the system safety philosophy of eliminating hazards rather than mitigating them. The hazardous state in this case is an undetected erroneous TOW within the processing chain. The Vref40 cross check detects and recovers from the error but does nothing to actually eliminate it.
7. The Ladkin/Sieker paper's meta-model posits three models of the system: a requirements model, a cognitive model and a procedural model. They develop a formal logical proof that the procedural model formally emulates the cognitive model, and that the cognitive model formally emulates the requirements model.
8. Yes, I know that it was two pilot errors; however, I'd argue that given that people make 'mistakes' all the time, it's appropriate not to consider human error in the same fashion as we consider equipment failure. If that doesn't convince you, I'd also argue that a series of interruptions in the cockpit, to which both pilots are exposed, constitutes a common cause failure mode. If you still want to argue, we can get out the FAA/JAA's ancillary criteria for tolerance of multiple failures.