Error, iPads and attribution

21/11/2015 — 3 Comments

[Image: Qantas Boeing 737-838, VH-VZR]

Deconstructing a tail strike incident

On 1 August last year a Qantas 737-838 (VH-VZR) suffered a tail strike while taking off from Sydney airport, and this week the ATSB released its report on the incident. The ATSB narrative is essentially that, when working out the plane’s Takeoff Weight (TOW) on a notepad, the captain forgot to carry the ‘1’, which resulted in an erroneous weight of 66,400 kg rather than 76,400 kg. Subsequently the co-pilot made a transposition error when carrying out the same calculation on the Qantas iPad-resident on-board performance tool (OPT), in this case keying a ‘6’ in place of the ‘7’ in the take-off weight and so entering 66,400 kg into the OPT. A cross-check of the OPT-calculated Vref40 speed against that calculated by the FMC (which uses the aircraft Zero Fuel Weight (ZFW), rather than the TOW, as its input for calculating Vref40) would have picked the error up, but the crew misinterpreted the check and so it was not performed correctly.

The ATSB found that the most significant contributing factor was that the Flight Crew Operating Manual (FCOM) procedure for crew comparison of the calculated Vref40 speed, which was designed to assist in identifying a data entry error, could be misinterpreted, thereby negating the effectiveness of the check. So, the ATSB’s narrative goes, this is clearly a human error and procedural issue (1), and all we need to do is tighten up the procedural controls. I call this the ‘try harder’ approach to system safety.

An argument is defined by what it ignores and the perspectives it opposes (explicitly or implicitly)

Derrida

According to the ATSB narrative this is a story all about human error and procedural violation, both of which are implied to be inevitable. We might accept this narrative unreflectively, because it’s a powerful and traditional one. But rather than doing so, let’s deconstruct the ATSB’s narrative a little and see where we end up. Applying Derrida’s rules of deconstruction, we first set aside whether we think the ATSB’s narrative is ‘warranted’ and look for a counter-narrative that can stand in opposition to it. We don’t have far to look for such an argument in this case. If we reject crew and procedural issues as being causally significant, then we’re left with the design of the cybernetic system (2), and in particular its interfaces, as significant contributing factors.

[Image: the iPad OPT application]

OPT showing erroneous TOW entered

Let’s look at the first officer’s transcription error (3) in that light. In this case the first officer calculated the correct TOW, then entered the most significant digit of the TOW as a ‘6’ rather than a ‘7’. Transcription errors are a common outcome where the user is required to enter data as a string of digits. Considering this error as attributable to the interface, rather than solely to the operator, then opens up the possibility of doing something about it. For example, we could substitute a scroll wheel (as supported by the iPad) for the ordinary QWERTY keyboard. Using a scroll wheel would reduce the likelihood of a straight transcription error because it requires both a deliberate ‘eyes down’ selection of a value and two separate actions, select and commit, to enter it (4). This sort of interface change would ideally eliminate (5) data transcription errors at the point of entry into the system, rather than allowing them to flow through the chain of information processing until they were detected by the Vref40 speed cross-check (6).
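As a rough illustration of the select-and-commit idea, here is a minimal sketch in Python (entirely hypothetical, and obviously not how the actual OPT is implemented): a value has to be deliberately selected from a constrained range and then explicitly committed before it flows anywhere downstream.

```python
# Hypothetical sketch of a select-and-commit entry control, as opposed to
# free-text keyboard entry. Class name, ranges and values are illustrative only.

class WeightPicker:
    """Two-step entry: a value must be selected, then explicitly committed."""

    def __init__(self, minimum_kg: int, maximum_kg: int, step_kg: int = 100):
        # The 'wheel' only offers values inside a plausible envelope.
        self.valid_values = range(minimum_kg, maximum_kg + step_kg, step_kg)
        self.selected = None

    def select(self, value_kg: int) -> None:
        # Deliberate, eyes-down selection; grossly out-of-envelope values
        # are rejected here rather than flowing downstream.
        if value_kg not in self.valid_values:
            raise ValueError(f"{value_kg} kg is outside the selectable range")
        self.selected = value_kg

    def commit(self) -> int:
        # Second, separate action: nothing is accepted until the operator
        # confirms the value they can see on the wheel.
        if self.selected is None:
            raise RuntimeError("no value selected")
        return self.selected


picker = WeightPicker(minimum_kg=45_000, maximum_kg=85_000)
picker.select(76_400)      # eyes-down selection
tow_kg = picker.commit()   # explicit confirmation before the value is used
```

Note that in this particular incident 66,400 kg would likely still sit inside a plausible envelope, so the benefit comes from the select-and-commit discipline rather than from range checking alone.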

If we step back a little and consider the wider view of our system, another issue emerges. We are apparently asking the crew to act as a ‘data entry’ function between the hardcopy load sheet ‘system’ and the OPT ‘system’. Apparently the crew have to hand-calculate the TOW from the ZFW and fuel values on the load sheet and then enter it into the OPT. This is known technically as a ‘kludge’: the point where system designers get the human operators to act as the glue between two poorly integrated systems. Clearly a bar code on the load sheet, or electronic transfer of the data directly to the OPT, would obviate the need for data entry. Likewise, using the OPT to calculate the TOW from the aircraft fuel state and the load sheet ZFW would obviate the need for an error-prone ‘handraulic’ calculation.
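To make that concrete, here is a minimal sketch (my construction, not a description of the actual OPT) of what removing the hand calculation might look like: the tool derives the TOW itself from the load sheet ZFW and the fuel figure, so there is no intermediate hand-calculated number for the crew to transcribe.

```python
# Hypothetical sketch only: derive TOW inside the tool from its inputs rather
# than asking the crew to hand-calculate and re-key it. Figures are illustrative.

def takeoff_weight_kg(zero_fuel_weight_kg: float,
                      fuel_on_board_kg: float,
                      taxi_fuel_kg: float = 0.0) -> float:
    """TOW = ZFW plus fuel at brake release (fuel on board less taxi burn)."""
    return zero_fuel_weight_kg + fuel_on_board_kg - taxi_fuel_kg

# With electronic transfer (or a bar-coded load sheet) the ZFW and fuel figures
# arrive as data, not as digits to be re-typed by the crew.
tow = takeoff_weight_kg(zero_fuel_weight_kg=62_000, fuel_on_board_kg=14_400)
print(f"Calculated TOW: {tow:,.0f} kg")
```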

Now let’s look at our cybernetic system from an information flow perspective; here I draw on the work of Peter Ladkin and Bernd Sieker of the Faculty of Technology and the Cognitive Interaction Technology Centre of Excellence (CITEC) at the University of Bielefeld. In their paper on applying formal methods to procedures they examined a number of similar accidents and incidents. The pertinent point in this instance is the number of informationally independent values of TOW in the system (7), with only a single point, the Vref40 cross-check, at which the consistency and correctness of the TOW is confirmed. If we think about what’s happening here, we are allowing hazardous information states to occur and propagate with only one check standing between them and their use in flight, with potentially catastrophic consequences. Reliance on a single procedural check also seems to me to violate the FAA/JAA’s no-single-point-of-failure principle (8). Ladkin and Sieker provide a number of recommendations as to how robust, exception-proof procedures (systems) can be developed; I wouldn’t characterise the current single checkpoint as meeting their standard.
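By way of illustration, here is a hedged sketch (my own construction, not Ladkin and Sieker’s formalism) of what an additional automated consistency check might look like: every independently derived TOW value is compared before any of them is used, rather than relying on a single crew-performed Vref40 comparison.

```python
# Hypothetical sketch: cross-check every independently derived TOW figure before
# any of them is used, instead of relying on one downstream speed comparison.
# Source names and the tolerance are illustrative assumptions.

def consistent(tow_values_kg: dict, tolerance_kg: float = 500.0) -> bool:
    """True if all independently derived TOW figures agree within tolerance."""
    values = list(tow_values_kg.values())
    return max(values) - min(values) <= tolerance_kg

sources = {
    "captain_hand_calculation": 66_400,   # arithmetic slip (dropped carry)
    "first_officer_opt_entry": 66_400,    # keying slip ('6' for '7')
    "derived_from_zfw_plus_fuel": 76_400,
}

if not consistent(sources):
    print("TOW mismatch - resolve before takeoff data is accepted:", sources)
```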

A further point emerges when we consider the risk posed by errors propagating through the system. If we have had, as the ATSB chronicles in their report, a series of accidents and incidents due to data entry errors, then this base error rate is what’s driving the accident rate. Estimating human error rates is notoriously difficult, but in this case having a crack at it might allow us to judge the subsequent effectiveness of the error-detecting mechanism (is it really watertight, or quite leaky?). Another question on the risk front is how likely an accident is to occur if the erroneous data is actually used. If most of the time errors don’t result in a hazardous scenario, then a low accident rate may not be demonstrating the effectiveness of the cross-check at all. So what, we may ask, is the actual risk here? And are our assumptions about the effectiveness of our safety measures valid?
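To make the risk question concrete, a back-of-the-envelope sketch might look like the following; every number in it is an assumption I have made up for illustration, not a figure from the ATSB report or anywhere else.

```python
# Back-of-the-envelope sketch only: all probabilities below are assumed,
# illustrative values, not data from the ATSB report.

p_entry_error = 1e-3          # assumed chance of an erroneous TOW per departure
p_check_misses = 1e-1         # assumed chance the Vref40 cross-check fails to catch it
p_hazard_given_error = 1e-1   # assumed chance an undetected error actually bites

p_hazard_per_departure = p_entry_error * p_check_misses * p_hazard_given_error
print(f"roughly {p_hazard_per_departure:.0e} hazardous takeoffs per departure")

# If p_hazard_given_error is small (most errors are benign) then a low accident
# rate tells us little about how leaky the cross-check really is.
```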

So how warranted is the ATSB’s narrative? My conclusion is that their focus is too narrow and, as the counter-narrative shows, there are broader system-level issues that they fail to address. Consequently their conclusion as to the inadequacy of the FCOM procedure needs to be broadened to encompass the more general need for robust exception-handling procedures. They also focus overly on error recovery, ignoring opportunities for error elimination at source through the design of the system’s interfaces. Sometimes trying harder is not the required response; sometimes we actually have to change.

Notes

1. This conclusion is based on an examination of the identified contributing factors, issues and actions, all of which were focused upon human behaviour and procedures.

2. The system of people (crew and support staff in this instance), data (hard copy and electronic), hardware (iPads, FMC avionics), computational software (FMC and OPT) and, often overlooked, the interfaces between them.

3. The ATSB report terms this a transposition error, which is incorrect. A transposition is where two characters are swapped; that was not the case here, rather the character ‘6’ was substituted for the ‘7’. A transposition error would have been entering ’67’ instead of ’76’ (see the short sketch after these notes).

4. I’m here assuming the OPT utilised the standard iPad Qwerty style keyboard as this is consistent with the ATSB report.

5. More likely, significantly reduce.

6. This follows the system safety philosophy of eliminating hazards rather than mitigating them. The hazardous state in this case is an undetected erroneous TOW within the processing chain. The Vref40 cross-check detects and recovers from the error but does nothing to actually eliminate it.

7. The Ladkin/Sieker paper’s meta-model posits three models of the system: a requirements model, a cognitive model and a procedural model. They develop a formal logical proof that the procedural model formally emulates the cognitive model, and that the cognitive model in turn formally emulates the requirements model.

8. Yes, I know that these were two pilot errors; however, I’d argue that given people make ‘mistakes’ all the time it’s appropriate not to treat human error in the same fashion as we treat equipment failure. If that doesn’t convince you, I’d also argue that a series of interruptions in the cockpit, to which both pilots were exposed, constitutes a common cause failure mode. If you still want to argue, we can get out the FAA/JAA’s ancillary criteria for tolerance of multiple failures.
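As flagged in note 3, the distinction between the two error types is easy to state in code; a trivial, purely illustrative sketch:

```python
# Trivial illustration of the difference between the two error types.

def is_transposition(intended: str, entered: str) -> bool:
    """Two adjacent characters swapped, e.g. '76400' -> '67400'."""
    if len(intended) != len(entered) or intended == entered:
        return False
    diffs = [i for i, (a, b) in enumerate(zip(intended, entered)) if a != b]
    return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
            and intended[diffs[0]] == entered[diffs[1]]
            and intended[diffs[1]] == entered[diffs[0]])

def is_substitution(intended: str, entered: str) -> bool:
    """A single character replaced, e.g. '76400' -> '66400'."""
    diffs = [i for i, (a, b) in enumerate(zip(intended, entered)) if a != b]
    return len(intended) == len(entered) and len(diffs) == 1

print(is_transposition("76400", "67400"))  # True
print(is_substitution("76400", "66400"))   # True (what actually happened)
```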

3 responses to Error, iPads and attribution

  1. 

    This is an interesting kind of error. Aside from how it occurs (data input error, or whatever), the conditions which set it up can probably be traced back to “on time” performance. There is a deadline, which is the scheduled time of departure, but there is also a turnaround time for unloading, cleaning, fuelling, loading and preparing the aircraft for departure. For efficient aircraft utilisation, airlines will try to reduce the turnaround time to the minimum possible, which means critical ZFW information may only be produced minutes before departure. This time just before departure is also when interruptions, such as those mentioned by Peter Ladkin and Bernd Sieker in their article, are most likely to occur (e.g. calling ATC for clearance, confirming all pax on board, signing load sheets, etc.). The result is that when crews most need the time to check and cross-check what they are doing, they have the least amount of time to do it and are most likely to get interrupted. Data entry errors are bound to be affected by this, so maybe strategies to reduce these errors should look at mitigating what is happening at this point.

    • 
      Matthew Squair 23/11/2015 at 10:57 pm

      The scenario you describe has a lot in common with health care, it seems to me. A bit of cross-pollination could prove useful perhaps?

      • 

        Interesting possibility. The health field has been using flight trainers to help them set up CRM (Crew Resource Management) programmes. The scenario was: the old-fashioned cockpit had an authoritarian captain and a crew who all followed him to their deaths because no one had the strength to point out the shortcomings of his plan. Contrast that with the surgeon who is used to being God in the operating room and who follows operating procedures that everyone knows are incorrect, but no one has the strength to stop him. Aviation (at least in the West) has significantly cut back on the old-fashioned captain scenario using CRM, and significantly increased safety as a result. Now hospitals want to do the same in the operating theatre to improve patient survival rates.

        What we are now finding is that human biology is the limiting factor in improving safety. The Qantas crew did not knowingly enter the wrong data; instead, the error was a function of our biology/psychology. There may be other areas where the health profession has useful input, but in this case improvements in aviation safety need to be directed at accounting for these kinds of human error.

        I also think your thesis that tweaking the checklist may not really solve the problem is correct. To solve it, the whole “system” needs to be looked at, introducing some steps earlier on which will protect against these kinds of error.
