One of the questions we should ask whenever an accident occurs is whether we could have identified its causes during design. And if we didn’t, is there a flaw in our safety process?
Case Study: The 1993 Warsaw A320 accident
On touchdown, modern aircraft use a combination of speed (air) brakes, reverse thrust and wheel brakes to decelerate to a safe speed on the runway. Within an aircraft’s computers there’s a complex and distributed set of braking logic that is intended to ensure that braking is safely and effectively applied. By safe I mean (for example) not having thrust reversers inadvertently deploy in mid-air, as occurred in the Lauda Air disaster.
In the 1993 Warsaw accident the aircraft, an A320, touched down on one main landing gear because the crew were acting on incorrectly reported crosswind conditions. Because compression of both main landing gear is required as the indication that the aircraft has made the transition from Air to Ground (AG), the Landing Gear Control Interface Unit (LGCIU) validly did not report an AG transition (the AG signal remained FALSE).
Due to water on the runway, the main landing gear (MLG) wheel that was in contact aquaplaned across the surface and did not ‘spin up’. As the wheels must be spinning at 72 kts equivalent ground speed before the LGCIU can output a Wheel Spinning (WS) signal, a WS condition was validly not reported (the WS signal remained FALSE).
When flaps are extended to full in the A320, reverse thrust is interlocked (as implemented in the Full Authority Digital Engine Controller (FADEC)) such that its application requires the AG signal to be TRUE.
Similarly spoiler commands are interlocked (as implemented in the Spoiler Elevator Computer (SEC) #2) such that application requires the following:
- AG signal TRUE AND RADALT (altitude less than 10 ft) signal TRUE (1), or
- WS signal TRUE
Finally, within the Brakes and Steering Control Unit (BSCU), wheel brake application commands in Autobrake Mode are interlocked with the wheel speed (WSP) reaching 80% of a computed reference value (WSPREF, normally ground speed).
The combination of the three missing outputs from the LGCIU (AG signal, WS signal and WSP less than 80% of WSPREF) meant that no aircraft braking sub-system engaged (2). As a result the aircraft skidded off the end of the runway and collided with an earth rampart, with predictably dire consequences.
In this accident each of the subsystems worked as per its specification, but when they interacted within a specific operational context they together failed to provide a required service when demanded.
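The interlocks described above can be sketched in a few lines of Python. This is only an illustration of the logic as stated in this post, not the actual avionics implementation: the class and function names are hypothetical, the speed values are made-up placeholders, and RADALT is assumed to read below 10 ft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LandingSignals:
    """Illustrative signal set; field names are hypothetical."""
    ag: bool              # Air-to-Ground transition reported
    radalt_lt_10ft: bool  # radar altitude below 10 ft
    ws: bool              # Wheel Spinning reported
    wsp: float            # measured wheel speed (kts)
    wspref: float         # computed reference speed (kts)


def reverse_thrust_permitted(s: LandingSignals) -> bool:
    # FADEC interlock as described in the text: requires AG TRUE.
    return s.ag


def spoilers_permitted(s: LandingSignals) -> bool:
    # SEC interlock as described in the text: (AG AND RADALT) OR WS.
    return (s.ag and s.radalt_lt_10ft) or s.ws


def autobrake_permitted(s: LandingSignals) -> bool:
    # BSCU interlock as described in the text: WSP at least 80% of WSPREF.
    return s.wsp >= 0.8 * s.wspref


# Warsaw-like scenario: one gear down, wheel aquaplaning.
# The numeric speeds are placeholders, not accident data.
warsaw = LandingSignals(ag=False, radalt_lt_10ft=True, ws=False,
                        wsp=0.0, wspref=150.0)

available = {
    "reverse thrust": reverse_thrust_permitted(warsaw),
    "spoilers": spoilers_permitted(warsaw),
    "autobrake": autobrake_permitted(warsaw),
}
print(available)
```

Each function is trivially correct in isolation. It is only when all three are evaluated against the same operational scenario that every braking function turns out to be inhibited at once.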
System accidents and Interaction Hazards
In safety engineering parlance this type of accident is termed a system accident. In system accidents causal factors are never singular and rarely clear cut, but rather a set of complex, and usually highly coupled, interactions between system components, the operators and the environment (3). If the potential for this sort of unsafe interaction exists within a system then we could state that a hazard exists, and for the purposes of discussion call it an interaction style of hazard (4).
Why wasn’t this hazard identified?
What we are looking at here is a situation in which system (or aircraft level) functionality had been decomposed and allocated to a number of sub-systems.
From a safety management perspective this sort of distributed functional design is challenging to manage. If, for the sake of argument, we have the prime contractor supplying the SEC, subcontractor A the LGCIU, subcontractor B the FADEC and subcontractor C the BSCU, then we strike a difficulty in linking upwards from subsystem level safety analyses to system level effects.
In such an environment of commercial, project and technical interfaces it’s entirely possible that a safety analysis performed on the LGCIU by subcontractor A could identify certain fault conditions in the landing gear but, lacking the system context, fail to bridge the gap to the consequences at the aircraft level.
Failing to recognize this shortfall in subsystem analyses leaves us vulnerable at the system level to the undetected presence of interaction style hazards in the system of interest.
Could a specific type of hazard analysis have detected this?
Looking through the MIL-STD-882C toolkit of safety analyses (5) we find the System Hazard Analysis (SHA) which according to the standard is intended to look at the interrelationship of subsystems for hazardous interactions (amongst other things).
205.2.1…demonstrates the subsystems’ interrelationships for…possible independent, dependent and simultaneous hazardous events, including system failures, failures of safety devices, common cause failures and events, and system interactions that could create a hazard or result in an increase in mishap risk. MIL-STD-882C
This seems to be what we’re looking for, as it’s aimed at interrelationships and system interactions that could cause a hazard. But if this is the desired objective of the SHA, what methodology can we use to achieve it?
So how to actually do this?
One tried and true technique is to develop an overall fault tree analysis (6) starting with a top level event (TLE). In this case the TLE would be ‘loss of braking on landing’, and the analysis would work down from the aircraft level effect to the lowest component level to identify hazardous combinations of component failures and environmental conditions.
In this case a cursory examination of the accident scenario identifies that WS and WSP (at least 80% of WSPREF) are not independent inputs, as they are generated by the same component (the wheel speed sensor), which senses a single parameter (tyre rotation).
A properly constructed FT for the aircraft TLE would reasonably be expected to show that loss of the above three signals would generate a loss of braking, and to identify common causes: in this case, that the loss of a landing leg’s speed sensor output would lead to loss of both the WS signal and the WSP (at least 80% of WSPREF) signal for that leg.
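As a rough illustration of that common cause, the sketch below brute-forces the minimal cut sets of a toy fault tree for the TLE. The event names, the tree structure and the assumption that RADALT reads below 10 ft are simplifications of the logic described in this post, not a real aircraft fault tree, and the brute-force search only suits small, coherent trees like this one.

```python
from itertools import combinations

# Hypothetical basic events for a toy version of the braking fault tree.
AG_NOT_REPORTED = "AG not reported"
WHEEL_SPEED_OUTPUT_LOST = "wheel speed sensor output lost"
BASIC_EVENTS = [AG_NOT_REPORTED, WHEEL_SPEED_OUTPUT_LOST]


def loss_of_braking(failed: frozenset) -> bool:
    """Top level event: no braking sub-system is permitted to engage."""
    ag_false = AG_NOT_REPORTED in failed
    # Common cause: WS and WSP are both derived from the same sensor output.
    ws_false = WHEEL_SPEED_OUTPUT_LOST in failed
    wsp_below_threshold = WHEEL_SPEED_OUTPUT_LOST in failed

    no_reverse_thrust = ag_false
    no_spoilers = ag_false and ws_false  # RADALT assumed TRUE
    no_autobrake = wsp_below_threshold
    return no_reverse_thrust and no_spoilers and no_autobrake


def minimal_cut_sets(basic_events, top_event):
    """Smallest combinations of basic events that trigger the top event."""
    cut_sets = []
    for size in range(1, len(basic_events) + 1):
        for combo in combinations(basic_events, size):
            failed = frozenset(combo)
            if any(cs <= failed for cs in cut_sets):
                continue  # contains a smaller cut set, so not minimal
            if top_event(failed):
                cut_sets.append(failed)
    return cut_sets


cut_sets = minimal_cut_sets(BASIC_EVENTS, loss_of_braking)
for cs in cut_sets:
    print(sorted(cs))
```

Had WS and WSP been modelled as independent basic events, the cut set would contain three events rather than two, and the risk would be understated accordingly.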
Introducing operational modes
Now if the FT was focused purely on component failures we might still not identify this as a particularly high risk hazard because it would require more than one independent failure.
However, if we were to integrate landing modes into the FTA and consider scenarios where only one landing gear was compressed due to landing in a severe crosswind (a designed-for scenario per the logic), the failure of a WS sensor would potentially become a Single Point of Failure (SPOF), and much more critical.
The natural ‘engineering’ approach would be to eliminate the SPOF by providing a redundant WS sensor; however, we should still (as per MIL-STD-882’s definition) consider common mode and common cause failures.
Dealing with induced (common cause) failures
Broadening the FTA further to consider the wheel inputs to the speed sensor, including erroneous inputs, we could conceivably consider that the WS sensor might ‘fail’ because of the presence of water on the runway. The FTA would then identify that a combination of a crosswind landing on a contaminated runway (flooding, snow or ice for example) could lead to a loss of aircraft brakes. Our duplicated WS sensor doesn’t save us here, as the input is the same to each sensor.
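To see why duplicating the sensor doesn’t help, here is a small enumeration over operational and environmental conditions. It is a deliberately simplified model built on assumptions: the function and parameter names are hypothetical, RADALT is assumed to read below 10 ft, a single-gear touchdown is assumed to leave AG FALSE, and an aquaplaning wheel is assumed never to spin up.

```python
from itertools import product


def braking_available(single_gear: bool, contaminated: bool,
                      sensor_a_ok: bool, sensor_b_ok: bool) -> bool:
    """True if at least one braking sub-system is permitted to engage."""
    ag = not single_gear              # AG needs both main gear compressed
    wheel_spun_up = not contaminated  # aquaplaning wheel does not spin up
    # Either healthy sensor can report spin-up, but only if it happened.
    ws = wheel_spun_up and (sensor_a_ok or sensor_b_ok)
    wsp_ok = ws                       # WSP derives from the same rotation

    reverse_thrust = ag
    spoilers = ag or ws               # RADALT assumed TRUE
    autobrake = wsp_ok
    return reverse_thrust or spoilers or autobrake


# Every combination of conditions that leaves the aircraft with no braking.
hazards = [combo for combo in product([False, True], repeat=4)
           if not braking_available(*combo)]

for single_gear, contaminated, a_ok, b_ok in hazards:
    print(f"single gear={single_gear}, contaminated={contaminated}, "
          f"sensor A ok={a_ok}, sensor B ok={b_ok}")
```

In this model the combination of a single-gear touchdown and a contaminated runway appears in the hazard list even with both sensors healthy, which is the induced failure that component-level redundancy cannot address.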
Identifying that the right combination of external environmental and operational conditions could cause a failure of the braking system should trigger a careful consideration of the likelihood of such an event occurring in the life of the aircraft, and of whether or not we can and should rework the system logic to eliminate the possibility of its occurrence.
One key point to note here is when we start looking outside the system at erroneous or unanticipated inputs and their effects we should also consider whether the inputs are themselves independent. For example if we have rapid changes in wind direction due to storm activity do we expect it to be dependent or independent of heavy rain?
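A short worked example shows how much the independence assumption can matter. The probabilities below are purely illustrative placeholders, not meteorological or accident data, and the variable names are hypothetical.

```python
# Illustrative per-landing probabilities; these are NOT real data.
p_crosswind = 0.01             # severe crosswind on a given landing
p_rain = 0.02                  # heavy rain on a given landing
p_rain_given_crosswind = 0.5   # heavy rain, given a storm-driven crosswind

# If the two conditions were independent: P(A and B) = P(A) * P(B).
p_joint_independent = p_crosswind * p_rain

# If both are driven by the same storm: P(A and B) = P(A) * P(B | A).
p_joint_dependent = p_crosswind * p_rain_given_crosswind

ratio = p_joint_dependent / p_joint_independent
print(f"independent: {p_joint_independent:.1e}")
print(f"dependent:   {p_joint_dependent:.1e}")
print(f"ratio:       {ratio:.0f}x")
```

With these made-up numbers, treating the two conditions as independent understates the joint likelihood by a factor of roughly 25.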
The above is intended to be an example of how a System Hazard Analysis (SHA) using the Fault Tree technique can be used to identify an interaction type hazard that we infer from the causal factors identified in the 1993 Warsaw accident. There are a couple of key caveats however.
The first is that we don’t know what safety analyses were or were not conducted for the original certification of the A320 nor whether they identified this combination of environmental conditions as a potential hazard, or finally whether that hazard (if identified) was assessed as acceptable or at least tolerable.
The second is that, as you’ll note above, what is identified as a hazardous failure is very much dependent upon the scope of the analysis. If an analysis purely considers internal failure modes then you will get an answer that’s significantly different from one in which you broaden the scope to include external conditions, or the various operational modes. When environmental factors are identified as potential common causes, the likelihood of their occurring in turn also needs to be carefully evaluated.
Finally it would seem to be reasonable to me that when an accident occurs the investigation authorities review the OEM’s system hazard analysis (or equivalent) to determine whether the causal factors were identified and as a result whether the safety analysis process can be improved. I’ll note that in the case of the Warsaw accident this did not occur.
References
1. MIL-STD-882C, System Safety Program Requirements, 1993.
2. Perrow, C., Normal Accidents: Living with High-Risk Technologies, Basic Books, New York, 1984.
Notes
1. Wheel Spinning (WS) is a permissive as long as one main wheel is on the ground and the Radar Alt is less than 10 ft.
2. Note that a further contribution to the delay in braking was that the aircraft was high in the initial approach (due to a pilot decision) and ‘floating’ due to excessive speed (due to the misreported wind shear). Flying high and fast on the approach delayed washing off the induced lift and further delayed triggering of the landing gear switches as the aircraft squatted.
3. Also termed ‘normal’ accidents by Charles Perrow (1984). I am not however as pessimistic as Perrow in terms of our ability to identify such hazardous interactions and deal with them.
4. Similar to the covert channel concept of computer security.
5. I’ve steered away from a discussion of the ARP4754 and ARP4761 safety assessment techniques as (IMO) the added complexity of that approach detracts from the fundamental argument.
6. Although I’ve used Fault Tree Analysis in this example there are other safety analysis techniques that could serve, see for example Event Trees or Cause Consequence Analysis.
The key is to use a technique that supports the analysis of the safety critical thread of functionality that runs through the components that make up your system of interest.