Fail-safe and triplex redundancy
The statement by Airbus regarding the robustness of the Airbus AoA median value voting logic, disclosed in the ATSB QF72 accident report, raises some interesting questions as to what was actually meant by the terms ‘robustness’ and ‘fail safe’.
Certainly median value voting has the advantage of avoiding the drag that a drifting sensor exerts on an average as its value moves towards the threshold, and as a sample statistic the median handles outlier skewing of the data set better than the mean. From a software engineering perspective it also fits a triple redundant architecture, and the processing is a simple (and quick) comparison and selection task, which is important in a real time control loop. What remains unstated, however, is that the robustness of the Airbus AoA sensor voting logic is in the context of a fail safe design as described in FAA advisory circular AC 25.1309-1A (1988). AC 25.1309 states that the regulations are based on and incorporate the principle of a fail safe design, describing it as:
5.(a) … (1) in any system or subsystem, the failure of any single element, component, or connection during any one flight should be assumed, regardless of its probability….; and
… (2) that subsequent failures during the same flight, whether detected or latent, and combinations thereof, should also be assumed unless their joint probability is shown to be extremely improbable…
As the stated purpose of the circular is to describe acceptable methods of demonstrating compliance with the FAR, this becomes the ‘default’ design strategy for modern aircraft, and for the Airbus A330 flight control architecture in this particular instance. When Airbus states that a triple redundant voting scheme is robust in the face of failure, it should be understood that this is strictly in response to the first part of this hypothesis, that is, a single failure. This is because as long as we can demonstrate that the probability of N+1 failures is sufficiently remote, we do not have to demonstrate robustness for multiple failures.
But what happens if the hypothesis is violated and more than one failure occurs? An experimental evaluation of seven voting algorithms in a variety of simulated multiple error scenarios for a triple redundant configuration showed that the median voter produced the largest number of correct results, but also the largest number of catastrophic errors (Bass et al., 1997). So in the context of the FAA fault hypothesis Airbus engineers are absolutely correct in their statements as to the robustness and fail-safeness of that design, but once that hypothesis is violated it turns out that the median value voting algorithm becomes the most hazardous (1). Is an algorithm design that, when it fails, generates the most hazardous output one which could be considered ‘fail safe’?
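To make the single versus multiple failure distinction concrete, here is a minimal Python sketch of a mid-value select voter. The function name median_vote and all of the readings are illustrative assumptions; they are not taken from the Airbus design, the ATSB report or the Bass et al. experiments.

```python
from statistics import median


def median_vote(a, b, c):
    """Mid-value select: return the middle of three redundant readings."""
    return median([a, b, c])


# Hypothetical AoA readings in degrees; the healthy value is taken to be about 2.

# Single failure: one channel spikes and is simply outvoted.
single_fault = median_vote(2.1, 50.0, 2.0)   # -> 2.1

# Double failure: two channels agree on the same wrong value,
# so the wrong value is the median and is passed on as valid.
double_fault = median_vote(50.0, 50.0, 2.1)  # -> 50.0

print(single_fault, double_fault)
```

Within the single failure hypothesis the voter masks the fault completely. Once two channels fail in a correlated way, the same selection rule hands the erroneous value straight to the control loop with no indication that anything is wrong.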
The take-home from this result is that for safety critical triple redundant systems using median value voting schemes we have to pay very careful attention to any common causal factor that increases the likelihood of multiple failures. In the case of AoA probes such factors include the dependency between left and right sensors, e.g. in certain cases we can expect sensor values to differ, thereby negating redundancy. Likewise the co-location of sensors and the use of similar hardware also reduce the independence of these sensor channels.
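A rough way to see why common causal factors matter so much is the simple beta-factor model sketched below in Python. Everything in it is an assumption made for illustration: the function name, the per-flight failure probability p, and the fraction beta of failures attributed to a common cause are hypothetical and are not drawn from the A330 safety case or from AC 25.1309-1A.

```python
def p_two_or_more(p, beta=0.0):
    """Probability that at least two of three channels fail in one flight.

    p    -- assumed per-flight failure probability of a single channel
    beta -- assumed fraction of p due to a common cause that fails
            all three channels together (beta-factor model)
    """
    q = (1.0 - beta) * p   # independent part of each channel's failure probability
    c = beta * p           # common-cause event taking out all three channels
    # At least two of three independent failures: exactly two, or all three.
    independent = 3 * q**2 * (1 - q) + q**3
    return c + (1 - c) * independent


# Hypothetical numbers, chosen only to show the effect.
independent_only = p_two_or_more(1e-4)          # fully independent channels
with_common_cause = p_two_or_more(1e-4, 0.01)   # 1% of failures are common cause

print(independent_only, with_common_cause)
```

With these made-up figures, attributing just one percent of channel failures to a common cause raises the chance of a multiple failure by more than an order of magnitude, which is exactly the case the single failure argument does not cover.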
1. ATSB, ATSB Transport Safety Report, Aviation Occurrence Investigation, AO-2008-070 Interim Factual Report on in-flight upset 154 km west of Learmonth, WA 7 October 2008, VH-QPA Airbus A330-303.
2. Bass, J.M., Latif-Shabgahi, G., Bennett, S., Experimental Comparison of Voting Algorithms in Cases of Disagreement, 23rd EUROMICRO Conference ’97 New Frontiers of Information Technology, p. 516, 1997.
3. FAA, Federal Advisory Circular, System Design and Analysis, AC 25.1309-1A, June 1988.
1. From a risk perspective we also know that the sample median statistic is an unbiased estimator that minimises the loss (read risk) with respect to the absolute deviation loss function. That is, the median assumes a linearly increasing loss as we move away from the actual value in either direction. A question this raises is whether this assumption is appropriate for a flight control sensor input. Is there a linear relationship between the loss value of a 1 degree error, a 5 degree error and a 50 degree error? Perhaps when considering redundancy management we should work backwards from the loss function to a voting algorithm that better matches it?
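The link between loss function and voter can be checked numerically. The Python sketch below, with hypothetical readings and function names of my own choosing, compares the median and the mean of three sensor values under an absolute deviation loss and under a squared error loss.

```python
from statistics import mean, median


def absolute_loss(estimate, samples):
    """Sum of absolute deviations of the samples from the estimate."""
    return sum(abs(x - estimate) for x in samples)


def squared_loss(estimate, samples):
    """Sum of squared deviations of the samples from the estimate."""
    return sum((x - estimate) ** 2 for x in samples)


# Hypothetical AoA readings in degrees, one of them a gross outlier.
readings = [2.0, 2.5, 50.0]

med = median(readings)
avg = mean(readings)

# The median scores better under absolute loss ...
print(absolute_loss(med, readings), absolute_loss(avg, readings))
# ... while the mean scores better under squared loss.
print(squared_loss(med, readings), squared_loss(avg, readings))
```

Neither loss is obviously the right one for a flight control input. The point of the sketch is only that choosing a voter is implicitly choosing a loss function, so starting from the loss that reflects the actual hazard may lead to a different voting rule.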