Archives For Fault tolerance

So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?

In slightly longer form. If (for example) air data is so unreliable that your automation needs to automatically drop out of its primary mode, and your QRH procedure is then to manually fly pitch and thrust (1), then why not also automatically present a display page that provides only the data that pilots can trust and need to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’, where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software, it’s over to you brave aviator” (3). This response is, however, something of a cop out, as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:

  1. Redesign the attitude display with articulated pitch ladders, or a Malcolm horizon, to improve situational awareness.
  2. Provide a fallback AoA source using an AoA estimator (a rough sketch follows this list).
  3. Provide actual direct access to flight data parameters such as Mach number and AoA to support troubleshooting (5).
  4. Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
  5. Use non-Aristotelian logic to better model the trustworthiness of air data.
  6. Provide the current master/slave hierarchy status amongst voting channels to aircrew.
  7. Provide an obvious and intuitive way to remove a faulted channel, allowing flight under reversionary laws (7).
  8. Inform aircrew of the specific protection mode that has activated and the reasons (i.e. the flight data) triggering that activation (8).
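
On suggestion 2 above, one crude way to synthesise a fallback AoA when the air data itself is suspect is to derive it from inertial data alone: with the wings roughly level and sideslip small, AoA is approximately pitch attitude minus the inertial flight path angle. The sketch below is purely illustrative (the function name, units and threshold are mine, not anything drawn from an Airbus or other type design):

```c
#include <math.h>

/* Crude inertial AoA estimate: alpha ~= theta - gamma.
 * Valid only for small sideslip, wings near level and sensible ground speed.
 * Angles in radians, speeds in m/s. Names and limits are illustrative only. */
double estimate_aoa(double pitch_rad,       /* theta, from the inertial reference */
                    double climb_rate_ms,   /* inertial vertical speed */
                    double ground_speed_ms) /* inertial ground speed */
{
    if (ground_speed_ms < 30.0) {
        return NAN; /* the estimate is meaningless at very low speed */
    }
    double gamma = atan2(climb_rate_ms, ground_speed_ms); /* flight path angle */
    return pitch_rad - gamma;
}
```

A production estimator would of course blend inertial data with whatever air data remains trustworthy and account for wind; the point here is just the basic geometric relationship.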

As aviation systems get deeper and more complex, this need to support aircrew in such events will not diminish; in fact it is likely to increase, if the history of automation is any guide to the future.

Notes

1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps (as the aircraft was already in a safe state) and waited for the icing-induced condition to clear.

2. Although the Airbus Back Up Speed Display (BUSS) does use angle-of-attack data to provide a speed range and GPS height data to replace barometric altitude, it has problems at high altitude, where Mach number rather than speed becomes significant and the stall threshold changes with Mach number (which it does not know). As a result its use is (as per Airbus manuals) restricted to below FL 250.

3. What system designers do, in the abstract, is decompose and allocate system level behaviors to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except, ‘apparently’, if the component in question is a human and therefore considered to be ‘outside’ your system.

4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).

5. For example, in the Airbus design, although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second, they are not directly available to aircrew.

6. Yet another way of looking at the problem is that the principles of ecological design need to be applied to the aircrew task of dealing with contingency situations.

7. For example, in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel and deselect two ADRs… which two, and the criteria for choosing them, are not however detailed by the manufacturer.

8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.

Here’s a companion tutorial to the one on integrity level partitioning. This addresses more general software hazards and how to deal with them. Again you can find a more permanent link on my publications page. Enjoy 🙂

Unreliable airspeed events pose a significant challenge (and safety risk) because such situations throw onto aircrew the most difficult (and error-prone) of human cognitive tasks, that of ‘understanding’ a novel situation. This results in a double whammy for unreliable airspeed incidents. That is, the likelihood of an error in ‘understanding’ is far greater than for any other error type, and having made that sort of error it’s highly likely to prove fatal. Continue Reading…

And not quite as simple as you think…

The testimony of Michael Barr in the recent Oklahoma Toyota court case highlighted, amongst other things, problems with the design of the watchdog timer in Toyota’s Camry ETCS-i throttle control system, which got me thinking about the pervasive role that watchdogs play in safety critical systems. The great strength of watchdogs is of course that they provide a safety mechanism which resides outside the state machine, which gives them fundamental design independence from what’s going on inside. By their nature they’re also simple, small-scale beasts, thereby satisfying the economy of mechanism principle.
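
As a point of reference, the textbook pattern looks something like the sketch below: each monitored task sets an ‘alive’ flag, and the supervisor kicks the hardware watchdog only when every flag has been seen since the last kick; a stalled task means the kick is withheld, the watchdog expires and the processor is reset or driven to a safe state. This is a generic illustration rather than a description of the Toyota design, and hw_watchdog_kick() is a placeholder for whatever register sequence the real hardware requires:

```c
#include <stdbool.h>

#define NUM_TASKS 3u

/* Placeholder for the hardware-specific watchdog 'kick' (refresh) sequence. */
extern void hw_watchdog_kick(void);

static volatile bool task_alive[NUM_TASKS];

/* Each monitored task calls this once per cycle to prove it is still running. */
void task_checkin(unsigned task_id)
{
    if (task_id < NUM_TASKS) {
        task_alive[task_id] = true;
    }
}

/* Called from a periodic supervisor: kick the watchdog only if *every*
 * monitored task has checked in, then re-arm the flags for the next cycle.
 * If any task has stalled the kick is withheld and the watchdog times out. */
void watchdog_service(void)
{
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        if (!task_alive[i]) {
            return; /* a task has missed its deadline: let the watchdog bite */
        }
    }
    hw_watchdog_kick();
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        task_alive[i] = false;
    }
}
```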

Continue Reading…

Toyota ECM (Image source: Barr testimony presentation)

Economy of mechanism and fail safe defaults

I’ve just finished reading the testimony of Phil Koopman and Michael Barr given for the Toyota un-commanded acceleration lawsuit. Toyota settled after they were found guilty of acting with reckless disregard, but before the jury came back with their decision on punitive damages, and I’m not surprised.
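
To make the ‘fail safe defaults’ idea concrete, the sketch below shows the sort of thing the principle implies for a drive-by-wire throttle path: if two redundant pedal sensors disagree beyond a plausibility limit, the demand falls back to a safe default (idle) rather than trusting either reading. This is a hypothetical illustration of the principle only, not Toyota’s (or anyone else’s) actual code, and the threshold and names are invented:

```c
/* Hypothetical fail-safe default for a drive-by-wire throttle demand.
 * Two independent pedal position sensors are compared; if they disagree
 * beyond a plausibility limit the demand defaults to idle (the safe state). */
#define PEDAL_DISAGREE_LIMIT_PCT 5.0

double throttle_demand_pct(double pedal_a_pct, double pedal_b_pct)
{
    double diff = pedal_a_pct - pedal_b_pct;
    if (diff < 0.0) {
        diff = -diff;
    }
    if (diff > PEDAL_DISAGREE_LIMIT_PCT) {
        return 0.0; /* fail-safe default: command idle on implausible input */
    }
    /* Otherwise take the lower of the two readings as the conservative choice. */
    return (pedal_a_pct < pedal_b_pct) ? pedal_a_pct : pedal_b_pct;
}
```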

Continue Reading…

Singularity (Image source: Tecnoscience)

Or ‘On the breakdown of Bayesian techniques in the presence of knowledge singularities’

One of the abiding problems of safety critical ‘first of’ systems is that you face, as David Collingridge observed, a double bind dilemma:

  1. Initially an information problem because ‘real’ safety issues (hazards) and their risk cannot be easily identified or quantified until the system is deployed, but 
  2. By the time the system is deployed you then face a power (inertia) problem, that is, control or change is difficult once the system is in service. Eliminating a hazard at that point is usually very difficult, and we can usually only mitigate it in some fashion. Continue Reading…
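
One way to read the ‘knowledge singularity’ in Bayesian terms (and this is my gloss, not a quote from Collingridge) is the zero-prior problem: a hazard you never conceived of effectively carries a prior probability of zero, and Bayes’ rule can never update a zero prior to anything other than zero, no matter what the in-service evidence later says:

```latex
P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)} \;=\; \frac{P(E \mid H)\cdot 0}{P(E)} \;=\; 0
```

Until the system is deployed and the surprise actually happens, the formal machinery gives you no purchase on it, which is the information half of Collingridge’s dilemma in another guise.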