…it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.
Archives For Technology
The technological aspects of engineering for high consequence systems.
It is highly questionable whether total system safety is always enhanced by allocating functions to automatic devices rather than human operators, and there is some reason to believe that flight-deck automation may have already passed its optimum point.
So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?
In slightly longer form. If (for example) air data is so unreliable that your automation needs to automatically drop out of it’s primary mode, and your QRH procedure is then to manually fly pitch and thrust (1) then why not also automatically present a display page that only provides the data that pilots can trust and is needed to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’ where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software it’s over to you brave aviator” (3). This response is however something of a cop out as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:
- Redesign the attitude display with articulated pitch ladders, or a Malcom’s horizon to improve situational awareness.
- Provide a fallback AoA source using an AoA estimator.
- Provide actual direct access to flight data parameters such as mach number and AoA to support troubleshooting (5).
- Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
- Use non-aristotlean logic to better model the trustworthiness of air data.
- Provide the current master/slave hierarchy status amongst voting channels to aircrew.
- Provide an obvious and intuitive way to to remove a faulted channel allowing flight under reversionary laws (7).
- Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).
As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish, in fact it is likely to increase if the past history of automation is any guide to the future.
1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps as the aircraft was already in a safe state and waited for the icing induced condition to clear.
2. Although the Airbus Back Up Speed Display (BUSS) does use angle-of-attack data to provide a speed range and GPS height data to replace barometric altitude it has problems at high altitude where mach number rather than speed becomes significant and the stall threshold changes with mach number (which it doesn’t not know). As a result it’s use is 9as per Airbus manuals) below 250 FL.
3. What system designers do, in the abstract, is decompose and allocate system level behaviors to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except ‘apparently’ if the component in question is a human and therefore considered to be outside’ your system.
4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).
5. For example in the Airbus design although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second they are not directly available to aircrew.
6. Yet another way of looking at the problem is that the principles of ecological design needs to be applied to the aircrew task of dealing with contingency situations.
7. For example in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel, and deselect two ADRs…which ones and the criterion to choose which ones are not however detailed by the manufacturer.
8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.
The NBN, an example of degraded societal resilience?
Why writing a safety case might (actually) be a good idea
Frequent readers of my blog would probably realise that I’m a little sceptical of safety cases, as Scrooge remarked to Morely’s ghost, “There’s more of gravy than of grave about you, whatever you are!” So to for safety cases, oft more gravy than gravitas about them in my opinion, regardless of what their proponents might think.
It is a common requirement to either load or update applications over the air after a distributed system has been deployed. For embedded systems that are mass market this is in fact a fundamental necessity. Of course once you do have an ability to load remotely there’s a back door that you have to be concerned about, and if the software is part of a vehicle’s control system or an insulin pump controller the consequences of leaving that door unsecured can be dire. To do this securely requires us to tackle the insecurities of the communications protocol head on.
One strategy is to insert a protocol ‘security layer’ between the stack and the application. The security layer then mediate between the application and the Stack to enforce the system’s overall security policy. For example the layer could confirm:
- that the software update originated from an authenticated source,
- that the update had not been modified,
- that the update itself had been authorised, and
- that the resources required by the downloaded software conform to any onboard safety or security policy.
There are also obvious economy of mechanism advantages when dealing with protocols like the TCP/IP monster. Who after all wants to mess around with the entirety of the TCP/IP stack, given that Richard Stevens took three volumes to define the damn thing? Similarly who wants to go through the entire process again when going from IP5 to IP6? 🙂
Defence in depth
One of the oft stated mantra’s of both system safety and cyber-security is that a defence in depth is required if you’re really serious about either topic. But what does that even mean? How deep? And depth of what exactly? Jello? Cacti? While such a statement has a reassuring gravitas, in practice it’s void of meaning unless you can point to an exemplar design and say there, that is what a defence in depth looks like. Continue Reading…
Unreliable airspeed events pose a significant challenge (and safety risk) because such situations throw onto aircrew the most difficult (and error prone) of human cognitive tasks, that of ‘understanding’ a novel situation. This results in a double whammy for unreliable airspeed incidents. That is the likelihood of an error in ‘understanding’ is far greater than any other error type, and having made that sort of error it’s highly likely that it’s going to be a fatal one. Continue Reading…
This post is part of the Airbus aircraft family and system safety thread.
While there’s often a lot of discussion about short term response of aircraft to control inputs, in practice it’s often the long term response of the aircraft state vector at constant thrust and neutral control inputs that’s just as important to flight control system designers. In the case of Airbus the selection by the designers of a modified C* feedback loop (1) for primary pitch axis control law (Airbus 1998) in flight has led to what you’d call interesting consequences. Continue Reading…
The Dreamliner and the Network
Big complicated technologies are rarely (perhaps never) developed by one organisation. Instead they’re a patchwork quilt of individual systems which are developed by domain experts, with the whole being stitched together by a single authority/agency. This practice is nothing new, it’s been around since the earliest days of the cybernetic era, it’s a classic tool that organisations and engineers use to deal with industrial scale design tasks (1). But what is different is that we no longer design systems, and systems of systems, as loose federations of entities. We now think of and design our systems as networks, and thus our system of systems have become a ‘network of networks’ that exhibit much greater degrees of interdependence.
While once again the media has whipped itself into a frenzy of anticipation over the objects sighted in the southern Indian ocean we should all be realistic about the likelihood of finding wreckage from MH370.
Separation of privilege and the avoidance of unpleasant surprises
Another post in an occasional series on how Saltzer and Schroeder’s eight principles of security and safety engineering seem to overlap in a number of areas, and what we might get from looking at safety with from a security perspective. In this post I’ll look at the concept of separation of privilege.
And not quite as simple as you think…
The testimony of Michael Barr, in the recent Oklahoma Toyota court case highlighted problems with the design of Toyota’s watchdog timer for their Camry ETCS-i throttle control system, amongst other things, which got me thinking about the pervasive role that watchdogs play in safety critical systems. The great strength of watchdogs is of course that they provide a safety mechanism which resides outside the state machine, which gives them fundamental design independence from what’s going on inside. By their nature they’re also simple and small scale beasts, thereby satisfying the economy of mechanism principle.
I’ve just finished reading the testimony of Phil Koopman and Michael Barr given for the Toyota un-commanded acceleration lawsuit. Toyota settled after they were found guilty of acting with reckless disregard, but before the jury came back with their decision on punitive damages, and I’m not surprised.
Or ‘On the breakdown of Bayesian techniques in the presence of knowledge singularities’
One of the abiding problems of safety critical ‘first of’ systems is that you face, as David Collingridge observed, a double bind dilemma:
- Initially an information problem because ‘real’ safety issues (hazards) and their risk cannot be easily identified or quantified until the system is deployed, but
- By the time the system is deployed you now face a power (inertia) problem, that is control or change is difficult once the system is deployed or delivered. Eliminating a hazard is usually very difficult and we can only mitigate them in some fashion. Continue Reading…
In 1996 the European Space Agency lost their brand new Ariane 5 launcher on it’s first flight. Here’s a recently updated annotated version of that report. I’d also note that the software that faulted was written using Ada a ‘strongly typed’ language, which does point to a few small problems with the use of such languages.
No, not the alternative name for this blog. 🙂
I’ve just given the post Pitch ladders and unusual attitude a solid rewrite adding some new material and looking a little more deeply at some of the underlying safety myths.
Boeing’s Dreamliner program runs into trouble with lithium ion batteries
Lithium batteries performance in providing lightweight, low volume power storage has made them a ubiquitous part of modern consumer life. And high power density also makes them attractive in applications, such as aerospace, where weight and space are at a premium. Unfortunately lithium batteries are also very unforgiving if operated outside their safe operating envelope and can fail in a spectacularly energetic fashion called a thermal runaway (1), as occurred in the recent JAL and ANA 787 incidents.
Why sometimes simpler is better in safety engineering.
Resilience and common cause considered in the wake of hurricane Sandy
One of the fairly obvious lessons from Hurricane Sandy is the vulnerability of underground infrastructure such as subways, road tunnels and below grade service equipment to flooding events.
The New York City subway system is 108 years old, but it has never faced a disaster as devastating as what we experienced last night”
NYC transport director Joseph Lhota
Yet despite the obviousness of the risk we still insist on placing such services and infrastructure below grade level. Considering actual rises in mean sea level, e.g a 1 foot increase at Battery Park NYC since 1900, and those projected to occur this century perhaps now is the time to recompute the likelihood and risk of storm surges overtopping defensive barriers.
The following is an extract from Kevin Driscoll’s Murphy Was an Optimist presentation at SAFECOMP 2010. Here Kevin does the maths to show how a lack of exposure to failures over a small sample size of operating hours leads to a normalcy bias amongst designers and a rejection of proposed failure modes as ‘not credible’. The reason I find it of especial interest is that it gives, at least in part, an empirical argument to why designers find it difficult to anticipate the system accidents of Charles Perrow’s Normal Accident Theory. Kevin’s argument also supports John Downer’s (2010) concept of Epistemic accidents. John defines epistemic accidents as those that occur because of an erroneous technological assumption, even though there were good reasons to hold that assumption before the accident. Kevin’s argument illustrates that engineers as technological actors must make decisions in which their knowledge is inherently limited and so their design choices will exhibit bounded rationality.
In effect the higher the dependability of a system the greater the mismatch between designer experience and system operational hours and therefore the tighter the bounds on the rationality of design choices and their underpinning assumptions. The tighter the bounds the greater the effect of congnitive biases will have, e.g. such as falling prey to the Normalcy Bias. Of course there are other reasons for such bounded rationality, see Logic, Mathematics and Science are Not Enough for a discussion of these.
I just realised that I’ve used the term ‘design hypothesis’ throughout this blog without a clear definition of what one is. 🙂
So here it is.
A design hypothesis is a prediction that a specific design will result in a specific outcome. A design hypothesis must:
- Identify the designs provenance, e.g the theory, practice or standards from it is derived.
- Provide a concise description of the design.
- State what the design must achieve in a verifiable fashion.
- Clearly identify critical assumptions that support the hypothesis.
Note that the concept of a fault hypothesis can be seen as a particular and constrained form of design hypothesis as, after Powell (1992), a fault hypothesis specifies assumptions about the types of faults, the rate at which components fail and how components may fail for fault tolerant computing purposes.
I give a short example of of a design hypothesis in the Titanic Part I post.
Powell, D., Failure mode assumptions and assumption coverage. In Proc. of the 22nd IEEE Annual International Symposium on Fault-Tolerant Computing (FTCS-22) , p386–395, Boston, USA, 1992.
Just finished giving my post on Lessons from Nuclear Weapons Safety a rewrite.
The original post is, as the title implies, about what we can learn from the principled base approach to safety adopted by the US DOE nuclear weapons safety community. Hopefully the rewrite will make it a little clearer, I can be opaque as a writer sometimes. 🙂
P.S. I probably should look at integrating the 3I principles introduced into this post on the philosophy of safety critical systems.
One of the questions that we should ask whenever an accident occurs is whether we could have identified the causes during design? And if we didn’t, is there a flaw in our safety process?
This post is part of the Airbus aircraft family and system safety thread.
I’m currently reading Richard de Crespigny’s book on flight QF 32. In he writes that he felt at one point that he was being over whelmed by the number and complexity of ECAM messages. At that moment he recalled remembering a quote from Gene Kranz, NASA’s flight director, of Apollo 13 fame, “Hold it Gentlemen, Hold it! I don’t care about what went wrong. I need to know what is still working on that space craft.”.
The crew of QF32 are not alone in experiencing the overwhelming flood of data that a modern control system can produce in a crisis situation. Their experience is similar to that of the operators of the Three Mile island nuclear plant who faced a daunting 100+ near simultaneous alarms, or more recently the experiences of QF 72.
The take home point for designers is that, if you’ve carefully constructed a fault monitoring and management system you also need to consider the situation where the damage to the system is so severe that the needs of the operator invert and they need to know ‘what they’ve still got’, rather that what they don’t have.
The term ‘never give up design strategy’ is bandied around in the fault tolerance community, the above lesson should form at least a part of any such strategy.
In June of 2011 the Australian Safety Critical Systems Association (ASCSA) published a short discussion paper on what they believed to be the philosophical principles necessary to successfully guide the development of a safety critical system. The paper identified eight management and eight technical principles, but do these principles do justice to the purported purpose of the paper?
The MIL-STD-882 lexicon of hazard analyses includes the System Hazard Analysis (analysis) which according to the standard is intended to:
“…examines the interfaces between subsystems. In so doing, it must integrate the outputs of the SSHA. It should identify safety problem areas of the total system design including safety critical human errors, and assess total system risk. Emphasis is placed on examining the interactions of the subsystems.”
This sounds reasonable in theory and I’ve certainly seen a number toy examples touted in various text books on what it should look like. But, to be honest, I’ve never really been convinced by such examples, hence this post.
One of the canonical design principles of the nuclear weapons safety community is to base the behaviour of safety devices upon fundamental physical principles. For example a nuclear weapon firing circuit might include capacitors in the firing circuit that, in the event of a fire, will always fail to open circuit thereby safing the weapon. The safety of the weapon in this instance is assured by devices whose performance is based on well understood and predictable material properties.
In an article published in the online magazine Spectrum Eliza Strickland has charted the first 24 hours at Fukushima. A sobering description of the difficulty of the task facing the operators in the wake of the tsunami.
Her article identified a number of specific lessons about nuclear plant design, so in this post I thought I’d look at whether more general lessons for high consequence system design could be inferred in turn from her list.
In an earlier post I commented that in the QF72 incident the use of a geometric mean (1) instead of the arithmetic mean when calculating the aircrafts angle of attack would have reduced the severity of the subsequent pitch over. Which leads into the more general subject of what to do when the real world departs from our assumption about the statistical ‘well formededness’ of data. The problem, in the case of measuring angle of attack on commercial aircraft, is that the left and right alpha sensors are not truly independent measures of the same parameter (2). With sideslip we cannot directly obtain a true angle of attack (AoA) from any single sensor (3) so need to take the average (mean) of the measured AoA on either side of the fuselage (Gracey 1958) to determine the true AoA. Because of this variance between left and right we cannot use a median voting approach, as we can expect the two sensors the right side to differ from the one sensor on the left. As a result we end up having to use the mean of two sensor values (one from each side) as an estimate of the resultant central tendency.
I’ve recently been reading John Downer on what he terms the Myth of Mechanical Objectivity. To summarise John’s argument he points out that once the risk of an extreme event has been ‘formally’ assessed as being so low as to be acceptable it becomes very hard for society and it’s institutions to justify preparing for it (Downer 2011).
Why We Automate Failure
A recent post on the interface issues surrounding the use of side-stick controllers in current generation passenger aircraft led me to think more generally about the the current pre-eminence of software driven visual displays and why we persist in their use even though there may be a mismatch between what they can provide and what the operator needs.
The Mississippi River’s Old River Control Structure, a National Single Point of Failure?
Given the recent events in Fukushima and our subsequent western cultural obsession with the radiological consequences, perhaps it’s appropriate to reflect on other non-nuclear vulnerabilities. A case in point is the Old River Control Structure erected by those busy chaps the US Army Corp of Engineers to control the path of the Mississippi to the sea. Well as it turns out trapping the Mississippi wasn’t really such a good idea…
How the marking of a traffic speed hump provides a classic example of a false affordance and an unintentional hazard.Continue Reading...
According to veteran russian cosmonaut Oleg Kotov, quoted in a New Scientist article the soviet Buran shuttle (1) was much safer than the American shuttle due to fundamental design decisions. Kotov’s comments once again underline the importance to safety of architectural decisions in the early phases of a design.
I attended the annual Rail Safety conference for 2011 earlier in the year and one of the speakers was Group capt Alan Clements, the Director Defence Aviation Safety and Air Force Safety. His presentation was interesting in both where the ADO is going with their aviation safety management system as well as providing some historical perspective, and statistics.Continue Reading...
Just discovered a paper I co-authored for the 2006 AIAA Reno Conference on the Risk & Safety Aspects of Systems of Systems. A little disjointed but does cover some interesting problem areas for systems of systems.
A UAV and COMAIR near miss over Kabul illustrates the problem of emergent hazards when we integrate systems or operate existing systems in operational contexts not considered by their designers.Continue Reading...
A near disaster in space 40 years ago serves as a salutory lesson on Common Cause Failure (CCF)
Two days after the launch of Apollo 13 an oxygen tank ruptured crippling the Apollo service module upon which the the astronauts depended for survival, precipitating a desperate life or death struggle for survival. But leaving aside what was possibly NASA’s finest hour, the causes of this near disaster provide important lessons for design damage resistant architectures.
What a near miss flooding incident at a french reactor plant in 1999, it’s aftermath and the subsequent Fukushima plant disaster can tell us about fault tolerance and designing for reactor safety.Continue Reading...
The ABC’s treatment of the QF 32 incident treads familiar and slightly disappointing ground
While I thought that the ABC 4 Corners programs treatment of the QF 32 incident was a creditable effort I have to say that I was unimpressed by the producers homing in on a (presumed) Rolls Royce production error as the casus belli.
The report focused almost entirely upon the engine rotor burst and its proximal cause but failed to discuss (for example) the situational overload introduced by the ECAM fault reporting, or for that matter why a single rotor burst should have caused so much cascading damage and so nearly led to the loss of the aircraft.
Overall two out of four stars 🙂
If however your interested in a discussion of the deeper issues arising from this incident then see:
- Lessons from QF32. A discussion of some immediate lessons that could be learned from the QF 32 accident;
- The ATSB QF32 preliminary report. A commentary on the preliminary report and its strengths and weaknesses;
- Rotor bursts and single points of failure. A review and discussion of the underlying certification basis for commercial aircraft and protection from rotor burst events;
- Rotor bursts and single points of failure (Part II), Discusses differences between the damage sustained by QF 32 and that premised by a contemporary report issued by the AIA on rotor bursts;
- A hard rain is gonna fall. An analysis of 2006 American Airlines rotor burst incident that indicated problems with the FAA’s assumed rotor burst debris patterns; and
- Lies, damn lies and statistics. A statistical analysis, looking at the AIA 2010 report on rotor bursts and it’s underestimation of their risk.
On June 2, 2006, an American Airlines B767-223(ER), N330AA, equipped with General Electric (GE) CF6-80A engines experienced an uncontained failure of the high pressure turbine (HPT) stage 1 disk2 in the No. 1 (left) engine during a high-power ground run for maintenance at Los Angeles International Airport (LAX), Los Angeles, California.
To provide a better appreciation of aircraft level effects I’ve taken the NTBS summary description of the damage sustained by the aircraft and illustrated it with pictures taken of the accident by bystanders and technical staff.Continue Reading...