Archives For Complexity

Complexity: what is it, how do we deal with it, and how does it contribute to risk?

A small question for the ATSB

According to the preliminary ATSB report the crew of QF32 took approximately 50 minutes to process all the Electronic Centralised Aircraft Monitor (ECAM) messages. This was despite the normal crew of three being augmented by a check captain in training and a senior check captain.

Continue Reading…

Back in 1999 I co-authored this paper with Darren Burrowes, a colleague of mine on the ADI Minehunter project, to capture some of what we’d learned about emergent design attributes and their management on that project. Darren got to present the paper at INCOSE’s International Symposium in Brighton, England, in 1999.

Continue Reading...

Soviet Shuttle was safer by design

According to veteran Russian cosmonaut Oleg Kotov, quoted in a New Scientist article, the Soviet Buran shuttle (1) was much safer than the American shuttle due to fundamental design decisions. Kotov’s comments once again underline the importance to safety of architectural decisions in the early phases of a design.

Continue Reading…

Because they typically have unity ratio (1:1) pitch scales, aircraft primary flight displays provide a pitch display that is limited by the vertical field of view. Such a display can move very rapidly and be difficult to use in unusual attitude recoveries, becoming another adverse performance shaping factor for aircrew in such a scenario. Trials by the USAF have conclusively demonstrated that an articulated style of pitch ladder can reduce disorientation of aircrew in such situations.

Continue Reading...

Why more information does not automatically reduce risk

I recently re-read the article Risks and Riddles by Gregory Treverton on the difference between a puzzle and a mystery. Treverton’s thesis, taken up by Malcolm Gladwell in Open Secrets, is that there is a significant difference between puzzles, in which the answer hinges on a known missing piece, and mysteries, in which the answer is contingent upon information that may be ambiguous or even in conflict. Continue Reading…

The past is prologue to the present

I’m currently reading a report prepared by MIT’s Humans and Automation Lab on a conceptual design for the Altair lunar lander’s human machine interface. Continue Reading…

Recent work in complexity and robustness theory for engineered systems has highlighted that the architecture with which these systems are designed inherently leads to ‘robust yet fragile’ behaviour. This vulnerability has strong implications for the human operator when he or she is expected to intervene in response to the failure of the system.

Continue Reading...

The right attitude

27/05/2011

How the design of the Apollo Command Module Attitude Reference Indicator illustrates the importance of cultural cliches or precedents in coordinating human and software behaviour.

Continue Reading...

A UAV and COMAIR near miss over Kabul illustrates the problem of emergent hazards when we integrate systems or operate existing systems in operational contexts not considered by their designers.

Continue Reading...

For those interested, the interim report by Mike Weightman, the UK’s Chief Inspector of Nuclear Installations, on lessons from Fukushima has been released.

Continue Reading...

A near disaster in space 40 years ago serves as a salutary lesson on Common Cause Failure (CCF)

Two days after the launch of Apollo 13 an oxygen tank ruptured, crippling the Apollo service module upon which the astronauts depended for survival and precipitating a desperate life or death struggle. But leaving aside what was possibly NASA’s finest hour, the causes of this near disaster provide important lessons for designing damage resistant architectures.

Continue Reading…

Blayais Plant (Image source: Wikipedia Commons)

What a near miss flooding incident at a French nuclear plant in 1999 and the Fukushima 2011 disaster can tell us about fault tolerance and designing for reactor safety

Continue Reading…

How driver training problems for the M113 Armoured Personnel Carrier provide an insight into the ecology of interface design.

Continue Reading...

Reflections on design errors in the human machine interface

Having recently bought a new car I was driving home and noticed that the illuminated lighting controls were reflected in the right hand wing mirror. See the picture below for the effect. Such reflections are at best annoying, but in the worst case they could mask the lights of a car in the right hand lane and lead to a side swipe during lane changing.

Continue Reading…

Fukushima NPP March 17 (Image Source: AP)

There are few purely technical problems…

The Washington Post has discovered that concerns about the vulnerability of the Fukushima Daiichi plant to potential tsunami events were brushed aside at a review of nuclear plant safety conducted in the aftermath of the Kobe earthquake. Yet at other plants the Japanese Nuclear and Industrial Safety Agency (NISA) had directed the panel of engineers and geologists to consider tsunami events.

Continue Reading…

QF32 Redux

29/03/2011

QF32 - No. 1 engine failure to shutdown

The ABC’s treatment of the QF 32 incident treads familiar and slightly disappointing ground

While I thought that the ABC Four Corners program’s treatment of the QF 32 incident was a creditable effort, I have to say that I was unimpressed by the producers’ homing in on a (presumed) Rolls Royce production error as the casus belli.

The report focused almost entirely upon the engine rotor burst and its proximal cause, but failed to discuss (for example) the situational overload introduced by the ECAM fault reporting, or for that matter why a single rotor burst should have caused so much cascading damage and so nearly led to the loss of the aircraft.

Overall two out of four stars 🙂

If, however, you’re interested in a discussion of the deeper issues arising from this incident then see:

  1. Lessons from QF32. A discussion of some immediate lessons that could be learned from the QF 32 accident;
  2. The ATSB QF32 preliminary report. A commentary on the preliminary report and its strengths and weaknesses;
  3. Rotor bursts and single points of failure. A review and discussion of the underlying certification basis for commercial aircraft and protection from rotor burst events;
  4. Rotor bursts and single points of failure (Part II). A discussion of the differences between the damage sustained by QF 32 and that predicted by a contemporary report issued by the AIA on rotor bursts;
  5. A hard rain is gonna fall. An analysis of the 2006 American Airlines rotor burst incident that indicated problems with the FAA’s assumed rotor burst debris patterns; and
  6. Lies, damn lies and statistics. A statistical analysis, looking at the AIA 2010 report on rotor bursts and its underestimation of their risk.


Why we may be carrying an order of magnitude greater risk from aircraft engine rotor bursts than we thought

One of the ‘implicit’ conclusions of the 2010 AIA study on the threat posed by jet engine rotor bursts was that the fleet of modern aircraft designed to meet FAA Advisory Circular AC 20-128A also met the FAA’s established safety target of a 1 in 20 likelihood of a catastrophic loss in the event of an engine rotor burst.

Continue Reading…

On June 2, 2006, an American Airlines B767-223(ER), N330AA, equipped with General Electric (GE) CF6-80A engines experienced an uncontained failure of the high pressure turbine (HPT) stage 1 disk in the No. 1 (left) engine during a high-power ground run for maintenance at Los Angeles International Airport (LAX), Los Angeles, California.

To provide a better appreciation of aircraft level effects I’ve taken the NTSB summary description of the damage sustained by the aircraft and illustrated it with pictures taken of the accident by bystanders and technical staff.

Continue Reading...

QF 72 (Image Source: Terence Ong)

The QF 72 accident illustrates the significant effects that ‘small field’ decisions can have on overall system safety Continue Reading…

A report by the AIA on engine rotor bursts and their expected severity raises questions about the levels of damage sustained by QF 32.

Continue Reading...

Lessons from QF 32

06/11/2010

The recent Qantas QF32 engine failure illustrates the problems of dealing with common cause failure

This post is part of the Airbus aircraft family and system safety thread.

Updated: 15 Nov 2012

Generally the reason we have more than one of anything on a passenger aircraft is that we know components can fail, so independent redundancy is the cornerstone strategy for achieving the required levels of system reliability and safety. But while overall aircraft safety is predicated on the independence of these components, the reality is that the catastrophic failure of one component can also affect adjacent equipment and systems, leading to what are termed common cause failures.
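As a rough illustration of why common cause failure dominates, here’s a sketch using the standard beta factor model of CCF analysis, with illustrative numbers of my own rather than anything drawn from the QF32 investigation. For a duplicated channel with per-channel failure probability p, of which a fraction β is common cause,

$$P_{system} \approx \big((1-\beta)\,p\big)^2 + \beta\,p$$

With, say, p = 10^-4 and β = 0.05 the independent term contributes roughly 10^-8 while the common cause term contributes 5 × 10^-6, swamping the benefit the redundancy was supposed to buy.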

Continue Reading…

The Titanic effect

27/09/2010

So why did the Titanic sink? The reason highlights the role of implicit design assumptions in complex accidents and the interaction of design with the operation of safety critical systems

Continue Reading...

The fallout from the QF 72 in flight accident has now reached the courts with Australian Aviation reporting that passengers and crew have taken up a joint class action against Airbus and Northrop Grumman (the manufacturer of the faulty Air Data Inertial Reference Unit).

Continue Reading...

The effective use by humans of any transport system is a critical success factor in the development of such systems. Careful consideration of the interaction of ergonomic and functional design with the physical and cognitive capabilities and limitations of crew, passengers and maintainers is essential to assure safe, effective and profitable rail operations.

Continue Reading...

The reality is that when a pilot flies through an icing event or a driver steers through a skid the aircraft or car is not intelligent; the intelligence is actually in the head of the designer, and the automation is merely his proxy.

Continue Reading...

Lead Tangara car damage (Source: Commission report)

On the 31st of January 2003, at approximately 7:14 am, a four car Tangara passenger train on run C311 from Sydney Central to Port Kembla (G7) oversped on a downhill gradient leading into a curve and left the track. The train driver and six passengers were killed and the remaining passengers suffered various injuries ranging from minor bruising and lacerations to severe disabling injuries. Continue Reading…

Tweedle Dum and Dee (Image source: Wikipedia Commons)
How do ya do and shake hands, shake hands, shake hands. How do ya do and shake hands and state your name and business…

Lewis Carroll, Through the Looking Glass

You would have thought after the Knight and Leveson experiment that the theory that independently written software would only contain independent faults was dead and buried, another beautiful theory shot down by cold hard fact. But unfortunately, like many great errors, the theory of n-versioning keeps on keeping on (1).
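The arithmetic of the assumption is seductive; a sketch with purely illustrative figures (not numbers from the experiment itself):

$$P(\text{coincident failure}) = p_1\,p_2 = 10^{-3} \times 10^{-3} = 10^{-6} \quad \text{(if faults were independent)}$$

What Knight and Leveson found was that coincident failures occurred far more often than such a product predicts, because independently written versions tend to fail on the same ‘hard’ parts of the input space.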
Continue Reading…

Over the last couple of months I’ve posted on various incidents involving the Airbus A330 aircraft from the perspective of system safety. As these posts are scattered through my blog I thought I’d pull them together; the earliest post is at the bottom.

Continue Reading...

A330 Right hand AoA probes (Image source: ATSB)

I’ve just finished reading the ATSB’s second interim report on the QF 72 in flight upset that resulted in two un-commanded pitch over events (1). In this accident one of the air data inertial reference units (ADIRUs) provided erroneous data, in the form of transient spike values of the angle of attack (AoA) parameter, to the flight control computers, which then initiated two un-commanded extreme pitch overs.

This post is part of the Airbus aircraft family and system safety thread. Continue Reading…

One of the tenets of safety engineering is that simple systems are better. Many practical reasons are advanced to justify this assertion, but I’ve always wondered what, if any, theoretical justification there was for such a position.

Continue Reading...

So far as we know flight AF 447 fell out of the sky with its systems performing as their designers had specified, if not how they expected, right up to the point that it impacted the surface of the ocean.

So how is it possible that incorrect air data could simultaneously cause upsets in aircraft functions as disparate as engine thrust management, flight law protection and traffic avoidance?

Continue Reading...

The use of median value voting algorithms as part of fault tolerant design has become an almost ubiquitous design solution, especially for avionics systems. But have we really considered their suitability?
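For those unfamiliar with the technique, here is a minimal sketch of a mid-value (median) select over three redundant channels. It’s an illustrative toy in Python, not the voting logic of any particular avionics system, but it shows both the attraction of the approach and its blind spot:

```python
def mid_value_select(a, b, c):
    """Return the median of three redundant sensor channel values."""
    return sorted((a, b, c))[1]

# A hard-over failure on one channel is masked: the median always lies
# between the two 'good' values.
print(mid_value_select(10.1, 10.2, 99.9))   # -> 10.2

# But the voter has no notion of credibility: if two channels spike or
# drift together (a common cause), the median simply follows them.
print(mid_value_select(10.1, 55.0, 60.0))   # -> 55.0
```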

Continue Reading...

The TCAS II specification’s credibility window can provide us with an insight into the magnitude of the initial unreliable air data parameters in the AF 447 disaster.

Continue Reading...

Pitot sensor (Source: BEA)

The theory of Highly Optimised Tolerance (HOT) predicts that as technological systems evolve to become more robust to common perturbations they still remain vulnerable to rare events (Carlson, Doyle 2002) and this theory may give us an insight into the performance of modern integrated air data systems in the face of in-flight icing incidents. 

Continue Reading…

Invalid air data may have triggered the cabin pressure differential safety function on AF 447.

Continue Reading...

A cross walk of the interim accident investigation reports issued by the ATSB and the BEA for the QF72 and AF447 accidents respectively shows that, in both accidents, the inertial reference component of the air data inertial reference unit (ADIRU) that exhibited anomalous behaviour also declared a failure. Why did this occur?

Continue Reading...

The HAL effect

09/09/2009

Do we automate our cultural biases, and can this have an effect upon the safe coordination of crew and automation?

Continue Reading...

Author’s note. Below is my original post on the potential causes of the AF 447 cabin altitude advisory; I concluded that there were a number of potential causes, one of which could be an erroneous altitude input from the ADIRU. What I didn’t consider was that the altitude advisory could have been triggered by correct operation of the cabin pressure control system, see The AF 447 cabin vertical speed advisory and Pt II for more on this.

The last ACARS transmission received from AF 447 was the ECAM advisory that the cabin altitude (pressure) variation had exceeded 1,800 ft/min for greater than 5 seconds. While some commentators have taken this message to indicate that the aircraft had suffered a catastrophic structural failure, all we really know is that at that point there was a rapid change in reported cabin altitude. Given the strong indications of unreliable air data from other on-board systems, perhaps it’s worthwhile having a look for other potential causes of such rapid cabin pressure changes.

Continue Reading…

TCAS Indicator (Image Source: Public Domain)

What TCAS can tell us about AF447 (Updated 27 Sept 09)

The BEA interim report on the AF447 accident confirms that the Traffic Alert and Collision Avoidance System (TCAS) had become inoperative during the early part of the event sequence for an as yet unidentified reason. The explanation may actually be fairly straightforward and lie within the fault tolerance requirements of the TCAS specification. Continue Reading…

Flaws in the glass

19/07/2009

DO-178B and the B-777 9M-MRG Incident

In August 2005 a Boeing 777 experienced an in-flight upset caused by the aircraft’s Air Data Inertial Reference Unit (ADIRU) generating erroneous acceleration data. The software fault that caused this upset in turn raises questions about the DO-178 software development process. A subsequent investigation of the accident by the Australian Transport Safety Bureau (ATSB) identified that the following sequence had occurred (see the sketch after this list):

  • accelerometer #5 failed on the first of June in a false high value output mode,
  • the ADIRU excluded accelerometer #5 from use in its computations,
  • the ADIRU unit remained in service with this failed component (1),
  • power to the ADIRU was cycled (causing a system reset),
  • accelerometer #6 then failed in-flight,
  • accelerometer #6 was excluded from use by the ADIRU,
  • the ADIRU then re-admitted accelerometer #5 into its computations, and
  • erroneous acceleration values were output to the flight computer.
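The re-admission step is the troubling one. The sketch below is hypothetical pseudocode of my own devising, not the actual ADIRU implementation, but it shows how an exclusion list held only in volatile memory will silently forget a previously failed sensor across a power cycle:

```python
class AccelerometerVoter:
    """Hypothetical sketch only (not the actual ADIRU design): the
    exclusion list lives in volatile memory, so a power cycle erases
    the record of previously failed sensors."""

    def __init__(self):
        self.excluded = set()          # empty again after every reset

    def exclude(self, sensor_id):
        self.excluded.add(sensor_id)   # drop a faulty sensor from the vote

    def usable(self, sensors):
        # Return only the sensors not currently flagged as failed
        return {sid: v for sid, v in sensors.items() if sid not in self.excluded}

# Simplified in-service history:
voter = AccelerometerVoter()
voter.exclude(5)                  # accelerometer #5 fails high and is excluded
voter = AccelerometerVoter()      # power cycle / reset: the exclusion is forgotten
voter.exclude(6)                  # accelerometer #6 then fails in flight
print(voter.usable({4: 1.01, 5: 99.9, 6: 0.0}))
# -> {4: 1.01, 5: 99.9} : the previously failed #5 is silently back in use
```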

Continue Reading…

Reading the ATSB interim report on the QF72 in flight accident, one could easily overlook the statement, “…the crew reported that the (ECAM (1)) messages were constantly scrolling, and they could not effectively interact with the ECAM to action and/or clear the messages.” So why did the A330 ECAM display fail during such a critical event?

Continue Reading...

If the theory of Highly Optimised Tolerance (HOT) holds true then we should be able to see a change in the distribution of the severity of adverse events as the design paradigm for a family of systems moves from the ‘just make it work’ stage to the ‘optimise for robustness’ stage. This is something we can actually test through observation of real world systems.
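For context, the prediction from the HOT literature (Carlson and Doyle) is that systems optimised for robustness exhibit heavy tailed, roughly power law loss distributions, that is approximately

$$P(\text{loss} > x) \propto x^{-\alpha}$$

so we would expect frequent small events to be suppressed while the relative weight of rare, very large events grows; it is that shift in the shape of the distribution that any observational test would need to pick up.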

Continue Reading...

The statement by Airbus regarding the robustness of the Airbus AOA voting logic disclosed in the ATSB QF72 accident report raises some interesting questions as to what was actually meant by the term robustness.

Continue Reading...

The effect of poorly considered originating requirements (the recommendations of the Waterfall accident commissioner) upon system safety requirements for a passenger emergency door release function.

Continue Reading...

The QF 72 in flight pitch upset demonstrates the vulnerability of redundant and presumed fault tolerant systems to situations where the real world does not accord with the assumptions made by designers.

Continue Reading...

One of the interesting features of real world epidemics (and pandemics) is that they don’t follow a nice smooth logistic curve (the classic S shape) of a period of slow growth followed by an explosive growth phase, eventually plateauing out in the final burnout phase.
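For reference, the smooth curve in question is the solution of the textbook logistic growth equation (standard symbols, not tied to any particular epidemic model):

$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right), \qquad N(t) = \frac{K}{1 + \frac{K - N_0}{N_0}\,e^{-rt}}$$

Real outbreaks depart from this largely because the effective growth rate r is anything but constant.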

Continue Reading...

Small worlds and pandemics. Thinking about pandemics using network theory.

Continue Reading...