Archives For Complexity

Complexity: what it is, how we deal with it, and how it contributes to risk.

Tweedle Dum and Dee (Image source: Wikipedia Commons)

Revisiting the Knight, Leveson experiments

In the through-the-looking-glass world of high integrity systems, the use of N-version programming is often touted as a means to achieve extremely low failure rates without extensive V&V, due to the postulated independence of failures in independently developed software. Unfortunately this is hokum, as Knight and Leveson amply demonstrated with their N-version experiments, but there may actually be advantages to N-versioning, although not quite what its proponents originally expected.
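For readers unfamiliar with the mechanics, the heart of an N-version scheme is just a voter. Here's a minimal Python sketch (the versions and values are invented for illustration) showing both how voting masks a minority fault and why it offers no protection when versions fail together:

```python
from collections import Counter

def majority_vote(outputs):
    """Return the majority answer from N versions, or None when no
    majority exists and the voter cannot adjudicate."""
    answer, votes = Counter(outputs).most_common(1)[0]
    return answer if votes > len(outputs) // 2 else None

# Three 'independently developed' versions of x squared; the third is faulty.
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1]
result = majority_vote([v(4) for v in versions])  # the faulty minority is outvoted
```

If failures were truly independent the voter would almost never see two versions agree on the same wrong answer; Knight and Leveson's result is that in practice they do.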

Continue Reading…

System Safety Fundamentals Concept Cloud

There’s a very interesting site, run by a couple of Australian lads, called Text is Beautiful that provides some free tools to visually represent the relationships within a text. No, this isn’t the same as Wordle; these guys have gone beyond that to develop what they call a Concept Cloud. Colours in the Concept Cloud indicate distinct themes, and themes themselves represent rough groupings of related concepts. What’s a concept? Well, a concept is made up of several words, with each concept having its own unique thesaurus that is statistically derived from the text.

So without further ado I took the Fundamentals of System Safety course that I teach and dropped it in the hopper; the results, as you might guess, are above. Very neat to look at, and it also gives an interesting insight into how the concepts that the course teaches interrelate. Enjoy. :)

Well I can’t believe I’m saying this but those happy clappers of the software development world, the proponents of Agile, Scrum and the like might (grits teeth), actually, have a point. At least when it comes to the development of novel software systems in circumstances of uncertainty, and possibly even for high assurance systems.

Continue Reading…

Mars code: JPL and risk based design

Linguistic security, and the second great crisis of computing

Distributed systems need to communicate, or talk, through some sort of communications channel in order to achieve coordinated behaviour. This introduces the need for components firstly to recognise the difference between valid and invalid messages, and secondly to hold a common set of expectations about behaviour. Fairly obviously, these two problems of coordination have safety and security implications.

The problem is that up to now security has been framed in the context of code, but this approach fails to realise that recognition and context are essentially language problems, which brings us firstly to the work of Chomsky on languages and then to Turing on computation. As it turns out, above a certain level of expressive power in the Chomsky hierarchy, figuring out whether an input is valid runs into Turing’s halting problem. For such expressively powerful languages the question ‘is it valid?’ is simply undecidable, no matter how hard you try. This is an important point: it’s not just hard, or even really, really hard, it’s actually undecidable, so… don’t go there.

Enter the study of linguistic security, which addresses the vulnerabilities introduced by the hitherto unrecognised expressive power of the languages we communicate with.
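To make the point concrete, here's a minimal sketch of the linguistic security prescription: keep the input language far down the Chomsky hierarchy (regular, in this case) so that recognition always terminates and validity is decidable, and fully recognise a message before acting on it. The message format here is entirely hypothetical:

```python
import re

# A deliberately regular (Chomsky type-3) message grammar: "TEMP:<int>;"
# Recognition by a finite automaton (here, a simple regex) always halts,
# so the question 'is this input valid?' is decidable.
VALID_MESSAGE = re.compile(r"TEMP:-?\d{1,4};")

def recognise(message: str) -> bool:
    """Fully recognise the message before any processing; reject anything
    the grammar does not generate rather than trying to sanitise it."""
    return VALID_MESSAGE.fullmatch(message) is not None
```

The design choice is to constrain the protocol's grammar at the outset rather than bolt recognition onto an already Turing-powerful input language after the fact.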

Continue Reading…


The failure of NVP and the likelihood of correlated security exploits

In 1986, John Knight and Nancy Leveson conducted an experiment to empirically test the assumption of independence in N-version programming. What they found was that the hypothesis of independence of failures in N-version programs could be rejected at a 99% confidence level. While their results caused quite a stir in the software community, see their A reply to the critics for a flavour, what’s of interest to me is what they found when they took a closer look at the software faults.

…approximately one half of the total software faults found involved two or more programs. This is surprisingly high and implies that either programmers make a large number of similar faults or, alternatively, that the common faults are more likely to remain after debugging and testing.

Knight, Leveson 1986
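The practical sting of that finding is easy to show with a back-of-the-envelope calculation; the numbers below are purely illustrative assumptions, not Knight and Leveson's data:

```python
# Hypothetical per-demand failure probability of one version, and an
# assumed fraction of failures that stem from a common (shared) fault.
p = 1e-3
rho = 0.5

# Under independence, a 2-of-3 voted system fails only when two or
# more versions fail on the same demand.
p_independent = 3 * p**2 * (1 - p) + p**3

# A crude correlated-fault model: a common fault defeats the vote outright.
p_correlated = rho * p + (1 - rho) * p_independent
```

Even a modest fraction of common faults dominates the voted system's failure rate, swamping the orders-of-magnitude benefit the independence assumption promised.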

Continue Reading…

The kettle of doom

20/12/2013 — 1 Comment

My thanks to Charlie Stross for alerting us all to the unfortunate incident of the Russian kettle, bugged with malware intended to find unsecured Wi-Fi networks and co-opt them into a zombie botnet (1).

Now Charlie’s take on this revolves around the security/privacy implications of the ‘Internet of Things’ movement: making everything smart and web-savvy may sound really cool, but not if your toaster ends up spying on you, a creepy little foretaste of the panopticon future.

Continue Reading…

Toyota ECM (Image source: Barr testimony presentation)

Economy of mechanism and fail safe defaults

I’ve just finished reading the testimony of Phil Koopman and Michael Barr given for the Toyota uncommanded acceleration lawsuit. Toyota settled after they were found to have acted with reckless disregard, but before the jury came back with its decision on punitive damages, and I’m not surprised.

Continue Reading…

Singularity (Image source:  Tecnoscience)

Or ‘On the breakdown of Bayesian techniques in the presence of knowledge singularities’

One of the abiding problems of safety-critical ‘first of’ systems is that you face, as David Collingridge observed, a double-bind dilemma:

  1. Initially an information problem, because ‘real’ safety issues (hazards) and their risks cannot easily be identified or quantified until the system is deployed; but
  2. By the time the system is deployed you face a power (inertia) problem: control or change is difficult once the system is delivered. Eliminating a hazard is usually very difficult, and we can usually only mitigate it in some fashion.

Continue Reading…

BMW HUD concept (Image source: BMW)

Those who cannot remember the past of human factors are doomed to repeat it…

With apologies to the philosopher George Santayana, I’ll make the point that BMW’s Head-Up Display technology is in fact not the unalloyed blessing promised by BMW in their marketing material.

Continue Reading…

New Battery boxes (Image source: Boeing)

The end of the matter…well almost

Continue Reading…

No, not the alternative name for this blog. :)

I’ve just given the post Pitch ladders and unusual attitude a solid rewrite adding some new material and looking a little more deeply at some of the underlying safety myths.

X-Ray of JAL Battery (Image Source: NTSB)

A bit more on Boeing’s battery woes…

The NTSB has released more pictures of the JAL battery, and there are some interesting conclusions that can be drawn from the evidence to date.

Continue Reading…


JAL JA829J Fire (Image Source: Stephan Savoia AP Photo)

Boeing’s Dreamliner program runs into trouble with lithium ion batteries

Lithium batteries’ performance in providing lightweight, low-volume power storage has made them a ubiquitous part of modern consumer life. Their high power density also makes them attractive in applications, such as aerospace, where weight and space are at a premium. Unfortunately lithium batteries are also very unforgiving if operated outside their safe operating envelope, and can fail in a spectacularly energetic fashion called a thermal runaway (1), as occurred in the recent JAL and ANA 787 incidents.

Continue Reading…

Buncefield Tank on Fire (Image Source: Royal Chiltern Air Support Unit)

Why sometimes simpler is better in safety engineering.

Continue Reading…

While reading the 2006 Buncefield investigation report I came across this interesting statement.

“Such sensors are in widespread use and a number are available that have been certified for use in SIL2/3 applications in accordance with BS EN 61511 (1) .”

Buncefield Major Incident Investigation Report, Volume 2 Annex 4, p 28 (2006).

Continue Reading…

I was thinking about how the dubious concept of ‘safety integrity levels’ continues to persist in spite of protracted criticism. In essence, if the flaws in the concept of SILs are so obvious, why do they still persist?

Continue Reading…

Resilience and common cause considered in the wake of hurricane Sandy

One of the fairly obvious lessons from Hurricane Sandy is the vulnerability of underground infrastructure such as subways, road tunnels and below grade service equipment to flooding events.

“The New York City subway system is 108 years old, but it has never faced a disaster as devastating as what we experienced last night”

NYC transport director Joseph Lhota

Yet despite the obviousness of the risk we still insist on placing such services and infrastructure below grade level. Considering actual rises in mean sea level, e.g. a 1 foot increase at Battery Park NYC since 1900, and those projected to occur this century, perhaps now is the time to recompute the likelihood and risk of storm surges overtopping defensive barriers.
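As a sketch of what such a recomputation might look like, here's a toy Gumbel extreme-value model of annual maximum surge height; the distribution parameters and barrier height are hypothetical, and only the 1 foot rise comes from the observation above:

```python
import math

# Gumbel model of the annual maximum surge height (feet). The location,
# scale and barrier crest are assumed for illustration; only the ~1 ft
# sea level rise at Battery Park is taken from the text.
mu, beta, barrier = 6.0, 0.8, 10.0

def p_overtop(sea_level_rise_ft):
    """Annual probability that the surge tops the barrier, P(X > crest)."""
    z = (barrier - sea_level_rise_ft - mu) / beta
    return 1.0 - math.exp(-math.exp(-z))

before = p_overtop(0.0)  # historical baseline
after = p_overtop(1.0)   # with the 1 ft rise already observed
```

Even this toy model shows the characteristic non-linearity of extreme-value tails: a 1 foot shift in the baseline multiplies the annual overtopping probability several times over.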

Continue Reading…

Warsaw A320 Accident (Image Source: Unknown)

One of the questions that we should ask whenever an accident occurs is whether we could have identified the causes during design? And if we didn’t, is there a flaw in our safety process?

Continue Reading…

So what do gambling, thermodynamics and risk all have in common?

Continue Reading...

I’m currently reading Richard de Crespigny’s book on flight QF 32. In it he writes that he felt at one point that he was being overwhelmed by the number and complexity of ECAM messages. At that moment he recalled a quote from Gene Kranz, the NASA flight director of Apollo 13 fame: “Hold it, Gentlemen, hold it! I don’t care about what went wrong. I need to know what is still working on that spacecraft.”

The crew of QF32 are not alone in experiencing the overwhelming flood of data that a modern control system can produce in a crisis. Their experience is similar to that of the operators of the Three Mile Island nuclear plant, who faced a daunting 100+ near-simultaneous alarms, or more recently the experiences of QF 72.

The take-home point for designers is that, even if you’ve carefully constructed a fault monitoring and management system, you also need to consider the situation where the damage to the system is so severe that the needs of the operator invert, and they need to know ‘what they’ve still got’ rather than what they don’t have.

The term ‘never give up design strategy’ is bandied around in the fault tolerance community; the above lesson should form at least a part of any such strategy.
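As a sketch of what that inversion might look like in a monitoring design (the system inventory and threshold are invented for illustration, not drawn from any real aircraft):

```python
# Hypothetical system inventory for a fault reporting sketch.
ALL_SYSTEMS = {"eng1", "eng2", "eng3", "eng4",
               "hyd_green", "hyd_yellow", "elec_ac1", "elec_ac2"}

def status_report(failed, invert_threshold=0.4):
    """Report faults while damage is limited; once a large fraction of
    the system is lost, invert and report what is still working."""
    if len(failed) / len(ALL_SYSTEMS) < invert_threshold:
        return ("FAILED", sorted(failed))
    return ("STILL WORKING", sorted(ALL_SYSTEMS - failed))
```

The point of the threshold is ergonomic rather than logical: below it a failure list is the shorter, more useful message; above it, the capability list is.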

This post is part of the Airbus aircraft family and system safety thread.

For those of you interested in such things, here’s a link to a draft copy of Professor Nancy Leveson’s latest book on system safety, Engineering a Safer World, and her STAMP methodology.

Like Safeware, it looks set to become another classic of the system safety canon.

Here’s a draft of my latest paper to be presented at the Congress of Rail Engineering (CORE 2012) this year in Brisbane. This is more of a mainstream systems engineering paper on the mechanics of writing specifications and some of the conceptual problems in doing so.

Continue Reading...

In an article published in the online magazine Spectrum, Eliza Strickland has charted the first 24 hours at Fukushima: a sobering description of the difficulty of the task facing the operators in the wake of the tsunami.

Her article identified a number of specific lessons about nuclear plant design, so in this post I thought I’d look at whether more general lessons for high consequence system design could be inferred in turn from her list.

Continue Reading…

I’ve recently been reading John Downer on what he terms the Myth of Mechanical Objectivity. To summarise John’s argument, he points out that once the risk of an extreme event has been ‘formally’ assessed as being so low as to be acceptable, it becomes very hard for society and its institutions to justify preparing for it (Downer 2011).

Continue Reading…

Why We Automate Failure
A recent post on the interface issues surrounding the use of side-stick controllers in current generation passenger aircraft led me to think more generally about the current pre-eminence of software-driven visual displays, and why we persist in their use even though there may be a mismatch between what they can provide and what the operator needs.

Continue Reading…

Airbuses side stick improves crew comfort and control, but is there a hidden cost?

The Airbus FBW side stick flight control has vastly improved the comfort of aircrew flying the Airbus fleet, much as the original Airbus designers predicted (Corps 1988). But the implementation also expresses the Airbus approach to flight control laws and that company’s implicit assumptions about the way in which humans interact with automation and each other. Here the record is more problematic.

Continue Reading…

Did the designers of the Japanese seawalls consider all the factors?

In an eerie parallel with the Blayais nuclear power plant flooding incident it appears that the designers of tsunami protection for the Japanese coastal cities and infrastructure hit by the 2011 earthquake did not consider all the combinations of environmental factors that go to set the height of a tsunami.

Continue Reading…

Why something as simple as control stick design can break an aircrew’s situational awareness

One of the less often considered aspects of situational awareness in the cockpit is the element of knowing what the ‘guy in the other seat is doing’. This is a particularly important part of cockpit error management because without a shared understanding of what someone is doing it’s kind of difficult to detect errors.

Continue Reading…

Air France Tail plane (Image Source: Agencia Brasil CCLA 2.5)

Requirements completeness and the AF447 stall warning

Reading through the BEA’s precis of the data contained on Air France’s AF447 Flight Data Recorder, you find that during the final minutes of AF447 the aircraft’s stall warning ceased, even though the aircraft was still stalled, thereby removing a significant cue to the aircrew that they had flown the aircraft into a deep stall.

Continue Reading…

One of the areas of human factors in design is the physical layout of a seated workstation or control console to suit the functional reach capabilities of the user population. Should be simple right? Wrong.

Continue Reading...

Good and bad in the design of an Oliver Hazard Perry class frigate’s ECS propulsion control console HMI.

Continue Reading...

A small question for the ATSB

According to the preliminary ATSB report, the crew of QF32 took approximately 50 minutes to process all the Electronic Centralised Aircraft Monitor (ECAM) messages. This was despite the normal crew of three being augmented by a check captain in training and a senior check captain.

Continue Reading…

Back in 1999 I co-authored this paper with Darren Burrowes, a colleague of mine on the ADI Minehunter project, to capture some of what we’d learned about emergent design attributes and their management on that project. Darren got to present the paper at INCOSE’s International Symposium in Brighton, England, in 1999.

Continue Reading...

Soviet Shuttle was safer by design

According to veteran Russian cosmonaut Oleg Kotov, quoted in a New Scientist article, the Soviet Buran shuttle (1) was much safer than the American shuttle due to fundamental design decisions. Kotov’s comments once again underline the importance to safety of architectural decisions made in the early phases of a design.

Continue Reading…

Because they typically have unity ratio (1:1) pitch scales, aircraft primary flight displays provide a pitch display that is limited by the vertical field of view. This display can move very rapidly and be difficult to use in unusual attitude recoveries, becoming another adverse performance shaping factor for aircrew in such a scenario. Trials by the USAF have conclusively demonstrated that an articulated style of pitch ladder can reduce disorientation of aircrew in such situations.

Continue Reading...

Why more information does not automatically reduce risk

I recently re-read the article Risks and Riddles by Gregory Treverton on the difference between a puzzle and a mystery. Treverton’s thesis, taken up by Malcolm Gladwell in Open Secrets, is that there is a significant difference between puzzles, in which the answer hinges on a known missing piece, and mysteries, in which the answer is contingent upon information that may be ambiguous or even in conflict.

Continue Reading…

The past is prologue to the present

I’m currently reading a report prepared by MIT’s Human and Automation Labs on a conceptual design for the Altair lunar lander’s human machine interface.

Continue Reading…

Recent work in complexity and robustness theory for engineered systems has highlighted that the architecture with which these systems are designed inherently leads to ‘robust yet fragile’ behaviour. This vulnerability has strong implications for the human operator when he or she is expected to intervene in response to the failure of the system.

Continue Reading...

The Right Attitude

27/05/2011 — 1 Comment

How the design of the Apollo Command Module Attitude Reference Indicator illustrates the importance of cultural cliches or precedents in coordinating human and software behaviour.

Continue Reading...

A UAV and COMAIR near miss over Kabul illustrates the problem of emergent hazards when we integrate systems or operate existing systems in operational contexts not considered by their designers.

Continue Reading...

For those interested the interim report by Mike Weightman, the UK’s Inspector of Nuclear Installations, on lessons from Fukushima has been released.

Continue Reading...

A near disaster in space 40 years ago serves as a salutary lesson on common cause failure

Two days after the launch of Apollo 13 an oxygen tank ruptured, crippling the Apollo service module upon which the astronauts depended for survival and precipitating a desperate life or death struggle. But leaving aside what was possibly NASA’s finest hour, the causes of this near disaster provide important lessons for designing damage-resistant architectures.

Continue Reading…

Blayais Plant (Image source: Wikipedia Commons)

What a near miss flooding incident at a French nuclear plant in 1999 and the 2011 Fukushima disaster can tell us about fault tolerance and designing for reactor safety

Continue Reading…

How driver training problems for the M113 Armoured Personnel Carrier provide an insight into the ecology of interface design.

Continue Reading...

Reflections on design errors in the human machine interface

Having recently bought a new car, I was driving home and noticed that the illuminated lighting controls were reflected in the right hand wing mirror; see the picture below for the effect. These sorts of reflections are at best annoying, but in the worst case they could mask the lights of a car in the right hand lane and lead to a side swipe during a lane change.

Continue Reading…

Fukushima NPP March 17 (Image Source: )

There are few purely technical problems…

The Washington Post has discovered that concerns about the vulnerability of the Fukushima Daiichi plant to potential tsunami events were brushed aside at a review of nuclear plant safety conducted in the aftermath of the Kobe earthquake. Yet at other plants the Japanese Nuclear and Industrial Safety Agency (NISA) had directed the panel of engineers and geologists to consider tsunami events.

Continue Reading…

QF32 Redux

29/03/2011 — Leave a comment

QF32 - No. 1 engine failure to shutdown

The ABC’s treatment of the QF 32 incident treads familiar and slightly disappointing ground

While I thought that the ABC 4 Corners program’s treatment of the QF 32 incident was a creditable effort, I have to say that I was unimpressed by the producers’ homing in on a (presumed) Rolls-Royce production error as the casus belli.

The report focused almost entirely upon the engine rotor burst and its proximal cause but failed to discuss (for example) the situational overload introduced by the ECAM fault reporting, or for that matter why a single rotor burst should have caused so much cascading damage and so nearly led to the loss of the aircraft.

Overall two out of four stars :)

If however you’re interested in a discussion of the deeper issues arising from this incident then see:

  1. Lessons from QF32. A discussion of some immediate lessons that could be learned from the QF 32 accident;
  2. The ATSB QF32 preliminary report. A commentary on the preliminary report and its strengths and weaknesses;
  3. Rotor bursts and single points of failure. A review and discussion of the underlying certification basis for commercial aircraft and protection from rotor burst events;
  4. Rotor bursts and single points of failure (Part II), Discusses differences between the damage sustained by QF 32 and that premised by a contemporary report issued by the AIA on rotor bursts;
  5. A hard rain is gonna fall. An analysis of 2006 American Airlines rotor burst incident that indicated problems with the FAA’s assumed rotor burst debris patterns; and
  6. Lies, damn lies and statistics. A statistical analysis, looking at the AIA 2010 report on rotor bursts and its underestimation of their risk.


Why we may be carrying an order of magnitude greater risk from aircraft engine rotor bursts than we thought

One of the ‘implicit’ conclusions of the 2010 AIA study on the threat posed by jet engine rotor bursts was that the fleet of modern aircraft designed to meet FAA circular AC 20-128A also met the FAA-established safety target of a 1 in 20 likelihood of a catastrophic loss in the event of an engine rotor burst.
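The arithmetic behind such a target is simple enough to sketch; the burst rate below is an assumed illustrative figure, and only the 1-in-20 conditional target comes from the study:

```python
# Assumed uncontained rotor burst rate per flight hour (illustrative);
# the 1-in-20 conditional catastrophe target reflects the AC 20-128A
# style certification practice described in the post.
burst_rate_per_fh = 1e-7
p_catastrophe_given_burst = 1 / 20

# Fleet-level catastrophic loss rate = burst rate x conditional probability.
fleet_loss_rate = burst_rate_per_fh * p_catastrophe_given_burst

# If the real conditional probability were an order of magnitude worse,
# the fleet-level loss rate scales up by the same order of magnitude.
worse_loss_rate = burst_rate_per_fh * (10 / 20)
```

This multiplicative structure is why an underestimated conditional probability passes straight through to the bottom-line risk, the 'order of magnitude' at stake in the title.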

Continue Reading…

On June 2, 2006, an American Airlines B767-223(ER), N330AA, equipped with General Electric (GE) CF6-80A engines, experienced an uncontained failure of the high pressure turbine (HPT) stage 1 disk in the No. 1 (left) engine during a high-power ground run for maintenance at Los Angeles International Airport (LAX), Los Angeles, California.

To provide a better appreciation of aircraft-level effects, I’ve taken the NTSB summary description of the damage sustained by the aircraft and illustrated it with pictures taken of the accident by bystanders and technical staff.

Continue Reading...