Archives For System architecting

Strowger pre-selection

The NBN, an example of degraded societal resilience?

Back in the day the old Strowger telephone exchanges were incredibly tough electro-mechanical beasts, and great fun to play with as well. As an example of their toughness there’s the tale of how, during the Chilean ‘big one’, a Strowger unit was buried in the rubble of its exchange building but kept happily clunking away for a couple of days until the battery wore down. Early Australian exchanges were Strowgers, my father actually worked on them, and to power their DC lines they ran huge battery pairs that alternated between service and charging. That built-in brute-strength redundancy also minimised the effect of unreliable mains power on network services, and remember, mains power wasn’t that reliable back then. Fast forward to the 1989 Newcastle (NSW) earthquake and lo, our local exchange stayed up for only a couple of hours until its batteries died.

Continue Reading…


Why writing a safety case might (actually) be a good idea

Frequent readers of my blog will probably realise that I’m a little sceptical of safety cases; as Scrooge remarked to Marley’s ghost, “There’s more of gravy than of grave about you, whatever you are!” So too for safety cases, oft more gravy than gravitas about them in my opinion, regardless of what their proponents might think.

Continue Reading…

It is a common requirement to load or update applications over the air after a distributed system has been deployed; for mass-market embedded systems it’s in fact a fundamental necessity. Of course once you have the ability to load software remotely there’s a back door that you have to be concerned about, and if the software is part of a vehicle’s control system or an insulin pump controller the consequences of leaving that door unsecured can be dire. Doing this securely requires us to tackle the insecurities of the communications protocol head on.

One strategy is to insert a protocol ‘security layer’ between the stack and the application. The security layer then mediates between the application and the stack to enforce the system’s overall security policy. For example the layer could confirm:

  • that the software update originated from an authenticated source,
  • that the update had not been modified,
  • that the update itself had been authorised, and
  • that the resources required by the downloaded software conform to any onboard safety or security policy.
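By way of illustration, here’s a minimal sketch of such a mediation layer in Python, with purely hypothetical names, keys and policy values throughout; a fielded embedded system would verify an asymmetric signature (e.g. Ed25519) rather than the shared-secret HMAC used here to keep the example self-contained:

```python
# A minimal sketch of a protocol 'security layer' sitting between the stack
# and the application. All names, keys and policy limits are illustrative.
import hmac
import hashlib

DEVICE_KEY = b"key-provisioned-at-manufacture"             # hypothetical
RESOURCE_POLICY = {"max_flash_kb": 512, "max_ram_kb": 64}  # onboard policy

def screen_update(package: dict) -> bool:
    """Mediate a software update before the application ever sees it."""
    payload = package["payload"]

    # 1 & 2: authenticated source and unmodified payload. A MAC gives us
    # both here; a real system would verify an asymmetric signature.
    expected = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, package["mac"]):
        return False

    # 3: the update itself must have been authorised for this device.
    if package.get("approval") != "release-authority-token":  # stand-in
        return False

    # 4: declared resource demands must conform to the onboard policy.
    manifest = package["manifest"]
    return (manifest["flash_kb"] <= RESOURCE_POLICY["max_flash_kb"]
            and manifest["ram_kb"] <= RESOURCE_POLICY["max_ram_kb"])
```

The point of the layer is architectural: the application never talks to the raw stack, so the whole policy is enforced in one small, auditable place.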

There are also obvious economy of mechanism advantages when dealing with protocols like the TCP/IP monster. Who, after all, wants to mess around with the entirety of the TCP/IP stack, given that Richard Stevens took three volumes to describe the damn thing? Similarly, who wants to go through the entire process again when moving from IPv4 to IPv6? 🙂

Defence in depth

One of the oft-stated mantras of both system safety and cyber-security is that a defence in depth is required if you’re really serious about either topic. But what does that even mean? How deep? And depth of what exactly? Jello? Cacti? While such a statement has a reassuring gravitas, in practice it’s void of meaning unless you can point to an exemplar design and say: there, that is what a defence in depth looks like.

Continue Reading…

Here’s a companion tutorial to the one on integrity level partitioning. This addresses more general software hazards and how to deal with them. Again you can find a more permanent link on my publications page. Enjoy 🙂

Boeing 787-8 N787BA cockpit (Image source: Alex Beltyukov CC BY-SA 3.0)

The Dreamliner and the Network

Big complicated technologies are rarely (perhaps never) developed by one organisation. Instead they’re a patchwork quilt of individual systems developed by domain experts, with the whole stitched together by a single authority/agency. This practice is nothing new, it’s been around since the earliest days of the cybernetic era, and it’s a classic tool that organisations and engineers use to deal with industrial scale design tasks (1). But what is different is that we no longer design systems, and systems of systems, as loose federations of entities. We now think of and design our systems as networks, and thus our systems of systems have become ‘networks of networks’ that exhibit much greater degrees of interdependence.

Continue Reading…

MH370 Satellite Image (Image source: AMSA)

While once again the media has whipped itself into a frenzy of anticipation over the objects sighted in the southern Indian Ocean, we should all be realistic about the likelihood of finding wreckage from MH370.

Continue Reading…


The failure of NVP and the likelihood of correlated security exploits

In 1986, John Knight & Nancy Leveson conducted an experiment to empirically test the assumption of independence in N-version programming. What they found was that the hypothesis of independence of failures in N-version programs could be rejected at a 99% confidence level. While their results caused quite a stir in the software community, see their ‘A reply to the critics’ for a flavour, what’s of interest to me is what they found when they took a closer look at the software faults.

…approximately one half of the total software faults found involved two or more programs. This is surprisingly high and implies that either programmers make a large number of similar faults or, alternatively, that the common faults are more likely to remain after debugging and testing.

Knight, Leveson 1986
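To see why this matters, consider what independence predicts: if two versions fail on a random input with probabilities p_a and p_b, coincident failures should occur at roughly the product rate p_a × p_b. The toy sketch below (with invented numbers, not Knight and Leveson’s data) shows how decisively even a modest count of coincident failures rejects that hypothesis:

```python
# Toy test of the N-version independence assumption. The figures are
# invented for illustration; they are not the Knight & Leveson data set.
from math import exp

n_tests = 10_000
p_a = p_b = 0.01    # assumed per-input failure probability of each version
observed = 20       # inputs on which BOTH versions failed

lam = n_tests * p_a * p_b  # expected coincident failures if faults independent

# Poisson tail P(X >= observed | independence), summed term by term.
term = exp(-lam) * lam**observed
for i in range(1, observed + 1):
    term /= i              # term = e^-lam * lam^observed / observed!
tail, k = 0.0, observed
while term > 1e-300:
    tail += term
    k += 1
    term *= lam / k

print(f"expected {lam:.1f} coincident failures under independence, saw {observed}")
print(f"P(X >= {observed} | independence) ~ {tail:.1e}")   # vanishingly small
```

You’d reject independence at the 99% level with far less evidence than this; Knight and Leveson’s point is that programmers given the same specification tend to stumble in the same places.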

Continue Reading…

Separation of privilege and the avoidance of unpleasant surprises

Another post in an occasional series on how Saltzer and Schroeder’s eight principles of security and safety engineering seem to overlap in a number of areas, and what we might get from looking at safety from a security perspective. In this post I’ll look at the concept of separation of privilege.
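For the unfamiliar, Saltzer and Schroeder’s principle is essentially the two-key rule: where possible, a protected mechanism should demand two independent privileges rather than one. The safety parallel, the two-man rule for nuclear weapons release, is fairly direct. A minimal sketch, with hypothetical names and secrets throughout:

```python
# Separation of privilege, sketched: the protected action proceeds only when
# two independently held credentials concur. All values are illustrative.
import hashlib
import hmac

def _digest(secret: str) -> str:
    return hashlib.sha256(secret.encode()).hexdigest()

# Independently provisioned key-holders (hypothetical secrets).
OPERATOR_KEY = _digest("operator-secret")
SUPERVISOR_KEY = _digest("supervisor-secret")

def arm_system(operator_secret: str, supervisor_secret: str) -> bool:
    """Neither privilege alone is sufficient: the compromise of one
    key-holder, or one single-point human error, cannot trigger the action."""
    ok_op = hmac.compare_digest(_digest(operator_secret), OPERATOR_KEY)
    ok_sup = hmac.compare_digest(_digest(supervisor_secret), SUPERVISOR_KEY)
    return ok_op and ok_sup
```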

Continue Reading…

And not quite as simple as you think…

The testimony of Michael Barr in the recent Oklahoma Toyota court case highlighted problems with the design of Toyota’s watchdog timer for their Camry ETCS-i throttle control system, amongst other things, which got me thinking about the pervasive role that watchdogs play in safety critical systems. The great strength of watchdogs is of course that they provide a safety mechanism which resides outside the state machine, giving them fundamental design independence from what’s going on inside. By their nature they’re also simple, small scale beasts, thereby satisfying the economy of mechanism principle.
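As a sketch of the pattern (illustrative only, and certainly not Toyota’s implementation): the dog trips unless the monitored task itself kicks it in time. One pitfall raised in Barr’s testimony is kicking the watchdog from a hardware timer interrupt, which can keep firing quite happily after the tasks it’s supposed to vouch for have died:

```python
# A minimal watchdog sketch. Real watchdogs live in independent hardware;
# a daemon thread stands in for that independence here. Names illustrative.
import threading
import time

class Watchdog:
    def __init__(self, timeout_s: float, on_timeout):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout          # e.g. force a reset / safe state
        self._last_kick = time.monotonic()
        threading.Thread(target=self._monitor, daemon=True).start()

    def kick(self):
        # Must be called from the monitored task's own main loop, NOT from
        # a timer interrupt that outlives the task it is meant to vouch for.
        self._last_kick = time.monotonic()

    def _monitor(self):
        while time.monotonic() - self._last_kick <= self.timeout_s:
            time.sleep(self.timeout_s / 10)
        self.on_timeout()                     # deadline missed: act

# Usage: a healthy control loop kicks the dog each cycle; a hang trips it.
dog = Watchdog(timeout_s=0.5, on_timeout=lambda: print("entering fail-safe"))
for _ in range(3):
    dog.kick()
    time.sleep(0.1)
time.sleep(1.0)   # simulate a hung task: no more kicks, so the dog fires
```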

Continue Reading…

Toyota ECM (Image source: Barr testimony presentation)

Economy of mechanism and fail safe defaults

I’ve just finished reading the testimony of Phil Koopman and Michael Barr given for the Toyota un-commanded acceleration lawsuit. Toyota settled after they were found guilty of acting with reckless disregard, but before the jury came back with their decision on punitive damages, and I’m not surprised.

Continue Reading…

Singularity (Image source: Tecnoscience)

Or ‘On the breakdown of Bayesian techniques in the presence of knowledge singularities’

One of the abiding problems of safety critical ‘first of’ systems is that you face, as David Collingridge observed, a double bind dilemma:

  1. Initially an information problem, because ‘real’ safety issues (hazards) and their risks cannot easily be identified or quantified until the system is deployed (a toy illustration follows below), but
  2. by the time the system is deployed you face a power (inertia) problem: control or change is difficult once the system is in service. Eliminating a hazard at that point is usually very difficult, and we can generally only mitigate it in some fashion.

Continue Reading…
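The first horn of the dilemma is easy to put in Bayesian terms, and it hints at the ‘knowledge singularity’ of the title: for a first-of-a-kind system the posterior estimate of hazard risk is almost entirely the prior you assumed, because there has been no opportunity to gather evidence. A toy Beta-Binomial sketch, with numbers invented for illustration:

```python
# Toy illustration of Collingridge's information problem: with little or no
# operating experience the failure-rate estimate is mostly prior assumption.
a, b = 1.0, 100.0   # assumed Beta(a, b) prior on per-demand failure probability

for demands in (0, 10, 100, 10_000):          # failure-free demands observed
    mean = a / (a + b + demands)              # posterior mean after updating
    print(f"{demands:>6} failure-free demands -> posterior mean p = {mean:.2e}")

# Before deployment (0-100 demands) the estimate barely moves off the prior:
# any claim of quantified safety is largely the assumption we started with.
```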

New Battery boxes (Image source: Boeing)

The end of the matter…well almost

Continue Reading…

Just finished giving my post on Lessons from Nuclear Weapons Safety a rewrite.

The original post is, as the title implies, about what we can learn from the principle-based approach to safety adopted by the US DOE nuclear weapons safety community. Hopefully the rewrite will make it a little clearer, I can be opaque as a writer sometimes. 🙂

P.S. I probably should look at integrating the 3I principles introduced there into this post on the philosophy of safety critical systems.

Warsaw A320 Accident (Image Source: Unknown)

One of the questions we should ask whenever an accident occurs is whether we could have identified the causes during design, and if we didn’t, whether there’s a flaw in our safety process.

Continue Reading…

The MIL-STD-882 lexicon of hazard analyses includes the System Hazard Analysis (SHA), which according to the standard:

“…examines the interfaces between subsystems. In so doing, it must integrate the outputs of the SSHA. It should identify safety problem areas of the total system design including safety critical human errors, and assess total system risk. Emphasis is placed on examining the interactions of the subsystems.”

MIL-STD-882C

This sounds reasonable in theory, and I’ve certainly seen a number of toy examples touted in various textbooks on what it should look like. But, to be honest, I’ve never really been convinced by such examples, hence this post.

Continue Reading…

In an article published in the online magazine IEEE Spectrum, Eliza Strickland has charted the first 24 hours at Fukushima. It’s a sobering description of the difficulty of the task facing the operators in the wake of the tsunami.

Her article identified a number of specific lessons about nuclear plant design, so in this post I thought I’d look at whether more general lessons for high consequence system design could be inferred in turn from her list.

Continue Reading…

Why We Automate Failure

A recent post on the interface issues surrounding the use of side-stick controllers in current generation passenger aircraft led me to think more generally about the current pre-eminence of software driven visual displays, and why we persist in their use even though there may be a mismatch between what they can provide and what the operator needs.

Continue Reading…

The Mississippi River’s Old River Control Structure, a National Single Point of Failure?

Given the recent events at Fukushima and our subsequent western cultural obsession with the radiological consequences, perhaps it’s appropriate to reflect on other, non-nuclear, vulnerabilities. A case in point is the Old River Control Structure, erected by those busy chaps the US Army Corps of Engineers to control the path of the Mississippi to the sea. Well, as it turns out, trapping the Mississippi wasn’t really such a good idea…

Continue Reading…

Soviet Shuttle was safer by design

According to veteran Russian cosmonaut Oleg Kotov, quoted in a New Scientist article, the Soviet Buran shuttle (1) was much safer than the American shuttle due to fundamental design decisions. Kotov’s comments once again underline the importance to safety of architectural decisions made in the early phases of a design.

Continue Reading…

Just discovered a paper I co-authored for the 2006 AIAA Reno conference on the risk and safety aspects of systems of systems. It’s a little disjointed, but it does cover some interesting problem areas for systems of systems.

A UAV and COMAIR near miss over Kabul illustrates the problem of emergent hazards when we integrate systems or operate existing systems in operational contexts not considered by their designers.

Continue Reading…

A near disaster in space 40 years ago serves as a salutary lesson on Common Cause Failure (CCF)

Two days after the launch of Apollo 13 an oxygen tank ruptured, crippling the Apollo service module upon which the astronauts depended, and precipitating a desperate life-or-death struggle for survival. But leaving aside what was possibly NASA’s finest hour, the causes of this near disaster provide important lessons for designing damage-resistant architectures.

Continue Reading…

Blayais Plant (Image source: Wikipedia Commons)

What a near miss flooding incident at a French nuclear plant in 1999 and the Fukushima disaster of 2011 can tell us about fault tolerance and designing for reactor safety

Continue Reading…

QF32 Redux

29/03/2011

QF32 - No. 1 engine failure to shutdown

The ABC’s treatment of the QF 32 incident treads familiar and slightly disappointing ground

While I thought that the ABC 4 Corners program’s treatment of the QF 32 incident was a creditable effort, I have to say that I was unimpressed by the producers’ homing in on a (presumed) Rolls Royce production error as the casus belli.

The report focused almost entirely upon the engine rotor burst and its proximal cause but failed to discuss (for example) the situational overload introduced by the ECAM fault reporting, or for that matter why a single rotor burst should have caused so much cascading damage and so nearly led to the loss of the aircraft.

Overall two out of four stars 🙂

If however you’re interested in a discussion of the deeper issues arising from this incident then see:

  1. Lessons from QF32. A discussion of some immediate lessons that could be learned from the QF 32 accident;
  2. The ATSB QF32 preliminary report. A commentary on the preliminary report and its strengths and weaknesses;
  3. Rotor bursts and single points of failure. A review and discussion of the underlying certification basis for commercial aircraft and protection from rotor burst events;
  4. Rotor bursts and single points of failure (Part II). A discussion of the differences between the damage sustained by QF 32 and that premised by a contemporary report issued by the AIA on rotor bursts;
  5. A hard rain is gonna fall. An analysis of the 2006 American Airlines rotor burst incident, which indicated problems with the FAA’s assumed rotor burst debris patterns; and
  6. Lies, damn lies and statistics. A statistical analysis of the AIA 2010 report on rotor bursts and its underestimation of their risk.

On June 2, 2006, an American Airlines B767-223(ER), N330AA, equipped with General Electric (GE) CF6-80A engines experienced an uncontained failure of the high pressure turbine (HPT) stage 1 disk in the No. 1 (left) engine during a high-power ground run for maintenance at Los Angeles International Airport (LAX), Los Angeles, California.

To provide a better appreciation of the aircraft level effects, I’ve taken the NTSB summary description of the damage sustained by the aircraft and illustrated it with pictures taken of the accident by bystanders and technical staff.

Continue Reading…

A report by the AIA on engine rotor bursts and their expected severity raises questions about the levels of damage sustained by QF 32.

Continue Reading…

Lessons from QF 32

06/11/2010

The recent Qantas QF32 engine failure illustrates the problems of dealing with common cause failure

This post is part of the Airbus aircraft family and system safety thread.

Updated: 15 Nov 2012

Generally the reason we have more than one of anything on a passenger aircraft is that components fail, so independent redundancy is the cornerstone strategy for achieving the required levels of system reliability and safety. But while overall aircraft safety is predicated on the independence of these components, in reality the catastrophic failure of one component can also affect adjacent equipment and systems, leading to what are termed common cause failures.
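A quick way to see why common cause failure matters so much is the simple beta-factor model, in which some fraction β of each channel’s failure rate is assumed common to both channels. The sketch below uses invented figures, not certification data:

```python
# Beta-factor sketch: how common cause failure erodes a redundancy argument.
# All figures are invented for illustration, not certification data.
lam = 1.0e-4    # assumed per-hour failure rate of a single channel
t = 10.0        # exposure time of interest, hours

for beta in (0.0, 0.01, 0.1):   # fraction of failures shared by both channels
    p_independent = ((1 - beta) * lam * t) ** 2   # both channels fail alone
    p_common = beta * lam * t                     # one cause takes out both
    print(f"beta={beta:4.2f}: independent pair {p_independent:.1e}, "
          f"common cause {p_common:.1e}")

# Even at beta = 1% the common cause term is an order of magnitude bigger than
# the 'both fail independently' term that the redundancy claim rests on.
```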

Continue Reading…

Over the last couple of months I’ve posted on various incidents involving the Airbus A330 aircraft from the perspective of system safety. As these posts are scattered through my blog I thought I’d pull them together; the earliest post is at the bottom.

Continue Reading…

So far as we know, flight AF 447 fell out of the sky with its systems performing as their designers had specified, if not how they expected, right up to the point that it impacted the surface of the ocean.

So how is it possible that incorrect air data could simultaneously cause upsets in aircraft functions as disparate as engine thrust management, flight law protection and traffic avoidance?

Continue Reading…

Ariane 501 Launch

I was cleaning up some of my reference material and came across a copy of the ESA board of investigation report into the Ariane 501 accident. I’ve added my own personal observations, as well as those of other commentators, to the report.

Continue Reading…

A crosswalk of the interim accident reports issued by the ATSB and the BEA for the QF72 and AF447 accidents respectively shows that in both accidents the inertial reference component of the air data inertial reference unit (ADIRU) exhibiting the anomalous behaviour also declared a failure. Why did this occur?

Continue Reading…