Archives For Complexity

Complexity, what is, how do we deal with it, and how does it contribute to risk.

One of the things that they don’t teach you at University is that as an engineer you will never have enough time. There’s never the time in the schedule to execute that perfect design process in your head which will answer all questions, satisfy all stakeholders and optimises your solution to three decimal places. Worse yet you’re going to be asked to commit to parts of your solution in detail before you’ve finished the overall design because for example, ‘we need to order the steel bill now because there’s a 6 month lead time, so where’s the steel bill?’. Then there’s dealing with the ‘internal stakeholders’ in the design process, who all have competing needs and agendas. You generally end up with the electrical team hating the mechanicals, nobody talking to structure and everybody hating manufacturing (1).

So good engineering managers, (2) spend a lot of their time managing the risk of early design commitment and the inherent concurrency of the program, disentangling design snarls and adjudicating turf wars over scarce resources (3). Get it right and you’re only late by the usual amount, get it wrong and very bad things can happen. Strangely you’ll not find a lot of guidance in traditional engineering education on these issues, but for my part what I’ve found to be helpful is a pragmatic design process that actually supports you in doing the tough stuff (4). Oh and being able to manage outsourcing of design would also be great (5). This all gets even more difficult when you’re trying to do vehicle design, which I liken to trying to stick 5 litres of stuff into a 4 litre container. So at the link below is my architecting approach to managing at least part of the insanity. I doubt that there’ll ever be a perfect answer to this, far too too many constraints, competing agenda’s and just plain cussedness of human beings. But if your last project was anarchy dialled up-to eleven it might be worth considering some of these possible approaches. Hope it helps, and good luck!


1. It is a truth universally acknowledged that engineers are notoriously bad at communicating.

2. We don’t talk about the bad.

3. Such as, cable and piping routes, whose sensors goes at the top of mast, mass budgets, and power constraints. I’m sure we’ve all been there.

4. My observation is that (some) engineers tend to design their processes to be perfect and conveniently ignore the ugly messiness of the world, because they are uncomfortable with  being accountable for decisions made under uncertainty. Of course when you can’t both follow these processes and get the job done these same engineers will use this as a shield from all blame, e.g. ‘If you’d only followed our process..’ say they, `sure…’ say I.

5. A traditional management ploy to reduce costs, but rarely does management consider that you then need to manage that outsourced effort which takes particular mix of skills. Yet another kettle of dead fish as Boeing found out on the B787.


1.  A Vehicle design process v1.0

So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?

In slightly longer form. If (for example) air data is so unreliable that your automation needs to automatically drop out of it’s primary mode, and your QRH procedure is then to manually fly pitch and thrust (1) then why not also automatically present a display page that only provides the data that pilots can trust and is needed to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’ where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software it’s over to you brave aviator” (3). This response is however something of a cop out as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:

  1. Redesign the attitude display with articulated pitch ladders, or a Malcom’s horizon to improve situational awareness.
  2. Provide a fallback AoA source using an AoA estimator.
  3. Provide actual direct access to flight data parameters such as mach number and AoA to support troubleshooting (5).
  4. Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
  5. Use non-aristotlean logic to better model the trustworthiness of air data.
  6. Provide the current master/slave hierarchy status amongst voting channels to aircrew.
  7. Provide an obvious and intuitive way to  to remove a faulted channel allowing flight under reversionary laws (7).
  8. Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).

As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish, in fact it is likely to increase if the past history of automation is any guide to the future.


1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps as the aircraft was already in a safe state and waited for the icing induced condition to clear.

2. Although the Airbus Back Up Speed Display (BUSS) does use angle-of-attack data to provide a speed range and GPS height data to replace barometric altitude it has problems at high altitude where mach number rather than speed becomes significant and the stall threshold changes with mach number (which it doesn’t not know). As a result it’s use is 9as per Airbus manuals) below 250 FL.

3. What system designers do, in the abstract, is decompose and allocate system level behaviors to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except ‘apparently’ if the component in question is a human and therefore considered to be outside’ your system.

4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).

5. For example in the Airbus design although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second they are not directly available to aircrew.

6. Yet another way of looking at the problem is that the principles of ecological design needs to be applied to the aircrew task of dealing with contingency situations.

7. For example in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel, and deselect two ADRs…which ones and the criterion to choose which ones are not however detailed by the manufacturer.

8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.

M1 Risk_Spectrum_redux

A short article on (you guessed it) risk, uncertainty and unpleasant surprises for the 25th Anniversary issue of the UK SCS Club’s Newsletter, in which I introduce a unified theory of risk management that brings together aleatory, epistemic and ontological risk management and formalises the Rumsfeld four quadrant risk model which I’ve used for a while as a teaching aid.

My thanks once again to Felix Redmill for the opportunity to contribute.  🙂

Strowger pre-selection

The NBN, an example of degraded societal resilience?

Back in the day the old Strowger telephone exchanges were incredibly tough electro-mechanical beasts, and great fun to play with as well. As an example of their toughness there’s the tale of how during the Chilean ‘big one’ a Strowger unit was buried in the rubble of it’s exchange building but kept happily clunking away for a couple of days until the battery wore down. Early Australian exchanges were Strowger’s, my father actually worked on them, and to power their DC lines they used to run huge battery pairs that alternated between service and charging. That built in brute strength redundancy also minimised the effect of unreliable mains power on network services, remember back in the day power wasn’t that reliable. Fast forward to 1989 when we had the Newcastle (NSW) earthquake and lo our local exchange only stayed up for a couple of hours until it’s batteries died.

Continue Reading…


Safety cases and that room full of monkeys

Back in 1943, the French mathematician Émile Borel published a book titled Les probabilités et la vie, in which he stated what has come to be called Borel’s law which can be paraphrased as, “Events with a sufficiently small probability never occur.” Continue Reading…


One of the problems that we face in estimating risk driven is that as our uncertainty increases our ability to express it in a precise fashion (e.g. numerically) weakens to the point where for deep uncertainty (1) we definitionally cannot make a direct estimate of risk in the classical sense. Continue Reading…


Deconstructing a tail strike incident

On August 1 last year, a Qantas 737-838 (VH-VZR) suffered a tail-strike while taking off from Sydney airport, and this week the ATSB released it’s report on the incident. The ATSB narrative is essentially that when working out the plane’s Takeoff Weight (TOW) on a notepad, the captain forgot to carry the ‘1’ which resulted in an erroneous weight of 66,400kg rather than 76,400kg. Subsequently the co-pilot made a transposition error when carrying out the same calculation on the Qantas iPad resident on-board performance tool (OPT), in this case transposing 6 for 7 in the fuel weight resulting in entering 66,400kg into the OPT. A cross check of the OPT calculated Vref40 speed value against that calculated by the FMC (which uses the aircraft Zero Fuel Weight (ZFW) input rather than TOW to calculate Vref40) would have picked the error up, but the crew mis-interpreted the check and so it was not performed correctly. Continue Reading…