Talking to one another not intuitive for engineers…
Archives For Uncategorized
One of the things that they don’t teach you at University is that as an engineer you will never have enough time. There’s never the time in the schedule to execute that perfect design process in your head which will answer all questions, satisfy all stakeholders and optimises your solution to three decimal places. Worse yet you’re going to be asked to commit to parts of your solution in detail before you’ve finished the overall design because for example, ‘we need to order the steel bill now because there’s a 6 month lead time, so where’s the steel bill?’. Then there’s dealing with the ‘internal stakeholders’ in the design process, who all have competing needs and agendas. You generally end up with the electrical team hating the mechanicals, nobody talking to structure and everybody hating manufacturing (1).
So good engineering managers, (2) spend a lot of their time managing the risk of early design commitment and the inherent concurrency of the program, disentangling design snarls and adjudicating turf wars over scarce resources (3). Get it right and you’re only late by the usual amount, get it wrong and very bad things can happen. Strangely you’ll not find a lot of guidance in traditional engineering education on these issues, but for my part what I’ve found to be helpful is a pragmatic design process that actually supports you in doing the tough stuff (4). Oh and being able to manage outsourcing of design would also be great (5). This all gets even more difficult when you’re trying to do vehicle design, which I liken to trying to stick 5 litres of stuff into a 4 litre container. So at the link below is my architecting approach to managing at least part of the insanity. I doubt that there’ll ever be a perfect answer to this, far too too many constraints, competing agenda’s and just plain cussedness of human beings. But if your last project was anarchy dialled up-to eleven it might be worth considering some of these possible approaches. Hope it helps, and good luck!
1. It is a truth universally acknowledged that engineers are notoriously bad at communicating.
2. We don’t talk about the bad.
3. Such as, cable and piping routes, whose sensors goes at the top of mast, mass budgets, and power constraints. I’m sure we’ve all been there.
4. My observation is that (some) engineers tend to design their processes to be perfect and conveniently ignore the ugly messiness of the world, because they are uncomfortable with being accountable for decisions made under uncertainty. Of course when you can’t both follow these processes and get the job done these same engineers will use this as a shield from all blame, e.g. ‘If you’d only followed our process..’ say they, `sure…’ say I.
5. A traditional management ploy to reduce costs, but rarely does management consider that you then need to manage that outsourced effort which takes particular mix of skills. Yet another kettle of dead fish as Boeing found out on the B787.
So here’s a question for the safety engineers at Airbus. Why display unreliable airspeed data if it truly is that unreliable?
In slightly longer form. If (for example) air data is so unreliable that your automation needs to automatically drop out of it’s primary mode, and your QRH procedure is then to manually fly pitch and thrust (1) then why not also automatically present a display page that only provides the data that pilots can trust and is needed to execute the QRH procedure (2)? Not doing so smacks of ‘awkward automation’ where the engineers automate the easy tasks but leave the hard tasks to the human, usually with comments in the flight manual to the effect that, “as it’s way too difficult to cover all failure scenarios in the software it’s over to you brave aviator” (3). This response is however something of a cop out as what is needed is not a canned response to such events but rather a flexible decision and situational awareness (SA) toolset that can assist the aircrew in responding to unprecedented events (see for example both QF72 and AF447) that inherently demand sense-making as a precursor to decision making (4). Some suggestions follow:
- Redesign the attitude display with articulated pitch ladders, or a Malcom’s horizon to improve situational awareness.
- Provide a fallback AoA source using an AoA estimator.
- Provide actual direct access to flight data parameters such as mach number and AoA to support troubleshooting (5).
- Provide an ability to ‘turn off’ coupling within calculated air data to allow rougher but more robust processing to continue.
- Use non-aristotlean logic to better model the trustworthiness of air data.
- Provide the current master/slave hierarchy status amongst voting channels to aircrew.
- Provide an obvious and intuitive way to to remove a faulted channel allowing flight under reversionary laws (7).
- Inform aircrew as to the specific protection mode activation and the reasons (i.e. flight data) triggering that activation (8).
As aviation systems get deeper and more complex this need to support aircrew in such events will not diminish, in fact it is likely to increase if the past history of automation is any guide to the future.
1. The BEA report on the AF447 disaster surveyed Airbus pilots for their response to unreliable airspeed and found that in most cases aircrew, rather sensibly, put their hands in their laps as the aircraft was already in a safe state and waited for the icing induced condition to clear.
2. Although the Airbus Back Up Speed Display (BUSS) does use angle-of-attack data to provide a speed range and GPS height data to replace barometric altitude it has problems at high altitude where mach number rather than speed becomes significant and the stall threshold changes with mach number (which it doesn’t not know). As a result it’s use is 9as per Airbus manuals) below 250 FL.
3. What system designers do, in the abstract, is decompose and allocate system level behaviors to system components. Of course once you do that you then need to ensure that the component can do the job, and has the necessary support. Except ‘apparently’ if the component in question is a human and therefore considered to be outside’ your system.
4. Another way of looking at the problem is that the automation is the other crew member in the cockpit. Such tools allow the human and automation to ‘discuss’ the emerging situation in a meaningful (and low bandwidth) way so as to develop a shared understanding of the situation (6).
5. For example in the Airbus design although AoA and Mach number are calculated by the ADR and transmitted to the PRIM fourteen times a second they are not directly available to aircrew.
6. Yet another way of looking at the problem is that the principles of ecological design needs to be applied to the aircrew task of dealing with contingency situations.
7. For example in the Airbus design the current procedure is to reach up above the Captain’s side of the overhead instrument panel, and deselect two ADRs…which ones and the criterion to choose which ones are not however detailed by the manufacturer.
8. As the QF72 accident showed, where erroneous flight data triggers a protection law it is important to indicate what the flight protection laws are responding to.
One of the perennial problems we face in a system safety program is how to come up with a convincing proof for the proposition that a system is safe. Because it’s hard to prove a negative (in this case the absence of future accidents) the usual approach is to pursue a proof by contradiction, that is develop the negative proposition that the system is unsafe, then prove that this is not true, normally by showing that the set of identified specific propositions of `un-safety’ have been eliminated or controlled to an acceptable level. Enter the term `hazard’, which in this context is simply shorthand for a specific proposition about the unsafeness of a system. Now interestingly when we parse the set of definitions of hazard we find the recurring use of terms like, ‘condition’, ‘state’, ‘situation’ and ‘events’ that should they occur will inevitably lead to an ‘accident’ or ‘mishap’. So broadly speaking a hazard is a explanation based on a defined set of phenomena, that argues that if they are present, and given there exists some relevant domain source (1) of hazard an accident will occur. All of which seems to indicate that hazards belong to a class of explanatory models called covering laws. As an explanatory class Covering laws models were developed by the logical positivist philosophers Hempel and Popper because of what they saw as problems with an over reliance on inductive arguments as to causality.
As a covering law explanation of unsafeness a hazard posits phenomenological facts (system states, human errors, hardware/software failures and so on) that confer what’s called nomic expectability on the accident (the thing being explained). That is, the phenomenological facts combined with some covering law (natural and logical), require the accident to happen, and this is what we call a hazard. We can see an archetypal example in the Source-Mechanism-Outcome model of Swallom, i.e. if we have both a source and a set of mechanisms in that model then we may expect an accident (Ericson 2005). While logical positivism had the last nails driven into it’s coffin by Kuhn and others in the 1960s and it’s true, as Kuhn and others pointed out, that covering model explanations have their fair share of problems so to do other methods (2). The one advantage that covering models do possess over other explanatory models however is that they largely avoid the problems of causal arguments. Which may well be why they persist in engineering arguments about safety.
1. The source in this instance is the ‘covering law’.
2. Such as counterfactual, statistical relevance or causal explanations.
Ericson, C.A. Hazard Analysis Techniques for System Safety, page 93, John Wiley and Sons, Hoboken, New Jersey, 2005.
Here’s a working draft of the introduction and first chapter of my book…. Enjoy 🙂
We are hectored almost daily basis on the imminent threat of islamic extremism and how we must respond firmly to this real and present danger. Indeed we have proceeded far enough along the escalation of response ladder that this, presumably existential threat, is now being used to justify talk of internment without trial. So what is the probability that if you were murdered, the murderer would be an immigrant terrorist?
In NSW in 2014 there were 86 homicides, of these 1 was directly related to the act of a homegrown islamist terrorist (1). So there’s a 1 in 86 chance that in that year if you were murdered it was at the hands of a mentally disturbed asylum seeker (2). Hmm sounds risky, but is it? Well there was approximately 2.5 million people in NSW in 2014 so the likelihood of being murdered (in that year) is in the first instance 3.44e-5. To figure out what the likelihood of being murdered and that murder being committed by a terrorist we just multiply this base rate by the probability that it was at the hands of a `terrorist’, ending up with 4e-7 or 4 chances in 10 million that year. If we consider subsequent and prior years where nothing happened that likelihood becomes even smaller.
Based on this 4 in 10 million chance the NSW government intends to build a super-max 2 prison in NSW, and fill it with ‘terrorists’ while the Federal government enacts more anti-terrorism laws that take us down the road to the surveillance state, if we’re not already there yet. The glaring difference between the perception of risk and the actuality is one that politicians and commentators alike seem oblivious to (3).
1. One death during the Lindt chocolate siege that could be directly attributed to the `terrorist’.
2. Sought and granted in 2001 by the then Liberal National Party government.
3. An action that also ignores the role of prisons in converting inmates to Islam as a route to recruiting their criminal, anti-social and violent sub-populations in the service of Sunni extremists.
The Sydney Morning Herald published an article this morning that recounts the QF72 midair accident from the point of view of the crew and passengers, you can find the story at this link. I’ve previously covered the technical aspects of the accident here, the underlying integrative architecture program that brought us to this point here and the consequences here. So it was interesting to reflect on the event from the human perspective. Karl Weick points out in his influential paper on the Mann Gulch fire disaster that small organisations, for example the crew of an airliner, are vulnerable to what he termed a cosmology episode, that is an abruptly one feels deeply that the universe is no longer a rational, orderly system. In the case of QF72 this was initiated by the simultaneous stall and overspeed warnings, followed by the abrupt pitch over of the aircraft as the flight protection laws engaged for no reason.
Weick further posits that what makes such an episode so shattering is that both the sense of what is occurring and the means to rebuild that sense collapse together. In the Mann Gulch blaze the fire team’s organisation attenuated and finally broke down as the situation eroded until at the end they could not comprehend the one action that would have saved their lives, to build an escape fire. In the case of air crew they implicitly rely on the aircraft’s systems to `make sense’ of the situation, a significant failure such as occurred on QF72 denies them both understanding of what is happening and the ability to rebuild that understanding. Weick also noted that in such crises organisations are important as they help people to provide order and meaning in ill defined and uncertain circumstances, which has interesting implications when we look at the automation in the cockpit as another member of the team.
“The plane is not communicating with me. It’s in meltdown. The systems are all vying for attention but they are not telling me anything…It’s high-risk and I don’t know what’s going to happen.”
Capt. Kevin Sullivan (QF72 flight)
From this Weickian viewpoint we see the aircraft’s automation as both part of the situation `what is happening?’ and as a member of the crew, `why is it doing that, can I trust it?’ Thus the crew of QF72 were faced with both a vu jàdé moment and the allied disintegration of the human-machine partnership that could help them make sense of the situation. The challenge that the QF72 crew faced was not to form a decision based on clear data and well rehearsed procedures from the flight manual, but instead they faced much more unnerving loss of meaning as the situation outstripped their past experience.
“Damn-it! We’re going to crash. It can’t be true! (copilot #1)
“But, what’s happening? copilot #2)
AF447 CVR transcript (final words)
Nor was this an isolated incident, one study of other such `unreliable airspeed’ events, found errors in understanding were both far more likely to occur than other error types and when they did much more likely to end in a fatal accident. In fact they found that all accidents with a fatal outcome were categorised as involving an error in detection or understanding with the majority being errors of understanding. From Weick’s perspective then the collapse of sensemaking is the knock out blow in such scenarios, as the last words of the Air France AF447 crew so grimly illustrate. Luckily in the case of QF72 the aircrew were able to contain this collapse, and rebuild their sense of the situation, in the case of other such failures, such as AF447, they were not.