A near disaster in space 40 years ago serves as a salutary lesson on Common Cause Failure (CCF)
Two days after the launch of Apollo 13 an oxygen tank ruptured, crippling the Apollo service module upon which the astronauts depended and precipitating a desperate life or death struggle for survival. But leaving aside what was possibly NASA’s finest hour, the causes of this near disaster provide important lessons for designing damage resistant architectures.
On the Apollo spacecraft electrical power was produced by three redundant fuel cells (Figure 1). These cells combined hydrogen and oxygen to generate electrical power. All fuel cells were fed by two cylindrical liquid hydrogen tanks and two spherical liquid oxygen tanks, which also supplied the environmental control system. Again their duplicated nature provided redundancy in the event of a single oxygen or hydrogen tank failure.
But importantly the fuel cells and the feed tanks were physically located in a single fuel cell bay (sector 4) of the Apollo service module (Figure 2). This decision had important ramifications for the subsequent severity of the accident. The rupture of the number 2 oxygen tank caused an immediate mechanical shock that forced the oxygen valves closed on the number 1 and number 3 fuel cells, which left them operating for approximately a further three minutes on the oxygen left in the feed lines. The initial rupture also damaged the number 1 oxygen tank (or its pipework), so that its contents leaked out into space over the next 130 minutes, entirely depleting the SM’s oxygen supply. Without oxygen the fuel cells could not operate, causing a crippling loss of onboard power.
A subsequent NASA accident investigation made a number of detailed technical and procedural recommendations (see Chapter 5 part 4 of the NASA accident board report) to improve the safety of the Apollo spacecraft. But one thing that was not considered by the investigation was the contribution that the basic architecture of the vehicle made to the ultimate severity of the incident.
From both the post accident report and the photographic record of the mission it’s fairly obvious that the rupture of the O2 tank did a large amount of damage to adjacent components. In effect a single component failure caused a common cause (1) set of failures in other EPS equipment, which led in turn to loss of EPS service. The reason that this one failure could cause so much damage is that the designers had placed all the equipment within the one bay. Or more formally, the intrasystem colocation of redundant equipment within a single bay introduced spatial coupling, which in turn introduced inter-equipment dependency for an O2 reservoir rupture failure mode.
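The spatial coupling described above can be made concrete with a small sketch. It assumes a hypothetical layout table that maps each piece of equipment to a bay, and it flags any redundancy group whose members all sit in the same bay. The names (LAYOUT, REDUNDANCY_GROUPS, single_bay_groups) and the layout itself are illustrative, built only from the description in this post rather than from NASA engineering data.

```python
# Hypothetical layout drawn from the description above:
# every EPS item sits in sector 4 of the service module.
LAYOUT = {
    "fuel_cell_1": "sector_4",
    "fuel_cell_2": "sector_4",
    "fuel_cell_3": "sector_4",
    "o2_tank_1": "sector_4",
    "o2_tank_2": "sector_4",
    "h2_tank_1": "sector_4",
    "h2_tank_2": "sector_4",
}

# Groups of items that are meant to back each other up.
REDUNDANCY_GROUPS = {
    "fuel_cells": ["fuel_cell_1", "fuel_cell_2", "fuel_cell_3"],
    "o2_tanks": ["o2_tank_1", "o2_tank_2"],
    "h2_tanks": ["h2_tank_1", "h2_tank_2"],
}


def single_bay_groups(layout, groups):
    """Return {group_name: bay} for groups whose members all share one bay."""
    exposed = {}
    for name, members in groups.items():
        bays = {layout[member] for member in members}
        if len(bays) == 1:
            # Every redundant member is in the same bay, so one local
            # event (such as a tank rupture) can reach all of them.
            exposed[name] = bays.pop()
    return exposed


if __name__ == "__main__":
    for group, bay in single_bay_groups(LAYOUT, REDUNDANCY_GROUPS).items():
        print(f"{group}: all redundant members located in {bay}")
```

With the colocated layout every group is flagged. Moving even one member of a group to another bay in the table removes that group from the result, which is the kind of check that is cheap to run against a preliminary design.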
Now it’s always easy to be wise after the event, and there were undoubtedly persuasive reasons to locate all the electrical power systems within the one bay. For example, by clustering components, the piping, connections and, as a result, mass could all be kept to a minimum. However, in engineering one rarely gets something for nothing, and in this case the undetected trade off was an increased vulnerability to a common cause failure. This defeated the redundancy employed to improve the reliability of these safety critical functions (2).
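A rough way to see how a common cause erodes redundancy is the beta-factor model, a standard simplification in reliability analysis (there is no suggestion here that it was used in the Apollo design). It assumes that a fraction beta of each component's failure probability comes from a cause shared with its redundant twin. The function name and the numbers below are purely illustrative assumptions, not Apollo data.

```python
def redundant_pair_failure(q, beta):
    """Failure probability of a 1-out-of-2 redundant pair (beta-factor model).

    q    -- total failure probability of one component (assumed value)
    beta -- fraction of q attributed to a cause shared by both components
    """
    if not (0.0 <= q <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("q and beta must lie in [0, 1]")
    # Both components fail independently of one another.
    independent = ((1.0 - beta) * q) ** 2
    # A single shared event takes out both components at once.
    common = beta * q
    # The pair fails if either route occurs; the two routes are
    # treated as independent events.
    return 1.0 - (1.0 - independent) * (1.0 - common)


if __name__ == "__main__":
    q = 1e-3  # assumed per-component failure probability
    for beta in (0.0, 0.01, 0.1):
        p = redundant_pair_failure(q, beta)
        print(f"beta={beta:<5} pair failure probability = {p:.3e}")
```

With these assumed values, a beta of 0.1 makes the pair on the order of a hundred times more likely to fail than a fully independent pair would be. Once any shared cause exists, the common term dominates and the benefit of the second component largely disappears.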
Had the equipment been distributed around the service module, the failure of the number 2 O2 tank might have had a more limited effect. An alternative and possibly better design would have separated the components of the EPS, either by distance (in a second bay) or by a physical partition, to provide a measure of common cause failure resistance, as is done in multi-engine fighter aircraft to separate fuselage engine bays. This sort of system segregation is a basic architectural decision, one that could (and should) be made during the preliminary design for a spacecraft, or any other safety critical system for that matter. Conversely, these are also design decisions that usually cannot be retrofitted later, as the changes are usually prohibitively expensive (3). Designers of the contending replacements for the space shuttle (Blue Origin, Boeing, Sierra Nevada Corp. and SpaceX) take note.
1. A CCF is defined (after Ivor Rasmuson) as an event in which two or more components fail or are degraded within a selected time such that success of the mission would be uncertain and these component failures result from a single shared cause and coupling mechanism. The shared cause is not another component state because these are usually modelled explicitly in system models.
2. A compounding problem in managing CCF is that in general safety critical components are designed to be highly reliable, and therefore only a few independent failures are expected for a population of components.
Multiple failures are of course even rarer and therefore a statistically valid set of system CCF data is unlikely to be available. For this reason pooling of data from equivalent or pseudo-equivalent systems may be necessary to establish such a set.
3. Which is probably why the NASA accident board did not consider this in their recommendations.