Fault hypothesis and external events
The hypothesis we form about the type of faults that we expect to occur dictates the type and amount of fault tolerance schemes, such as redundancy, that we might choose to implement in our system. But a fault hypothesis is also a set of assumptions, which may or may not prove to be correct in practice. The extent to which they hold is the assumption ‘coverage’, or more precisely the probability that the assumption proves true given that a component has failed (Powell 1995). In essence each assumption carries with it an element of epistemic risk.
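To get a feel for why coverage matters, here is a minimal sketch. It assumes a duplicated component whose redundancy only works when a failure falls within the assumed failure modes. The function name, the two-component arrangement, the independence of component failures and the numbers are all my own assumptions for illustration; none of them come from Powell's report.

```python
def duplex_failure_probability(p_fail: float, coverage: float) -> float:
    """Probability that a duplicated (one-out-of-two) system fails.

    p_fail   -- probability that a single component fails (assumed independent)
    coverage -- probability that the failure-mode assumption holds,
                given that the component has failed
    """
    if not (0.0 <= p_fail <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    # A failure outside the assumed modes defeats the redundancy outright.
    p_uncovered = p_fail * (1.0 - coverage)
    # A failure inside the assumed modes is tolerated unless both units fail.
    p_covered = p_fail * coverage
    # The system survives if no unit fails in an uncovered way and
    # the two units do not both suffer covered failures.
    p_survive = (1.0 - p_uncovered) ** 2 - p_covered ** 2
    return 1.0 - p_survive


perfect = duplex_failure_probability(p_fail=1e-3, coverage=1.0)
imperfect = duplex_failure_probability(p_fail=1e-3, coverage=0.99)
print(f"coverage 1.00: {perfect:.3e}")
print(f"coverage 0.99: {imperfect:.3e}")
```

Under these assumptions, a shortfall of just one percent in coverage makes the duplicated system roughly twenty times more likely to fail, because the uncovered failures bypass the redundancy altogether.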
So what insight could the concept of a fault hypothesis provide into an event such as Fukushima? Is it meaningful to talk about a fault hypothesis in the context of such external events? I think yes. A flooding event (or any other external event) may have any number of different causes and any number of different modalities. What we might reasonably assume about the causes of external events, how good our predictive powers really are, and how reasonable these assumptions turn out to be all go directly to the safety of nuclear plant design.
Our first example of the power of assumptions is the flooding incident that occurred at the French Blayais reactor complex in 1999. Here a combination of tide, storm surge and storm waves overtopped the coastal dyke around the plant, causing flooding onsite. The flooding caused widespread damage including loss of one of the two redundant service pumps of No 1 unit’s Essential Service Water System (ESSWS) and, for both Units 1 & 2, loss of the low-head safety injection and containment spray pumps of the Emergency Core Cooling System (ECCS). In summary, a single common cause event knocked out or degraded significant portions of the plant’s essential and emergency cooling services. Before the events at Blayais the extant French reactor design standard (RFS 1.2e (1984)) only required that river, dam, littoral (tide and storm surge) and estuary site flooding modes be considered. But in the Blayais incident the combined effects of storm surge, wave height and high tide exceeded the worst case assumed by the standard (2)(3).
Our second is the Fukushima Daiichi disaster, where the reactor safety study excluded consideration of the potential for tsunamis in the area. The tsunami that eventually arrived overtopped the plant’s sea wall by 8 metres, flooding the plant and knocking out station services. This led in turn to loss of reactor cooling, overheating of the reactor cores, hydrogen explosions and extreme damage to the reactor cores, with a subsequent heavy release of radioactive materials into the environment.
Clearly assumptions as to the mode (the type of flooding event) and the concurrency of events drove the type and level of design countermeasure. In the case of Blayais the designers, while adhering to the standard, failed to consider all the causes of flooding and their concurrent occurrence. In the case of the Fukushima safety study the authors excluded the possibility of tsunami driven flooding. This failure of specification resulted in what we could call an incompleteness fault in the plant’s specified fault hypothesis. The implication of such unspecified events is not that the plant will necessarily fail disastrously when they occur, but that after such an event the plant’s behaviour will be non-deterministic in nature (4). In the case of Blayais the loss of equipment and the resulting service ‘error states’ did not propagate to the system interface (5). In the case of Fukushima they did (Figure 1).
There is a subtle point here that highlights the importance of near miss events such as Blayais, and why we should not relax just because the error did not propagate into a major disaster. The real concern with such events is that our system has ended up in a non-deterministic (unspecified) state where we cannot predict the outcome. In such circumstances outcomes are dictated more by the vagaries of chance than by good design, usually with tragic results. So after a near miss event we should carefully review the circumstances to determine not just what went right and wrong, but also whether our hypotheses are truly complete.
Kopetz, in his paper on fault hypotheses for safety critical systems (Kopetz 2004), makes the point that for a safety critical system it is reasonable for us to expect that, even should the fault hypothesis be violated by reality, the system should not just give up the ghost. He points out that for control systems a violation of the fault hypothesis may occur due to a highly correlated set of external transient faults, and that in those circumstances a ‘never give up’ strategy (he uses the example of a quick restart) would be appropriate. Taking this principle and abstracting it to the more general case, we might observe that safety critical systems in general should be designed to meet a specified ‘never give up’ strategy that addresses what to do in the event that the system’s abilities are, hopefully only temporarily, overwhelmed. Which all starts to sound like we’re designing for resilience, doesn’t it?
The other part of Kopetz’s point is that we should always consider the possibility of what he calls ‘highly correlated external faults’ as part of our safety critical system’s fault hypothesis. Returning to our Blayais example for a moment, we could reasonably expect that any major storm will have significant and combined rainfall, wind and storm surge effects upon the littoral environment. So we should treat these disparate causes as dependent when considering the probability of their occurring simultaneously. Fairly obviously, in the case of Fukushima, earthquakes and tsunamis are also dependent events. Statistically this correlation of extreme events is termed heavy tail dependence, and that dependence can be a significant source of unidentified risk if we fail to acknowledge it.
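One way to make the idea of an incompleteness fault concrete is to write the fault hypothesis down as an explicit set of modes and check an event against it. The sketch below does only that. The mode labels are my own shorthand for the modes named in the text, the function names are hypothetical, and it deliberately ignores magnitude and concurrency, both of which also mattered at Blayais.

```python
# Flooding modes the pre-1999 standard required designers to consider.
# The labels are illustrative shorthand, not the wording of the standard.
PRE_1999_HYPOTHESIS = frozenset(
    {"river flood", "dam failure", "tide", "storm surge", "estuary flood"}
)


def unspecified_modes(event_modes, hypothesis=PRE_1999_HYPOTHESIS):
    """Return the modes of an event that the fault hypothesis leaves out."""
    return set(event_modes) - set(hypothesis)


def classify(event_modes, hypothesis=PRE_1999_HYPOTHESIS):
    """Label an event as inside or outside the fault hypothesis."""
    missing = unspecified_modes(event_modes, hypothesis)
    if missing:
        return "outside hypothesis: behaviour unspecified", sorted(missing)
    return "inside hypothesis: design response specified", []


# The 1999 Blayais event combined tide, storm surge and storm waves.
blayais_1999 = {"tide", "storm surge", "storm waves"}
print(classify(blayais_1999))
```

Anything the check reports as missing is a mode for which the design makes no promises, which is exactly the sense in which the plant’s behaviour becomes unspecified.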
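To see how much difference dependence can make, here is a minimal simulation sketch. It assumes two hazards, which I have labelled surge and waves, each modelled as a standard normal variable and correlated through a shared ‘storm’ factor. The model, the threshold, the correlation of 0.8 and the function name are all illustrative assumptions, not data from either site. A Gaussian model like this one if anything understates dependence in the far tails, so treat the result as indicative only.

```python
import math
import random


def joint_exceedance(rho: float, threshold: float = 2.0,
                     trials: int = 100_000, seed: int = 1) -> float:
    """Estimate P(surge > threshold and waves > threshold) by simulation.

    Both hazards are standard normal; rho in [0, 1] is their correlation,
    induced by a shared 'storm' factor (an illustrative assumption).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    shared, own = math.sqrt(rho), math.sqrt(1.0 - rho)
    hits = 0
    for _ in range(trials):
        storm = rng.gauss(0.0, 1.0)  # common cause driving both hazards
        surge = shared * storm + own * rng.gauss(0.0, 1.0)
        waves = shared * storm + own * rng.gauss(0.0, 1.0)
        if surge > threshold and waves > threshold:
            hits += 1
    return hits / trials


independent = joint_exceedance(rho=0.0)
dependent = joint_exceedance(rho=0.8)
print(f"independent hazards: {independent:.5f}")
print(f"correlated hazards:  {dependent:.5f}")
```

With these assumed values the correlated case produces simultaneous exceedances many times more often than the independent case, which is the risk we miss if we simply multiply the individual probabilities together.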
In the case of site wide flooding, as occurred at both Blayais and Fukushima, we might imagine the ‘never give up’ strategy comprising the hardening of specific below grade equipment rooms, pumps and cable runs to allow continued critical services should the flood barriers be overtopped. At a minimum it should include detailed recovery options and plans to allow restoration of essential services within the time limits imposed by reactor heating. Another way to look at the problem is to consider the dictum of ‘defence in depth’ from the perspective of medieval castle designers. If you look at a good castle design, you’ll generally find a series of concentric defences, with each interior ring being higher and tougher to overcome than the next outer ring. That is, not only is the next inner ring tougher to breach, but that inner ring can in turn help support the outer defences (6).
Drawing upon Kopetz’s (2004) suggested phases of design for a fault tolerant system, the last stage of design is to validate the assumptions that form the fault hypothesis. In the case of both Fukushima and the preceding incident at Blayais the assumptions made for the plant were violated by events in the real world. In both cases there was recognition of deficiencies in the safety regulations but no clear and assertive response to the problem.
By far the greatest value of any fault hypothesis is that it drives the designer to explicitly state key safety assumptions as just that, assumptions. If we cannot validate these assumptions before we deploy a system operationally then logically there exists an obligation to maintain a watching brief and, as experience accumulates, to review and reassess the assumptions that underpin our system’s fault tolerance (7). The necessary and continual review of the validity of critical design assumptions is to my mind the primary justification for maintaining a safety case for major technological systems throughout their life (8).
1. Note that throughout I use fault, failure and error as defined by Avizienis et al. (2001) for dependability. That is:
- Fault. The adjudged or hypothesised cause of an error. A fault is active when it produces an error, otherwise it is dormant.
- Failure. An event that occurs when the delivered service deviates from correct service. A failure is a transition from correct to incorrect service for a period termed a service outage. A system may fail because the system does not meet the specification, or because the specification did not specify the correct functionality.
- Error. That part of the system state that may cause a subsequent failure. The failure occurs when an error reaches the service interface and alters the service.
- Service. The service delivered by a system is its behavior as it is perceived by its user(s). A user is another system (physical, human) that interacts with the former at the service interface.
- Function. What the system is intended for, and is described by the system specification.
2. The Blayais flooding incident exposed design deficiencies in the:
- height and shape of the dykes,
- protection of below platform equipment rooms containing safety equipment,
- detection of flooding in affected rooms,
- warning systems,
- on-site organisation, given the difficulty of coordinating across all 4 units,
- post flood site access and off site communications,
- dependability of offsite power, and
- dependability of cooling water availability (due to flood debris clogging the inlet grates).
3. The 1998 annual safety review for the plant also identified the need for the sea walls to be raised to a uniform height of 5.7 m (19 ft) above NGF, although (unfortunately) the work had been postponed until 2002 before the 1999 flood occurred.
4. Nondeterministic in the sense that the plant has been designed to perform its required functions in a set of specified environments, so its behaviour in an abnormal (out of specification) environment will be ‘ad hoc’ or unknown. Other design decisions (unrelated to flood tolerance) will then determine the plant’s resistance to flooding.
5. As a historical note, after the Blayais flooding event the standard’s potential flooding modes (and combinations thereof) were revised to include:
- river flood,
- dam failure,
- storm surge,
- waves caused by wind on the sea,
- waves caused by wind on river or channel,
- reservoir swelling due to the operation of valves or pumps,
- deterioration of water retaining structures (other than dams),
- circuit or equipment failure,
- brief and intense rainfall on site,
- regular and continuous rainfall on site, and
- rises in groundwater.
6. As an example of the castle defence, see the Beznau NPP Notstand Building, a bunkered (1 m of concrete wall and waterproof doors) safety facility that provides the following major features:
- an independent feedwater system
- an independent RCP seal injection system
- an external recirculation system
- one of the three safety injection pumps replaced by a new pump in the bunker
- two ECCS accumulators
- a separate offsite grid supply and an independent diesel generator (with a crosstie to the other unit)
- a separate cooling water supply by an independent well water system (with a crosstie to the other unit)
- an independent instrumentation and control system
- a separate control room to actuate and control the Notstand equipment.
The Notstand systems are designed as a single-train redundant backup to the other plant systems. However, at any single failure of an active component, the operators can align another component to enable core cooling (for example by alignment of a crosstie to the other unit). The first and automatic train of the Notstand systems is designed to start and run automatically for at least 10 hours. All equipment and structures are designed to meet the current licensing requirements for external events (seismic, fire separation, etc.) (Source: M. Richner, Beznau NPP quoted in Epstein (2012)).
7. As was subsequently implemented at Blayais and other French nuclear plants after the Blayais flooding incident.
8. In the case of weather events this may also be necessary to address the effects of ongoing climate change upon littoral facilities. As such, climate change represents a change in the operational context of a system.
Avizienis, A., Laprie, J.C., and Randell, B., Fundamental Concepts of Dependability. Research Report No 1145, LAAS-CNRS 2001.
de Fraguier, E., Lessons Learned from 1999 Blayais Flood: Overview of the EDF Flood Risk Management Plan, EDF, published 3 March 2010.
Epstein, W., A PRA Practitioner Looks at the Fukushima Daiichi Accident, presentation, 19 March 2012.
Kopetz, H., On the Fault Hypothesis for a Safety-Critical Real-Time System, Technische Universität Wien, Proceedings of the Automotive Workshop, 2004.
Powell, D., Failure Mode Assumptions and Assumption Coverage, LAAS-CNRS, Research Report 91462, 1995.