What a near-miss flooding incident at a French nuclear plant in 1999 and the Fukushima disaster of 2011 can tell us about fault tolerance and designing for reactor safety
In safety-critical applications the safety requirement is often that the system continue to provide a service in the face of component faults. As a topical example, in nuclear reactors the provision of cooling water to the reactor after shutdown is vital to prevent decay heat from residual fission products causing the reactor core to overheat. Given that components fail, a fault-tolerant design is essential.
Developing a working hypothesis about these faults, component failure rates and how such components fail is a critical, perhaps the most critical, step in the development of such a fault tolerant system (1).
Fault hypothesis equals assumption
The hypothesis we form about the type of faults that we expect to occur dictates the type and amount of fault tolerance schemes (such as redundancy) that we subsequently implement in the system.
But a fault hypothesis is also a set of assumptions, which may or may not prove to be correct in practice. The extent to which they do so is the assumption ‘coverage’, or more precisely the probability that the assumption proves true given that a component has failed (Powell 1995). In essence each assumption carries with it an element of ontological risk.
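To make assumption coverage concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than Powell's own formulation: the function name, the `residual` parameter and the example numbers are hypothetical, and it pessimistically assumes that any fault falling outside the hypothesis defeats the tolerance mechanism.

```python
def failure_given_fault(coverage: float, residual: float) -> float:
    """Probability that a component fault leads to loss of service.

    coverage: P(fault hypothesis holds | a component has failed)
    residual: P(tolerance mechanism fails | hypothesis holds)

    Assumption (hypothetical, pessimistic): a fault outside the
    hypothesis always defeats the tolerance mechanism.
    """
    for value in (coverage, residual):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Either the hypothesis is violated, or it holds and the mechanism fails anyway.
    return (1.0 - coverage) + coverage * residual


# Illustrative numbers only: a perfect mechanism behind an imperfect hypothesis.
print(failure_given_fault(coverage=0.99, residual=0.0))
```

Under these assumptions even a perfect mechanism (a residual of zero) cannot do better than one minus the coverage, which is why the quality of the hypothesis can dominate the quality of the redundancy built on top of it.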
So what insight could the concept of a fault hypothesis provide into an event such as Fukushima? Is it meaningful to talk about a fault hypothesis in the context of an external event? I think the answer to that question is yes. A flooding event may occur due to a number of different causes; what we assume about these causes, and how reasonable these assumptions are, goes directly to the safety of the plant design.
The Blayais flooding incident
The best example of this is the flooding incident that occurred at the French Blayais reactor complex in 1999. Here a combination of tide, storm surge and storm wave heights overtopped the coastal dyke around the plant, causing severe flooding on site.
The flooding caused widespread damage, including loss of one of the two redundant service pumps of No 1 unit’s Essential Service Water System and loss of the low-head safety injection and containment spray pumps of the Emergency Core Cooling System. In summary, a single common event knocked out or degraded significant portions of the plant’s essential and emergency cooling services.
Before the events at Blayais the extant reactor design standard (RFS 1.2e (1984)) only required that river, dam, littoral (tide + storm surge) and estuary site flooding modes be considered. But in the Blayais incident the combined effects of storm surge, wave height and high tide exceeded the worst case assumed by the standard (2)(3).
Clearly, assumptions as to the mode of flood events and the concurrency of such events drove the type and level of design countermeasure. In the case of Blayais the designers, in adhering to the standard, failed to consider both a specific mode of flooding (e.g. wave height) and the possibility of such modes occurring simultaneously. In the case of Fukushima the flooding safety study excluded tsunami modes of flooding.
This failure to specify resulted in a fault in the specified fault hypothesis, i.e. incomplete assumption coverage, and both plants’ ability to operate in the face of such flooding became non-deterministic (4). Under nominal conditions this latent design fault was silent in both plants, waiting to be triggered by the right flood event (5).
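The effect of leaving one mode out of the hypothesis can be sketched in a few lines of Python. Everything here is hypothetical: the water-level contributions, the crest height and the function name are invented for illustration and are not the Blayais figures, and simply summing contributions is a crude stand-in for a real hydraulic analysis.

```python
from itertools import combinations

# Hypothetical contributions to water level, in metres above a common datum.
CONTRIBUTIONS = {"high_tide": 3.0, "storm_surge": 2.0, "wave_height": 1.5}
DYKE_CREST = 5.2  # hypothetical crest height, same datum


def overtopping_combinations(contributions, crest):
    """Return every set of concurrent modes whose summed level exceeds the crest."""
    names = sorted(contributions)
    exceeding = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            if sum(contributions[name] for name in combo) > crest:
                exceeding.append(combo)
    return exceeding


print(overtopping_combinations(CONTRIBUTIONS, DYKE_CREST))
```

With these made-up numbers the tide plus storm surge case stays below the crest, and only the three-way combination that includes wave height overtops it. A design checked against the two-mode case alone would therefore look adequate while carrying exactly the kind of gap in its hypothesis described above.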
There is also a subtle point here: making incorrect assumptions about common cause failure modes entails more ontological risk than making incorrect assumptions about individual components, because a common cause failure can (by definition) knock out system redundancy in one event.
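One common way to put numbers on this point is the beta-factor model from reliability engineering. That model is my illustration here, not something the sources above use. The sketch below assumes two redundant pumps; the function name and the probabilities are hypothetical.

```python
def both_pumps_fail(p: float, beta: float) -> float:
    """Approximate probability that two redundant pumps are both failed.

    p:    failure probability of a single pump (hypothetical value)
    beta: fraction of that probability attributed to a shared cause

    Rare-event approximation of the beta-factor model for a pair.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("p and beta must lie in [0, 1]")
    independent = (1.0 - beta) * p  # part of p that affects one pump only
    common = beta * p               # part of p that takes out both pumps at once
    # Both pumps fail independently, or one shared event fails them together.
    return independent ** 2 + common


# Illustrative numbers only.
print(both_pumps_fail(p=1e-3, beta=0.0))  # redundancy with no shared cause
print(both_pumps_fail(p=1e-3, beta=0.1))  # a tenth of failures share a cause
```

With these illustrative numbers, attributing a tenth of pump failures to a shared cause raises the chance of losing both pumps roughly a hundredfold, so an error in the common cause assumption costs far more than a comparable error in a single-component failure rate.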
Never give up strategies
Kopetz (2003) also made the point in his paper that for a safety-critical control system it is reasonable to expect that the control system should never give up, even if the fault hypothesis is violated by reality. Taking this principle and applying it more generally, all safety-critical systems (such as those we are discussing) should also have a ‘never give up’ strategy that addresses what to do when a combination of events overwhelms the defenses. In the case of flooding protection for a nuclear plant this should logically include what to do if the flood barrier is breached.
We should also consider the possibility of ‘highly correlated external faults’ as part of our fault hypothesis. For example, a major storm will usually have significant rainfall, wind and storm surge effects upon the littoral environment. Similarly, earthquakes and tsunamis cannot be considered independent events, and both will affect offsite support.
Is this assumption necessary, and do we really believe it?
Drawing upon Kopetz’s (2003) suggested phases of design for a fault-tolerant system, the last stage of design is to validate the assumptions that form the fault hypothesis. In the case of both Fukushima and the preceding incident at Blayais, the assumptions underlying the fault hypothesis were violated by real-world flooding modes. In both cases there was recognition of deficiencies in the safety regulations but no clear and assertive response to the problem.
Making the case for safety
But the greatest value of a fault hypothesis is that it drives the designer to explicitly state key safety assumptions as just that, assumptions. If we cannot validate these assumptions before we deploy a system operationally, then logically there exists an obligation to maintain a watching brief and, as experience accumulates, to review and reassess the system’s fault tolerance (6).
The necessary and continual review of the validity of critical design assumptions is to my mind the primary justification for maintaining an active safety program and argument for major technological systems throughout their life (7).
1. Note that throughout I use fault, failure and error as defined by Avizienis et al. (2001) for dependability. That is:
- Fault. The adjudged or hypothesised cause of an error. A fault is active when it produces an error, otherwise it is dormant.
- Failure. An event that occurs when the delivered service deviates from correct service. A failure is a transition from correct to incorrect service for a period termed a service outage. A system may fail because the system does not meet the specification, or because the specification did not specify the correct functionality.
- Error. That part of the system state that may cause a subsequent failure. The failure occurs when an error reaches the service interface and alters the service.
- Service. The service delivered by a system is its behavior as it is perceived by its user(s); a user is another system (physical, human) that interacts with the former at the service interface.
- Function. What the system is intended for, and is described by the system specification.
2. The Blayais flooding incident also exposed deficiencies in the:
- height and shape of the dykes,
- protection of below-platform equipment rooms containing safety equipment,
- detection of flooding in affected rooms,
- warning systems,
- on-site organisation (difficulties coordinating across all 4 units),
- post-flood site access and off-site communications (which were lost),
- dependability of offsite power, and
- dependability of cooling water availability (flood debris clogged the inlet grates).
3. The 1998 annual safety review for the plant also identified the need for the sea walls to be raised to a uniform height of 5.7 m (19 ft) above NGF, although (unfortunately) this work had been postponed until 2002 before the 1999 flood occurred.
4. Nondeterministic in the sense that the plant has been designed to perform its required functions in a set of specified environments, so its behaviour in an abnormal (out of specification) environment will be ‘ad hoc’ or unknown. Other design decisions (unrelated to flood tolerance) would determine the plant’s resistance to flooding.
5. As a historical note, after the Blayais flooding event the standard’s potential flooding modes (and combinations thereof) were revised to include:
- river flood,
- dam failure,
- storm surge,
- waves caused by wind on the sea,
- waves caused by wind on river or channel,
- reservoir swelling due to the operation of valves or pumps,
- deterioration of water retaining structures (other than dams),
- circuit or equipment failure,
- brief and intense rainfall on site,
- regular and continuous rainfall on site, and
- rises in groundwater.
6. As was implemented at Blayais and other French nuclear plants after the Blayais flooding incident.
7. In the case of weather events this may also be necessary to address the effects of ongoing climate change upon littoral facilities. As such, climate change represents a change in the operational context of a system.
1. Avizienis, A., Laprie, J.C., and Randell, B., Fundamental Concepts of Dependability, Research Report No 1145, LAAS-CNRS, 2001.
2. de Fraguier, E., Lessons Learned from 1999 Blayais Flood: Overview of the EDF Flood Risk Management Plan, EDF, published 3 March 2010.
3. Kopetz, H., On the Fault Hypothesis for a Safety-Critical Real-Time System, Technische Universität Wien, Proceedings of the Automotive Workshop, 2004.
4. Powell, D., Failure Mode Assumptions and Assumption Coverage, LAAS-CNRS, Research Report 91462, 1995.