Why sometimes simpler is better in safety engineering.
In the early hours of Sunday 11th December 2005, a number of explosions occurred at Buncefield Oil Storage Depot in Hertfordshire. At least one of the initial explosions was of massive proportions and there was a large fire that engulfed most of the site. Over 40 people were injured and significant damage was done to commercial and residential properties in the vicinity. The fire burned for several days, destroying most of the site, emitting toxic plumes of smoke and highly contaminated runoff.
A cause and a response
The putative ‘cause’ of the disaster, according to the UK’s Competent Authority Strategic Management Group, was the failure of a tank Independent High Level Switch (IHLS) to actuate in response to a tank high level (overflow) state and thereby shut down pumping to the tank (1). The failure of the IHLS was, in turn, due to a handle on the switch being left in the down position by operators after they had performed a test of the switch, which rendered the switch inoperative (2).
The overflow itself was caused by operators continuing to run the system with an inoperative Automated Tank Gauging (ATG) system. Due to the practice of ‘working to alarms’ in the control room, the control room supervisor was not directly monitoring tank gauging and was therefore not alerted to the fact that the ATG had failed and that the tank was at risk of over-filling (3).
Shortly after the incident, the Buncefield Standards Task Group (BSTG) was formed with representatives from the Control of Major Accident Hazards (COMAH) Competent Authority and industry, to translate the lessons of Buncefield into guidance that industry could implement as rapidly as possible.
One of the BSTG’s key recommendations (BSTG 2007) was that overfill prevention systems be assessed against BS EN 61511:2004, and that operators demonstrate that the overall systems for tank filling control are of high integrity, with sufficient independence to ensure timely and safe shutdown to prevent tank overflow.
But what does BS 61511 (IEC 61508) get us?
In summary the BSTG, and by implication COMAH as the industry regulator, has directed a probabilistic approach using the SIL methodology of BS 61511 (the petrochemical industry translation of IEC 61508). Given the complexity of the method (4) and the fundamental problems with an integrity level approach, not to mention the inherent epistemic and ontological uncertainties of making probabilistic risk assessments for low frequency events, it’s appropriate to ask what the payoff is in using the 61511/61508 standard, and whether there is a simpler (and more trustworthy) way to achieve the same outcome.
Blurring the edges
Returning to the accident itself for a moment, the report makes clear, though it never states this explicitly, that a major problem from an operational safety perspective was the ‘blurring’ of the role of plant operation, performed by operators using the ATG, with that of safety. This gave the impression that there were multiple layers of protection for the plant when in fact there was only one, the IHLS.
In practice no reliance could or should have been placed upon plant operators, given the lack of clear operational safety procedures, the unreliability of the ATG and the operational practice of ‘operating to alarms’, which drove the system to the limits of safe capacity. In other words, the occurrence of a hazardous state due to operator error, ATG failure or a combination of the two is to be expected.
A simpler approach to BS 61511/IEC 61508
The question is of course whether there is an easier way to achieve the required level of safety, without recourse to 61511/61508. Given that we shouldn’t rely on the operators and ATG to assure safety, the IHLS becomes safety critical, as a failure of the IHLS to perform its function would result in an accident. Is there a simple way to deal with this design challenge?
One alternative, non-integrity level technique for safety critical systems is simply to apply the fail safe design principle to the IHLS. The principle may be stated as follows: in any safety critical system (or connection or sub element) a failure during a specified operating period should be assumed regardless of probability, and such a failure should not prevent either safe operation or safe control and recovery action by the operating staff (5).
Clearly a single sensor could fail in a number of ways, operator error being only one mode, and a single IHLS sensor architecture is inherently vulnerable to a single point of failure, violating the fail safe principle. The solution is fairly simple: add an additional sensor, ideally one based on a diverse sensing technique. Had a redundant diverse design been installed, the operator induced failure of the original IHLS would not have resulted in the overflow.
The great advantage of this simple fail safe design approach is its deterministic nature: it assumes the possibility of both an overflow and a component failure (6), and consequently it’s simple to apply and evaluate. Compare this to the complex qualitative or quantitative probabilistic risk analysis and budgeting that the 61511/61508 approach requires (7)(8).
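To make the redundancy argument concrete, here is a minimal sketch of one-out-of-two shutdown logic. The `LevelSwitch` class, the `operative` flag and the switch names are all hypothetical and are not drawn from the Buncefield reports or any real control system. An inoperative switch simply models a latent failure, such as a test handle left in the wrong position.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LevelSwitch:
    """Hypothetical model of one high level switch."""
    name: str
    operative: bool = True  # False models a latent, unannounced failure

    def trips(self, tank_overfull: bool) -> bool:
        # An inoperative switch never signals, whatever the tank level is.
        return self.operative and tank_overfull


def shutdown_demanded(switches: Iterable[LevelSwitch], tank_overfull: bool) -> bool:
    # One-out-of-N voting: any single switch is enough to stop pumping.
    return any(s.trips(tank_overfull) for s in switches)


# Single switch architecture with the switch left inoperative after a test.
single = [LevelSwitch("IHLS", operative=False)]

# Redundant architecture: the same failed switch plus a second, diverse one.
redundant = [
    LevelSwitch("IHLS", operative=False),
    LevelSwitch("diverse_switch", operative=True),
]

single_result = shutdown_demanded(single, tank_overfull=True)
redundant_result = shutdown_demanded(redundant, tank_overfull=True)
```

In this toy model the single switch design does not demand a shutdown when the tank overfills, while the redundant design does. The sketch ignores common cause failures, which is why the second sensor would ideally use a different sensing technique.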
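As a rough illustration of how simple the deterministic evaluation can be, the sketch below assumes an overflow, fails each component in turn, and checks whether shutdown is still delivered. The function names, component names and the two architecture predicates are hypothetical, chosen only to mirror the single and redundant switch designs discussed above.

```python
from typing import Callable, FrozenSet, Iterable, List

# An architecture is modelled as a predicate: given the set of components
# still working, does the system still shut down pumping on an overflow?
Architecture = Callable[[FrozenSet[str]], bool]


def single_points_of_failure(
    components: Iterable[str], delivers_shutdown: Architecture
) -> List[str]:
    """Return the components whose lone failure defeats the shutdown."""
    everything = frozenset(components)
    return [
        name
        for name in sorted(everything)
        if not delivers_shutdown(everything - {name})
    ]


def single_switch(working: FrozenSet[str]) -> bool:
    # Shutdown depends entirely on the one switch.
    return "IHLS" in working


def redundant_switches(working: FrozenSet[str]) -> bool:
    # Either switch alone is sufficient to demand shutdown.
    return "IHLS" in working or "diverse_switch" in working


single_spofs = single_points_of_failure(["IHLS"], single_switch)
redundant_spofs = single_points_of_failure(
    ["IHLS", "diverse_switch"], redundant_switches
)
```

Here `single_spofs` comes out as `["IHLS"]` and `redundant_spofs` as an empty list. No failure probabilities are needed to reach that conclusion, which is the point of the deterministic approach.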
Even if you elect to use probabilistic measures of safety risk, the use of a fail safe design principle in concert with such probabilistic arguments (for example to argue the achievement of a specific risk level for N>1 failures) is still of value as it reduces the epistemic risk associated with such analyses.
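As a hedged illustration of how the two approaches can work together, suppose two independent switches with probabilities of failure on demand p_A and p_B. These symbols and the numbers below are illustrative assumptions only, not figures from the Buncefield reports or from the standards.

```latex
P(\text{both switches fail on demand}) = p_A \, p_B,
\qquad \text{e.g. } p_A = p_B = 10^{-2} \;\Rightarrow\; p_A \, p_B = 10^{-4}
```

The product only holds if the failures really are independent. A common cause, such as the same test procedure disabling both switches, would break that assumption, which is one reason to prefer a diverse second sensor.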
Down the rabbit hole
Sadly the UK regulator and petrochemical industry have, like Alice’s White Rabbit, disappeared down the assurance rabbit hole of BS 61511/61508. The problem this poses for industry is that for each safety system on each site a complex and inherently ‘uncertain’ risk assessment must be performed and subsequently certified. Here the appropriate question to ask is whether such a process is really robust enough to assure safety. The lesson from other industries is that sometimes simple safety strategies work the best.
1. A summary of the accident causation according to the Competent Authority Strategic Management Group can be found in COMAH (2011).
2. The down position of the handle allowed the switch to be configured for use as either a low level or high level switch. In the case of Buncefield, however, only the high level mode of operation was required, and the handle should have been padlocked in the neutral (horizontal) position for operational use after the test. Unfortunately operating staff were not advised of that fact by the supplier and the valve handle was left in the down position.
3. The report, as many accident reports do, also introduced a number of what I would call ‘red herrings’, i.e. issues that did not directly contribute to the accident.
4. See for example the LOPA example analysis included in the BSTG report, or the Integrity Analysis of Joosten (2010).
5. The concept of Single Point of Failure (SPOF) resistance is an integral part of the aerospace community’s safety engineering principles.
6. And can therefore be typified as a possibilistic design approach.
7. Pick your poison in the 61508 universe. :)
8. Another problem with such analyses is that they are inherently vulnerable to ‘advocate’ bias. Had a QRA or PRA been performed on the Buncefield system I have no doubt that it would have indicated an acceptable level of safety.
BSTG, Safety And Environmental Standards for Fuel Storage Sites – BSTG Final Report, 2007.
COMAH, Buncefield: Why did it happen? Competent Authority Strategic Management Group, Crown Publishing 2011.
Joosten, J., Applying Buncefield Recommendations and IEC61508 and IEC 61511 Standards to Fuel Storage Sites, Honeywell Process Solutions, Whitepaper, WP-10-3-ENG, April 2010, Honeywell.