How do ya do and shake hands, shake hands, shake hands. How do ya do and shake hands and state your name and business…
Lewis Carroll, Through the Looking-Glass
You would have thought that, after the Knight and Leveson experiments, the theory that independently written software contains only independent faults was dead and buried, another beautiful theory shot down by cold, hard fact. But unfortunately, like many great errors, the theory of N-versioning keeps on keeping on (1).
Now this would be of only passing interest if it weren’t for the fact that engineers actually continue to build systems that rely on what is effectively pseudo-scientific gibberish (2). Here’s the problem: to achieve a particular system reliability, engineers have traditionally used redundancy. So (for example) we could design a system with redundant channels that provides protection against the failure of any single channel.
Now redundancy relies on the assumption of independence (i.e. a failure occurring in any one channel is unrelated to failures in the others). If independence holds, then the probability of system failure is the product of the failure probabilities of the individual channels. But if our channels share a common component we can’t assume independence, and the total reliability of the system starts to drop. And that’s the problem with software: if we use the same software component in each channel then it’s possible they share a common fault (3) (4).
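To put rough numbers on that argument, here is a minimal sketch for a two-channel system that fails only when both channels fail. The function name, the simple "shared fault takes out both channels" model and the probabilities are all assumptions made for illustration, not figures from any of the cited studies.

```python
def dual_channel_failure(p_channel: float, p_common: float = 0.0) -> float:
    """Probability that both channels of a two-channel system fail on a demand.

    p_channel: chance a channel fails for reasons private to that channel.
    p_common:  chance a shared fault (e.g. identical software) fails both at once.
    Both parameters are hypothetical inputs to this illustration.
    """
    # With no shared fault, the channels fail independently: multiply.
    independent_part = p_channel ** 2
    # A shared fault defeats the redundancy outright, whatever the channels do.
    return p_common + (1.0 - p_common) * independent_part


p = 1e-3  # assumed per-channel failure probability
independent = dual_channel_failure(p)
shared = dual_channel_failure(p, p_common=1e-4)

print(f"independent channels: {independent:.3e}")
print(f"with a shared fault:  {shared:.3e}")
print(f"degradation factor:   {shared / independent:.0f}x")
```

Under these assumed numbers a shared fault that is ten times rarer than a single-channel failure still makes the system roughly a hundred times less reliable than the independence calculation claims, because the common term dominates the product.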
According to N-versioning theory, we get separate development teams to design and write the software for each channel, and because the teams are deemed independent (5) so too is the software. The only drawback to this approach is that, unfortunately, it doesn’t actually work and has been shown empirically not to work. To summarise the argument and evidence to date against N-versioning:
- to perform the same function, the same specification must be used introducing the likelihood of common specification errors or common errors of interpretation,
- ‘independent’ software versions were found to fail dependently on ‘difficult’ inputs (Eckhardt & Lee 1985),
- experiments by Knight & Leveson (1985) also rejected the hypothesis of independence for N-versioned programs,
- experiments by Eckhardt et al. (1990) also rejected the hypothesis of independence for N-versioned programs,
- experimental data obtained by Knight & Leveson (1989) found that coincident faults could be either logically related or input domain related (6), and
- this finding of input domain forced failures by Knight & Leveson (1989) led them to conclude that correlated failures will occur even if developers use differing algorithms and make different logical errors.
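The input domain point in the last two items can be made concrete with a toy example. The sketch below is purely hypothetical: three 'independently written' routines normalise a heading into the range 0 to 359 degrees, use different algorithms, and contain different logical errors, yet two of them fail on the same class of input (negative headings). The routine names and the simple majority voter are assumptions for illustration only.

```python
from collections import Counter


def version_a(deg: int) -> int:
    # Correct: Python's % returns a non-negative result for a positive modulus.
    return deg % 360


def version_b(deg: int) -> int:
    # Different algorithm, with a bug: negative headings are never wrapped up.
    while deg >= 360:
        deg -= 360
    return deg


def version_c(deg: int) -> int:
    # Different algorithm, different bug: negative headings are mirrored.
    return abs(deg) % 360


def vote(outputs):
    """Return the majority output, or None if no strict majority exists."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


versions = (version_a, version_b, version_c)
for heading in (30, 450, -90):
    outputs = [v(heading) for v in versions]
    print(heading, outputs, "->", vote(outputs))
```

On ordinary inputs the three versions agree and the voter looks redundant. On a negative heading the two faulty versions fail together, each in its own way, and the voter is left with no majority at all, even though no two teams made the same mistake.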
Eckhardt and Lee (1985) also found that small deviations from statistical independence would cause significant reductions in reliability, thereby calling into question the cost-benefit of developing two separate versions of a program.
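A small calculation shows how ‘difficult’ inputs produce this effect. The sketch assumes a two-point input profile in which most demands are easy and a few are hard, and it assumes that versions fail independently of one another once the input is fixed. The profile and its numbers are invented for illustration and are not data from Eckhardt and Lee.

```python
# Hypothetical input profile: (share of demands, chance a random version fails).
profile = [
    (0.99, 0.0),  # "easy" inputs: no version fails
    (0.01, 0.1),  # "difficult" inputs: each version fails 10% of the time
]


def mean_failure(profile):
    """Average failure probability of a single version over all demands."""
    return sum(weight * theta for weight, theta in profile)


def coincident_failure(profile):
    """Probability that two versions fail on the same demand.

    Given the input, the versions fail independently, so both fail with
    probability theta squared; we then average over the input profile.
    """
    return sum(weight * theta ** 2 for weight, theta in profile)


single = mean_failure(profile)
assumed = single ** 2            # what the independence assumption predicts
actual = coincident_failure(profile)

print(f"single version failure:      {single:.1e}")
print(f"pair, assuming independence: {assumed:.1e}")
print(f"pair, with difficult inputs: {actual:.1e}")
```

With these assumed figures the pair fails about a hundred times more often than the independence calculation predicts, purely because failures cluster on the hard inputs. The gap between the two figures is the variance of the per-input failure probability, so it vanishes only when every input is equally difficult.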
So if N-versioning has been shown to be ineffective, why do we still do it? Well, to be uncharitable, it does allow us to ignore software as a single point of failure in an otherwise redundant system. This single point vulnerability is a bit of a problem if one of the key safety goals your regulator sets is to prevent single points of failure, as is the case in the aviation community (7).
By using N-version software we can, however, advance a qualitative, i.e. hand-wavy, argument as to the level of independence between channels (CAST 24). The only problem with this is that it’s really an end run around the actual problem: it allows one to answer the regulator’s requirements ‘on paper’ but, as we’ve seen, does not actually eliminate the potential for a single point failure.
To me, the continued use of N-versioning despite ample evidence of its problematic nature is evidence of an underlying malaise within the engineering community when it comes to safety-critical software. The unpalatable fact is that there are limits to our ability and knowledge. But instead of recognising this lack, a wall of pseudo-scientific principle is erected so that industry and regulators can develop and certify applications in the belief that they have a methodology that will assure safety ‘out of the box’.
CAST 24, Certification Authorities Software Team (CAST), Position Paper CAST-24, Revision 2, Reliance on Development Assurance Alone When Performing a Complex and Full Time Critical Function, March 2006.
Eckhardt, D.E., and Lee, L.D., A Theoretical Basis for the Analysis of Multiversion Software Subject to Coincident Errors, IEEE Transactions on Software Engineering, vol. SE-11, no. 12, pp. 1511-1517, December 1985.
Eckhardt, D.E., et al., An Experimental Evaluation of Software Redundancy as a Strategy for Improving Reliability, NASA Technical Memorandum 102613, May 1990.
Knight, J.C., and Leveson, N.G., A Large Scale Experiment In N-Version Programming, Proc. of Fifteenth International Symposium on Fault-Tolerant Computing, pp 135-139, Ann Arbor, MI, 1985.
Sghairi, M., de Bonneval, A., Crouzet, Y., Aubert, J.-J., and Brot, P., Challenges in Building Fault-Tolerant Flight Control System for a Civil Aircraft, IAENG International Journal of Computer Science 35:4, Advanced Online Publication November 2009.
Scott, R.K., Gault, J.W., and McAllister, D.F., Fault-Tolerant Software Reliability Modeling, IEEE Transactions on Software Engineering, Vol. SE-13, No. 5, pp. 582-592, May 1987.
Shimeall, T.J., and Leveson, N.G., An Empirical Comparison of Software Fault Tolerance and Fault Elimination, Proc. 2nd Workshop on Software Testing, Verification, and Analysis, Banff, July 1988.
1. If you want all the technical evidence, see the work by Knight and Leveson (1985), Shimeall and Leveson (1988), Eckhardt et al. (1990), and Scott, Gault, and McAllister (1987).
2. For example, the Airbus aircraft flight control architectures all utilise N-version voting architectures. Interestingly, the Boeing 777 team elected to use different compilers but the same software.
3. The classic example of this being the Ariane 501 accident.
4. This problem is not unique to software; for example, a common hardware component may also introduce a common mode of failure into an N-redundant system.
5. Through whatever arbitrary ‘diverse’ development technique.
6. Failures could be related because they were the same logical error or because the same set of inputs triggered differing logical errors.
7. As it is set by the FAA and JAA, see for example the recent A380 flight control system development program (Sghairi, et al. 2009).