For a long while, possibly forever, the assertion that it is impossible to test for the reliability of a system meeting the mythical 10^-9 failures per hour target has dominated thinking about high integrity systems. But maybe that just isn't true. You see, the statement rests on a critical assumption about the distribution of failure inter-arrival times. Traditional reliability theory, where this idea comes from, assumes a memoryless process in which the likelihood of failure is constant over time. But that is just an assumption, and we have no a priori reason to believe it's true. Yes, it makes the maths easier, but so what? That doesn't make it actually 'true', however convenient it may be.
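The memoryless assumption can be made concrete with a line or two of code. Under the exponential model (the parameter value below is purely illustrative, not from the text), having already survived s hours tells you nothing about the next t hours: the conditional survival probability equals the unconditional one.

```python
import math

# Exponential model: constant hazard rate lam (failures per hour).
# Survival function S(t) = exp(-lam * t).
lam = 1e-4  # illustrative failure rate, an assumption for the sketch

def survival(t):
    return math.exp(-lam * t)

# Memorylessness: P(T > s + t | T > s) == P(T > t), for any s.
s, t = 5000.0, 1000.0
conditional = survival(s + t) / survival(s)
unconditional = survival(t)
assert abs(conditional - unconditional) < 1e-12
```

This self-similarity is exactly what makes the maths tractable: the hazard never changes, so no amount of failure-free operation shifts the odds.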
What if instead we posit that the distribution of failure inter-arrival times is actually heavy tailed, so that as time passes failure becomes less likely: instead of the hazard rate increasing as our exposure increases, it actually decreases? The canonical example is waiting for a reply to an email: a response is quite likely in the first minutes and hours, but as time passes it becomes less and less likely. Train delays are similar: we'd expect a train to arrive within a couple of minutes, but the longer the wait drags on, the less likely its arrival becomes. Technically, distributions with this property have a decreasing failure rate, the hallmark of a heavy tail, and as a model it may well fit the occurrence distribution of latent design faults better.
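The decreasing hazard can be sketched with the Pareto distribution, a standard heavy-tailed example (the parameter values below are illustrative assumptions, not from the text). For a Pareto with survival S(t) = (xm/t)^alpha, the hazard rate works out to alpha/t, which falls as exposure grows:

```python
# Pareto (Type I) model: heavy-tailed, S(t) = (xm / t) ** alpha for t >= xm.
# Its hazard rate h(t) = alpha / t *decreases* with age -- the opposite of
# the constant-hazard exponential. Parameter values are illustrative.
alpha, xm = 1.5, 1.0

def hazard(t):
    return alpha / t  # instantaneous failure intensity at age t

# The longer the system has run without failing, the lower the current hazard:
rates = [hazard(t) for t in (10.0, 100.0, 1000.0)]
assert rates[0] > rates[1] > rates[2]
```

Which is the email-and-trains intuition in one line: h(1000) is a hundred times smaller than h(10).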
If failure inter-arrival times (we're focusing on design faults in particular) follow a heavy-tailed distribution, then guess what? It turns out that testing could actually be competitive with other assurance strategies in terms of cost per pound of assurance. That of course also throws into doubt a major justification for the integrity level, design assurance industry. All because of a simple assumption about failure distributions.
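Why testing becomes competitive can be shown by asking what a stretch of failure-free testing buys under each model (again, all parameter values here are illustrative assumptions): the probability of surviving the next delta hours, given t failure-free hours so far.

```python
import math

# Compare what t failure-free hours buy under the two models: the
# probability of then surviving a further window of delta hours.
# Parameter values are illustrative assumptions, not from the text.
lam, alpha, delta = 1e-4, 1.5, 1000.0

def exp_next(t):
    # Exponential: memoryless, so past survival earns no credit at all;
    # the answer is exp(-lam * delta) regardless of t.
    return math.exp(-lam * delta)

def pareto_next(t):
    # Pareto: S(t) = (xm/t)^alpha, so the conditional survival is
    # (t / (t + delta))^alpha, which tends to 1 as exposure t grows.
    return (t / (t + delta)) ** alpha

exp_vals = [exp_next(t) for t in (1e3, 1e5, 1e7)]
par_vals = [pareto_next(t) for t in (1e3, 1e5, 1e7)]
assert exp_vals[0] == exp_vals[1] == exp_vals[2]  # testing buys nothing extra
assert par_vals[0] < par_vals[1] < par_vals[2]    # each hour buys more assurance
```

Under the memoryless model every test hour is worth the same, which is why demonstrating 10^-9 per hour demands absurd test durations; under the heavy-tailed model each failure-free hour raises the conditional survival probability, so accumulated failure-free testing is genuine, compounding evidence.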