Economy of mechanism and fail safe defaults
I’ve just finished reading the testimony of Phil Koopman and Michael Barr given for the Toyota un-commanded acceleration lawsuit. Toyota settled after they were found guilty of acting with reckless disregard, but before the jury came back with their decision on punitive damages, and I’m not surprised.
In an earlier post I alluded to the close parallels between security principles for systems and safety engineering, and after reading the testimony in the case of the Toyota Engine Control Module (ECM) there are two common security/safety principles that the Toyota software development team really should have tried to satisfy.
The first is what Saltzer and Schroeder call economy of mechanism, or what you and I might call the KISS principle.
Keep the design as simple and small as possible. This well-known principle applies to any aspect of a system, but it deserves emphasis for protection mechanisms for this reason: design and implementation errors that result in unwanted access paths will not be noticed during normal use (since normal use usually does not include attempts to exercise improper access paths). As a result, techniques such as line-by-line inspection of software and physical examination of hardware that implements protection mechanisms are necessary. For such techniques to be successful, a small and simple design is essential.
Most engineers, if asked, would agree with the intuitive principle that simpler is better, but Saltzer and Schroeder point out that it’s vital for systems where we’re trying to verify a negative property such as security or safety. Because normal use neither demonstrates the absence of improper access paths nor reveals the presence of design faults, we actually have to go out and look for them, behind every requirement and under every line of code. So from a purely practical point of view you need to keep these mechanisms as simple and small as possible, else you will never be done.
According to Barr’s testimony, however, Toyota’s throttle code, which also contained the software fail safe functions, was anything but economical. The code scored over 100 on the McCabe cyclomatic complexity scale, indicating it was effectively unmaintainable spaghetti, with the data structures being just as bad. Toyota also failed to separate the system fail safes from non-safety functions, such as throttle control, which hugely increased the amount of analytical/inspection grunt required to verify system safety properties, probably making it effectively impossible to verify their performance in practice.
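To give a feel for what a McCabe score measures, here is a minimal sketch that approximates it for Python source by counting decision points and adding one. The function name `cyclomatic_complexity` and the two sample snippets are my own illustrations, not anything from the testimony, and real tools count a few more constructs than this does.

```python
import ast

# Node types that each add one decision point (an approximation of McCabe's count).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity as 1 + the number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' has two short-circuit branches beyond the first operand.
            score += len(node.values) - 1
    return score


# Hypothetical sample: one decision, easy to inspect line by line.
SIMPLE = """
def clamp(x):
    if x < 0:
        return 0
    return x
"""

# Hypothetical sample: nested and compound decisions pile up quickly.
TANGLED = """
def tangled(a, b, c):
    if a and b:
        for i in range(3):
            if c or i:
                return i
    elif c:
        while a:
            a -= 1
    return 0
"""

if __name__ == "__main__":
    print(cyclomatic_complexity(SIMPLE), cyclomatic_complexity(TANGLED))
```

Even the ten-line `tangled` function is four times as complex as `clamp` by this measure, and each extra point is another path that an inspection has to cover.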
The second is the principle of fail safe defaults; that is, the system needs to be designed so that unanticipated design errors or omissions tend to fail in a safe fashion.
Fail-safe defaults: Base access decisions on permission rather than exclusion. … A conservative design must be based on arguments why objects should be accessible, rather than why they should not. In a large system some objects will be inadequately considered, so a default of lack of permission is safer. A design or implementation mistake in a mechanism that gives explicit permission tends to fail by refusing permission, a safe situation, since it will be quickly detected. On the other hand, a design or implementation mistake in a mechanism that explicitly excludes access tends to fail by allowing access, a failure which may go unnoticed in normal use. This principle applies both to the outward appearance of the protection mechanism and to its underlying implementation.
Saltzer and Schroeder focus here on the value of a conservative approach, which assures that where there are hidden design errors the system will still fail in both a highly visible and safe fashion, rather than in a hidden (and potentially unsafe) fashion. That first part, that failures should be visible, is oft forgot and just as critical as the second.
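The difference between the two defaults is easy to show in a few lines. What follows is only a sketch: the roles, actions and both tables are hypothetical, made up to illustrate what happens when a designer forgets to consider a case.

```python
import logging

log = logging.getLogger("access")

# Hypothetical permission table: anything not listed here is refused.
ALLOWED = {
    ("operator", "read_speed"),
    ("engineer", "read_speed"),
    ("engineer", "write_calibration"),
}

# Hypothetical exclusion table: anything not listed here is allowed.
DENIED = {
    ("operator", "write_calibration"),
}


def permit_by_permission(role: str, action: str) -> bool:
    """Default deny: grant only what has been explicitly argued for."""
    granted = (role, action) in ALLOWED
    if not granted:
        # Record the refusal so an omission is noticed rather than hidden.
        log.warning("refused %s -> %s", role, action)
    return granted


def permit_by_exclusion(role: str, action: str) -> bool:
    """Default allow: refuse only what someone remembered to exclude."""
    return (role, action) not in DENIED


if __name__ == "__main__":
    # 'erase_firmware' was added later and nobody updated either table.
    print(permit_by_permission("operator", "erase_firmware"))  # refused and logged
    print(permit_by_exclusion("operator", "erase_firmware"))   # silently allowed
```

The same oversight produces a logged refusal in the first scheme and a silent grant in the second, which is exactly the asymmetry Saltzer and Schroeder describe.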
From a safety perspective the conservative approach is likewise to assure that even for unanticipated failure modes the system will still fail in a safe fashion, which usually also requires a layered approach to catching and dealing with system failures to ensure there are no ‘escapes’. Here Toyota’s design looks good on paper, with data mirroring (redundancy), fail safe modes, a watchdog supervisor and finally a separate U6 chip based monitor CPU. But in reality all these layers were subverted because Toyota failed to ensure:
- that system failures were visible and recorded;
- that all critical data was mirrored, which left single points of failure due to bit corruption, most significantly in the OSEK operating system’s critical arrays;
- separation (partitioning) of the fail safe (safety mechanisms) from the control functions;
- that the watchdog actually had teeth; and
- that the monitor function didn’t rely on unrealistic driver inputs before it would act, nor result in an unsafe failure mode (engine stall).
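To make the first two items concrete, here is a minimal sketch of a mirrored variable whose corruption is both detected and recorded. It assumes 16-bit values and a mirror stored as the bitwise complement, which is one common scheme and not a description of Toyota’s implementation. The names `Mirrored`, `fault_log` and `FALLBACK_THROTTLE` are hypothetical, and what actually counts as a safe fallback is a system-level decision that this sketch does not settle.

```python
MASK = 0xFFFF  # assumption: 16-bit values

FALLBACK_THROTTLE = 0  # hypothetical fallback; choosing it is a system-level safety decision

fault_log = []  # stand-in for a persistent fault record


class CorruptionError(Exception):
    """Raised when a value and its mirror no longer agree."""


class Mirrored:
    """Holds a value alongside its bitwise complement."""

    def __init__(self, value: int):
        self.write(value)

    def write(self, value: int) -> None:
        self.value = value & MASK
        self.mirror = ~self.value & MASK

    def read(self) -> int:
        # A value XOR its complement has every bit set; anything else is corruption.
        if (self.value ^ self.mirror) != MASK:
            raise CorruptionError(
                f"value=0x{self.value:04X} mirror=0x{self.mirror:04X}"
            )
        return self.value


def commanded_throttle(cell: Mirrored) -> int:
    """Return the stored command, or the fallback if the cell is corrupt."""
    try:
        return cell.read()
    except CorruptionError as exc:
        fault_log.append(str(exc))  # the failure is recorded, not swallowed
        return FALLBACK_THROTTLE


if __name__ == "__main__":
    cell = Mirrored(1200)
    print(commanded_throttle(cell))
    cell.value ^= 0x0040  # simulate a single bit flip in memory
    print(commanded_throttle(cell), fault_log)
```

The check is deliberately kept out of the code that computes the throttle command; a protection like this is only as good as its coverage, so any critical variable left unmirrored remains a single point of failure.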
My conclusion after reading through the testimony is that the fundamental reason the design was so flawed is that Toyota was engaged in an exercise of safety theatre. That is, they went about implementing a ‘safe’ design without reference to a set of interlocking and mutually supporting safety principles. As a result, while the design looked fine on the surface, in practice the implementation lacked the necessary rigour and attention to detail that would have ensured that the safety principles were met.