Watchdogs, the unsung heroes of safety critical design


And not quite as simple as you think…

The testimony of Michael Barr in the recent Oklahoma Toyota court case highlighted, amongst other things, problems with the design of Toyota’s watchdog timer for the Camry ETCS-i throttle control system, which got me thinking about the pervasive role that watchdogs play in safety critical systems. The great strength of watchdogs is of course that they provide a safety mechanism which resides outside the state machine, giving them fundamental design independence from what’s going on inside it. By their nature they’re also simple, small scale beasts, thereby satisfying the economy of mechanism principle.

Watchdogs can, at least in principle, provide us with a highly assured defence against undetected design faults or circumstances that we have not anticipated; in other words, they’re a good defence against ontological hazards. I’d stress though that a watchdog timer (WDT) won’t catch all errors, but they’re very good for certain lockup states, or combinations thereof, and can form part of a broader never give up strategy. Watchdogs can also provide protection in concurrent systems via an associated watchdog timer task (1), although you do need to be very careful about the trade off between capability and robustness/economy.
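To make the lockup case concrete, here is a minimal sketch in C of a watchdog modelled in software. The names (`wdt_model_t`, `wdt_kick`, `wdt_simulate`) and the tick-based timing are my own illustrative assumptions, not any particular device's interface; a real WDT is a hardware counter that runs independently of the code it guards.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a hardware watchdog timer (WDT). */
typedef struct {
    uint32_t timeout_ticks; /* reload value written on every kick */
    uint32_t remaining;     /* ticks left before the watchdog fires */
    bool     fired;         /* latched once the timeout has elapsed */
} wdt_model_t;

void wdt_init(wdt_model_t *w, uint32_t timeout_ticks)
{
    w->timeout_ticks = timeout_ticks;
    w->remaining     = timeout_ticks;
    w->fired         = false;
}

/* The application "kicks" the watchdog to show that it is still alive. */
void wdt_kick(wdt_model_t *w)
{
    if (!w->fired) {
        w->remaining = w->timeout_ticks;
    }
}

/* One tick of the watchdog's own clock, independent of the application. */
void wdt_tick(wdt_model_t *w)
{
    if (w->fired) {
        return;
    }
    if (w->remaining > 0u) {
        w->remaining--;
    }
    if (w->remaining == 0u) {
        w->fired = true; /* a real device would now reset the system */
    }
}

/* Simulate `ticks` clock ticks. The application kicks once per tick
 * until it locks up at tick `hang_at`. Returns true if the WDT fired. */
bool wdt_simulate(uint32_t timeout_ticks, uint32_t ticks, uint32_t hang_at)
{
    wdt_model_t w;
    wdt_init(&w, timeout_ticks);
    for (uint32_t t = 0u; t < ticks; t++) {
        if (t < hang_at) {
            wdt_kick(&w);
        }
        wdt_tick(&w);
    }
    return w.fired;
}
```

With a 5 tick timeout, `wdt_simulate(5, 100, 10)` reports that the watchdog fired after the simulated lockup at tick 10, whereas `wdt_simulate(5, 100, 100)`, which never hangs, does not.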

Despite their abilities, watchdogs don’t, in my opinion, receive the attention they deserve, perhaps because many designers see them as an admission that their code could be incorrect. And a poorly implemented watchdog is worse than useless, as Toyota found out. So the following questions are intended as a checklist of issues to consider when designing your watchdog, and no, it’s never quite as simple as you might imagine. Herewith…

The checklist

  1. What’s the maximum time to respond?
  2. How does the maximum time relate to the time the system can tolerate before it’s damaged?
  3. What’s the minimum time needed to prevent false triggering?
  4. If the watchdog is an internal on chip device, have its control registers been protected?
  5. Have you considered software timing variability due to interrupts, context switching and caching?
  6. If it’s an internal watchdog, have you considered the scenario of a software bug resetting the timer?
  7. Do you use a single bit or an ISR to reset the timer, and why?
  8. If the timer requires two back to back writes to reset, have you considered an interrupt occurring between them?
  9. If you’re resetting the timer in multiple threads or processes how does this correlate to the timer interval?
  10. If an MMU is present have you masked all access to the watchdog, other than by a designated watchdog task?
  11. If multiple tasks can access the watchdog timer, how do you assure that a code error will not reset it?
  12. How do other tasks update a watchdog task with their status? On start? Close? Each time through a loop?
  13. Does the timer initiate a full system reset or a Non-Maskable Interrupt (NMI)? 
  14. If you’re using an NMI, justify why you think you can trust the CPU after a watchdog has fired?
  15. Have you reset all peripherals? If not what are the potential consequences?
  16. Are the peripheral resets software or hardware? If software, why would you trust them after the watchdog has fired?
  17. How trusted is the clock used by the watchdog?
  18. What sort of system growth do you need to allow for?
  19. What margin do you need to allow for failure of the timer itself due to a single event upset (SEU)?
  20. Do you need to be able to disable the timer for test, diagnostics or software updates?
  21. Do you need to disable the timer operationally? If it fails?
  22. How reliable is the power supply to the timer? What happens if the supply is disabled?
  23. What happens when the power cycles?  
  24. Have you stored the fact that the timer has triggered (and the number of times)?
  25. Have you used a saturating counter?
  26. Is timer history stored in non-volatile (and always powered) memory?
  27. Have you considered the timer as a single point of failure (SPOF)?
  28. What is the disabling mechanism, stored or a hardware discrete command? Why?
  29. For stored commands, under what scenarios should the timer resume? After test? Comms loss? If a software load goes bad?
  30. For discrete commands what’s the mechanism and why do you trust it?
  31. What’s the default timer state (enabled or disabled)? 
  32. If it’s default enabled have you considered the tradeoff between simplicity and autonomy vs recovery difficulty?
  33. If it’s default disabled have you considered how to reset the timer and any difficulties?
  34. Is a jumper used to enable or disable the watchdog? Is it left in?
  35. If a jumper is used does it allow the watchdog to run in isolation during testing?
  36. When is the jumper removed and how do you know?
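
Questions 1 to 3 and 18 (the timing budget) and 24 and 25 (recording that the timer has fired) lend themselves to a small worked sketch in C. Everything here is an assumption for illustration: the `wdt_budget_t` fields, the millisecond units, and the simple rule that the timeout must exceed the worst-case kick interval plus a growth margin while timeout plus recovery time stays within what the system can tolerate. Your own budget will almost certainly need more terms than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timing budget, all times in milliseconds. */
typedef struct {
    uint32_t worst_case_kick_interval_ms; /* longest gap between kicks, including
                                             interrupts, context switches, cache misses */
    uint32_t growth_margin_pct;           /* allowance for future software growth */
    uint32_t hazard_tolerance_ms;         /* how long the system tolerates a hung
                                             controller before damage occurs */
    uint32_t recovery_time_ms;            /* reset plus return to a safe state */
} wdt_budget_t;

/* True if the timeout is long enough to avoid false triggering, yet short
 * enough that detection plus recovery fits within the hazard tolerance. */
bool wdt_timeout_ok(const wdt_budget_t *b, uint32_t timeout_ms)
{
    uint64_t min_timeout =
        (uint64_t)b->worst_case_kick_interval_ms * (100u + b->growth_margin_pct) / 100u;
    uint64_t worst_response = (uint64_t)timeout_ms + b->recovery_time_ms;

    return (timeout_ms >= min_timeout) &&
           (worst_response <= b->hazard_tolerance_ms);
}

/* Saturating count of watchdog resets: it sticks at the maximum instead of
 * wrapping back to zero, so a storm of resets can never read as "none". */
uint8_t wdt_count_reset(uint8_t count)
{
    return (count < UINT8_MAX) ? (uint8_t)(count + 1u) : count;
}
```

For example, with a 10 ms worst-case kick interval, 50% growth margin, 100 ms hazard tolerance and 20 ms recovery time, a 50 ms timeout passes, a 12 ms timeout risks false triggering, and a 90 ms timeout responds too late. The counter would normally live in the non-volatile memory that question 26 asks about.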


1. That is, a task to which other tasks write their status, and which then resets the hardware timer based upon the status of these tasks.
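
The arrangement described in this note might be sketched in C as follows. This is only an illustration under my own assumptions: three monitored tasks, a bitmask of check-in flags, and a stand-in function `hw_watchdog_kick` in place of a real hardware register write. A real implementation would also need to protect the flags against concurrent access (atomics or a critical section), which is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS      3u                        /* hypothetical number of monitored tasks */
#define ALL_TASKS_MASK ((1u << NUM_TASKS) - 1u)  /* one bit per monitored task */

static uint32_t checkin_flags; /* bit n set = task n has reported healthy */
static uint32_t hw_kick_count; /* stand-in for writes to the hardware timer */

/* Stand-in for the real hardware kick, e.g. a write to a WDT register. */
static void hw_watchdog_kick(void)
{
    hw_kick_count++;
}

/* Called by each monitored task, for instance once per pass through its loop.
 * NOTE: not safe against concurrent access as written. */
void task_checkin(uint32_t task_id)
{
    if (task_id < NUM_TASKS) {
        checkin_flags |= (1u << task_id);
    }
}

/* Called periodically by the watchdog task. The hardware timer is kicked
 * only if every monitored task has checked in since the last poll. */
bool watchdog_task_poll(void)
{
    bool all_alive = (checkin_flags == ALL_TASKS_MASK);

    if (all_alive) {
        hw_watchdog_kick();
    }
    checkin_flags = 0u; /* every task must check in afresh for the next poll */
    return all_alive;
}

/* Number of times the (simulated) hardware timer has been kicked. */
uint32_t hw_kick_total(void)
{
    return hw_kick_count;
}
```

If only tasks 0 and 1 check in before a poll, the hardware timer is not kicked and will eventually expire; the single point at which the hardware is touched is `watchdog_task_poll`, which is the economy versus capability trade off in miniature.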