Analyze the recipe when you bake a failure cake
Fundamentally, software engineering is about managing complexity, and as sure as night follows day, catastrophe follows complexity.
That’s the conclusion I’ve come to after many years of collecting bugs and war stories. A pattern I have noticed is that the worst problems were not strictly bugs at all – they were failures with no easy explanation, and they were extremely hard to diagnose. Looking back, I realise my mistake was assuming that a complex problem had a single, simple cause. In fact, these failures were caused by chains of events. Such chains are often obvious in hindsight but not beforehand. They are also very hard to understand as they unfold, and even after the dust has settled.
But we can learn from what auditors and accident investigators outside our industry do following a catastrophe. It’s not enough just to have theory or a codified body of knowledge; you have to drill and practice incident response. The way designers think about complex systems also points to architectural decisions we can make to limit the likelihood of errors snowballing.
We can also learn from past failures by questioning what is known as the single-cause assumption. When we are conducting a postmortem, we need to find the recipe – all the factors that contributed – and then analyze it.
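To make the idea of a recipe concrete, here is a minimal sketch of recording an incident as a set of linked contributing factors rather than one root cause. The `Factor` structure, the `contributing_factors` helper and the incident itself are all invented for illustration; they are not taken from the talk or from any real postmortem tool.

```python
from dataclasses import dataclass, field


@dataclass
class Factor:
    """One ingredient in the failure recipe (hypothetical structure)."""
    name: str
    # Names of other factors that had to be present for this one to matter.
    depends_on: list[str] = field(default_factory=list)


def contributing_factors(factors: list[Factor], outcome: str) -> list[str]:
    """Return every factor the outcome depended on, directly or indirectly."""
    by_name = {f.name: f for f in factors}
    seen: list[str] = []
    stack = list(by_name[outcome].depends_on)
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already recorded via another path
        seen.append(name)
        stack.extend(by_name[name].depends_on)
    return sorted(seen)


# Invented incident, purely for illustration.
incident = [
    Factor("outage", ["database overload"]),
    Factor("database overload", ["retry storm", "cache expiry"]),
    Factor("retry storm", ["aggressive client timeouts"]),
    Factor("cache expiry"),
    Factor("aggressive client timeouts"),
]

recipe = contributing_factors(incident, "outage")
print(recipe)
```

Walking the whole graph, instead of stopping at the first plausible culprit, is the point: in this made-up example the outage has four ingredients, and removing any one of several might have prevented it.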
“Simplicity does not precede complexity, but follows it.” – Alan Perlis.
In this presentation, which I gave at our recent Inside Intercom Engineering event, I describe a hypothetical web production system and how it can spiral into a catastrophic state due not to a single cause but to the interaction of smaller issues and failures. I also mention five strategies that we can use to help with complexity. Complexity can’t always be eliminated, but it can be managed.
Further reading:
Normal Accidents: Living with High-Risk Technologies by Charles Perrow
Sources of Power: How People Make Decisions by Gary Klein
The Checklist Manifesto: How to Get Things Right by Atul Gawande
Main image: Armando G Alonso