Wouldn’t life be great if nothing ever went wrong and technology never broke?
Unfortunately, the reality is that things break all the time.
Your product probably breaks a little bit all the time and occasionally breaks big time (hopefully much less often).
Perhaps your website goes down for 4 hours during your busiest season and you miss $10M in potential revenue. Or you discover that no payments have been processed for 2 days and nobody noticed.
In these serious to catastrophic cases, you really want to get to the bottom of why the problem occurred, with a view to making sure it never happens again.
That’s where conducting an effective Postmortem becomes vital.
The ultimate purpose of the Postmortem is to make sure the problem experienced never occurs again.
In order to do this, a Postmortem involves ferreting out root causes (more on this below). Bringing blame into the Postmortem process itself risks causing defensiveness and “ass-covering”. This defensiveness tends to obscure the true root causes.
This is not to say that blame isn’t important or can be avoided. Perhaps someone needs to be fired. But, keep the “blame” part separate from the Postmortem itself in order to get to the true root causes.
The Postmortem Process
I like to conduct the Postmortem process as a group, with all the stakeholders in a room in front of a whiteboard. It’s important that everyone impacted and/or responsible for the problem is involved and has a voice.
At a high-level, my process for conducting a Postmortem is as follows:
- agree and define the impact of the problem on the business – e.g. “we lost $10M in potential revenue”
- flush out all the causes of the problem down to their root, as far as possible
- agree a set of recommendations aimed at ensuring the problem never occurs again
Let’s use a contrived example for illustration. Imagine that I fell off my bike and broke my wrist. We start on the whiteboard with the impact – i.e. I broke my wrist.
My preferred method to analyze causes (the second step above) is Why-because Analysis.
Why-because is a formalized process but don’t be put off – it can be used more casually with great success and you can add rigor as you become more familiar.
Why-because essentially involves repeatedly asking “Why?” and repeatedly answering with the “because” part. My 5-year-old son is also great at this.
e.g. “Why did you fall off your bike?” “…because I hit a pothole.”
“Why did you hit a pothole?” “…because I wasn’t looking where I was going.”
“Why weren’t you looking where you were going?” “…because I was distracted.”
…you get the idea.
Why-because is similar to other processes you may be familiar with like “5 Whys”. (I have found 5 Whys to be insufficient because big problems typically have complex causes and the causal chains are often more than 5 levels deep.)
What you end up with at the end of the Why-because Analysis is a graph that shows you all the contributory causes behind the impact on your business. More formally, when complete, the Why-because graph should include all the necessary and sufficient causes.
Continuing our example, here’s our Why-because analysis of why I broke my wrist:
Of course, you need to decide when you’ve gone deep enough and can stop asking “Why?”. There is no hard-and-fast rule here – just use your judgement – but you don’t want to end up drilling down to “because the big bang happened” in every case.
One great thing about Why-because graphs like the one above is that you can test them to make sure they’re complete:
- for each box on the chart, you can ask, had this not occurred, would the problem still have occurred? If the answer is no, it’s a necessary condition.
- looking at all the boxes on the chart, you can ask, if all of these happened again, would the problem occur again? If the answer is no, your conditions are not sufficient and you’re not done yet.
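If you want to keep the graph somewhere more durable than a whiteboard, here is a minimal Python sketch of one way to do it. It is only an illustration, not part of the formal Why-because method: the dictionary layout and the function names (`all_causes`, `root_causes`, `necessity_questions`) are my own assumptions, and the graph contains just the single chain from the bike dialogue above. A real graph would branch. The code does not answer the test questions for you. It only makes sure nobody forgets to ask them for every box.

```python
from collections import deque

# Each effect maps to the causes that directly produced it.
# Only the chain from the bike example is recorded here (an assumption
# for illustration; a real postmortem graph would have many branches).
WHY_BECAUSE = {
    "broke my wrist": ["fell off my bike"],
    "fell off my bike": ["hit a pothole"],
    "hit a pothole": ["wasn't looking where I was going"],
    "wasn't looking where I was going": ["was distracted"],
}


def all_causes(graph, impact):
    """Return every cause reachable from the impact, breadth-first."""
    seen, order, queue = set(), [], deque([impact])
    while queue:
        node = queue.popleft()
        for cause in graph.get(node, []):
            if cause not in seen:
                seen.add(cause)
                order.append(cause)
                queue.append(cause)
    return order


def root_causes(graph, impact):
    """Causes with no further 'because' recorded, i.e. where we stopped asking."""
    return [c for c in all_causes(graph, impact) if not graph.get(c)]


def necessity_questions(graph, impact):
    """One counterfactual question per box, for the group to answer."""
    return [
        f'Had "{cause}" not occurred, would "{impact}" still have occurred?'
        for cause in all_causes(graph, impact)
    ]


if __name__ == "__main__":
    impact = "broke my wrist"
    for question in necessity_questions(WHY_BECAUSE, impact):
        print(question)
    print("Root causes:", root_causes(WHY_BECAUSE, impact))
```

The sufficiency test stays a judgement call for the group: look at everything `all_causes` returns and ask whether, if all of it happened again, the impact would happen again.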
Generally, big problems tend to have complex causes. This is because any reasonably mature organization will have checks and balances in place to avoid obvious and predictable failures.
Therefore, you will likely end up with a complex graph that includes a mixture of technical, operational and human contributory factors. It’s particularly important not to overlook or underplay the human factors since fixing the technical and operational issues alone will not avoid the problem recurring.
You can read more about Why-Because on Wikipedia.
The most important part of the process is to create a list of recommendations to act on, informed by the detailed understanding of the causes from the Why-because analysis.
Don’t forget the human factors – these are often the most important to address, e.g. additional training, more staff or better process.
Again, you can test your recommendations by asking: if we do all these things, is it highly likely to prevent this problem from recurring? If the answer is no, you’ve not got the right recommendations.
Lastly, give each recommendation an owner who is responsible for taking action and be sure to follow up.
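As a rough aid for that follow-up, the sketch below records each recommendation alongside the causes it targets and flags two things: recommendations with no owner, and causes that no recommendation touches. Everything in it is hypothetical, including the `Recommendation` class, the `review` function and the example actions and owner. A cause with no recommendation is not automatically a gap, since judging whether the set as a whole will prevent a recurrence is still up to the group, but it is a useful prompt for discussion.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recommendation:
    action: str
    addresses: List[str]          # causes from the Why-because graph
    owner: Optional[str] = None   # person responsible for taking action


def review(recommendations, causes):
    """Return (actions without an owner, causes no recommendation addresses)."""
    unowned = [r.action for r in recommendations if not r.owner]
    covered = {cause for r in recommendations for cause in r.addresses}
    unaddressed = [c for c in causes if c not in covered]
    return unowned, unaddressed


if __name__ == "__main__":
    # Invented example data, continuing the bike story for illustration only.
    causes = ["hit a pothole", "was distracted"]
    recommendations = [
        Recommendation("Report the pothole", ["hit a pothole"], owner="me"),
        Recommendation("Plan a quieter route", ["hit a pothole"]),
    ]
    unowned, unaddressed = review(recommendations, causes)
    print("Needs an owner:", unowned)
    print("No recommendation yet:", unaddressed)
```

In this made-up data the human factor (“was distracted”) is the cause left without a recommendation, which is exactly the kind of omission worth catching before the meeting ends.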