5 Tips for a Great Post-incident Review

Picking up the pieces after a catastrophe is the tried and tested way to learn and improve.

Sam Cooper
5 min read · Apr 8, 2023
Photo by Daniel Eledut on Unsplash

The post-incident review is a vital part of almost every software engineering team’s process.

After an incident that stops a piece of software from working, we go in and examine what went wrong. How can we make it less likely to break like that in future? How can we recover more quickly the next time something like that does happen?

The point of the process is to learn and improve. But running a good post-incident review takes skill and judgement. Here are five tips that will help you perfect your PIR game.

1. Don’t jump the gun

The post-incident review is called “post-incident” for a reason. During an incident, restoring normal service needs to be the only priority. Focus on what can be done here and now to get the system working again.

Questions and speculation about how the problem could have been avoided in the first place are probably just a distraction that risks delaying the fix.

Not only that, but tensions are likely to be high as people work under time pressure to get the incident resolved. In those conditions, it’s going to be harder to remain objective and avoid pointing fingers. It might be a cliché, but no-blame culture really is important. Employees who feel safe from blame are more likely to speak up when they see opportunities to improve safety.

Only start thinking about the post-incident review once the system is functioning again and the incident is over.

2. Know what you’re looking for

Have you seen Sully? It’s a great movie, but a horrible example of a post-incident review. The reason it’s bad, and unrealistic, is that the investigators focus on the wrong things.

In a real aviation investigation, the focus is squarely on safety. Nobody’s talking about insurance payouts or quizzing people on a witness stand. Instead, the only goal is to identify opportunities to improve systems and processes.

Post-incident reviews in software engineering should be the same. Life only moves forwards, and what you’re looking for are ways you can increase safety and decrease response time in the future. Structure the review around these goals — don’t let it degrade into “what-ifs” or accusations.

Remember, analysing human decision making is not the objective. Humans are imperfect: that’s a given. Our actions and decisions are flawed, and always will be. As engineers, our job is not to eliminate the human factor, but to design systems and processes within which human imperfections can exist safely. When things go wrong, there will always be human mistakes involved, but we need to look past them and address the systemic failures that allowed those mistakes to become problems.

3. Know your limits

A post-incident review can cover a lot of ground. However much you’ve prepared ahead of time, you should expect other people to bring up things you hadn’t thought of. By the end of the process, you’ll have a list of notes and suggestions as long as your arm.

Don’t be tempted just to write them all up as action points and call it a day.

The reason is simple: the more actions you come away with, the less likely they are to get done. Software engineering moves at a relentless pace, and you already have more tasks than time. How many times have you seen a corrective action gradually slip off the bottom of the to-do list?

Prune the list of actions ruthlessly. Remove anything that seems vague or hard to measure. Two solid actions is a good target to aim for. Three is pushing it, and four is too many.

Even one action point is fine. So long as you take a meaningful step to improve safety, the post-incident review has served its purpose.
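
If it helps to make the pruning rule concrete, here is a minimal sketch of it in Python. Everything in it is an assumption made for illustration: the `Action` class, the `measurable` flag, the 1 to 5 `impact` score and the example action titles are hypothetical, not part of any real tool or process.

```python
from dataclasses import dataclass


@dataclass
class Action:
    title: str
    measurable: bool  # hypothetical flag: can we tell objectively when it's done?
    impact: int       # hypothetical 1-5 estimate of the safety benefit


def prune_actions(actions, limit=2):
    """Drop vague actions, then keep the highest-impact ones up to `limit`."""
    concrete = [a for a in actions if a.measurable]
    ranked = sorted(concrete, key=lambda a: a.impact, reverse=True)
    return ranked[:limit]


# Made-up candidates from an imaginary review.
candidates = [
    Action("Improve monitoring", measurable=False, impact=4),
    Action("Alert when queue depth exceeds 10,000 for 5 minutes", True, 5),
    Action("Add a rollback step to the deploy runbook", True, 4),
    Action("Be more careful with config changes", False, 3),
    Action("Add an integration test for the failed code path", True, 3),
]

for action in prune_actions(candidates):
    print(action.title)
```

In practice the scoring is a judgement call made by the people in the room. The sketch only shows the shape of the rule: filter out anything you can't measure, then cap what remains at two.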

4. Failure is inevitable

If you don’t correct every single problem you identify, aren’t you just knowingly leaving the door open for more problems?

Every piece of software has dozens of ways it could be improved — from tactical tech debt that needs paying down to lax error-handling that needs shoring up. Most of the time, there’s no way you’re going to be able to fix all of it.

But maybe that’s okay.

Post-incident reviews are as much about preparing for the next incident as they are about avoiding it. Because there will be a next incident. Even if you continually improve your safety practices and systems, what you’re really doing is enabling yourself to build bigger, faster and better. And the bigger, faster and better you build, the more complex the system becomes and the more things there are that might go wrong.

Don’t treat incidents as failures and the post-incident review as a punishment for them. Software moves fast, and incidents and recoveries are likely always going to be part of that forward motion. The only way to make the code completely safe is to shut it all down and go home!

5. Socialize!

When you run a post-incident review, invite as many people as possible to participate. Keeping the process open and inclusive creates a great culture that will in itself do a lot to improve safety. And if you write up a report, share it as widely as you can. Every department in the company can benefit from the lessons learned.

But don’t just leave it there.

Be a fly on the wall, and go to other people’s post-incident review meetings! You don’t have to contribute — in fact, you might be better off staying quiet if the system under discussion isn’t one you know well — but you can still learn something.

Seeing the kinds of things that have gone wrong with other people’s systems helps you be ready for the ways your own software will fail. And seeing the corrective actions other teams have taken could help you put a process in place that will prevent a failure in future.

But perhaps most importantly, attending other post-incident reviews will help you improve your own PIR game!
