How your engineering team can get more from incident reviews
You can’t build software without encountering incidents. From critical bugs to full-blown outages, dealing with them is an inevitable part of the process.
As a result, you’ll find no shortage of articles telling you how to write a review of your incident, commonly known as a post-mortem. These top tips are widely practised:
- Avoid blame
- Focus on the details
- Ask why multiple times
This is all good advice, and following it will probably lead to an insightful report. But while there are plenty of tips for writing a good incident report, very few of them tell you how to facilitate the critical meeting where that report is assessed and discussed.
Ask five key questions
In the past, we asked a set of five questions during incident reviews at Intercom:
- Do we know what happened?
- Do we feel confident about how we detected this incident?
- Was it easy and straightforward to mitigate, rather than difficult and slow?
- Do we understand the takeaways, aka the lessons learned?
- Do we feel confident that we can prevent this from reoccurring?
Each question was designed to facilitate discussion. If we weren’t confident, why not? What could we do better? What could our systems do better? If we were confident, what did we do right? How would we apply that to the rest of the org? The questions were orientated to encourage a growth mindset, and push us to learn and improve. In reality, however, that’s not what the questions did.
Although we made it clear at the start of every meeting that blame was neither constructive nor welcome, the five-question format made presenters feel like they were on trial. They were in the hot seat, at the mercy of anyone with a question. This made the review a high-pressure, isolating experience for the engineer presenting. It’s hard to learn when you, or others around you, don’t feel safe.
The lessons and takeaways weren’t clear enough
Another problem with the format was that it didn’t create natural openings in conversation for other engineers to jump in. By focusing on a single event with set questions, we often overlooked commonalities and trends across incidents. This made it difficult for other engineers to understand what they could take away from these meetings and how the lessons might apply to them and their teams.
The result was a slow and often unproductive meeting where the facilitator tried to pull insights from a reluctant guest. Teams became increasingly reluctant to engage in the review process, and meeting attendance dwindled.
Experimenting with the meeting format
Not only was this meeting hard to navigate for the engineers, it was incredibly tricky to facilitate. So we decided to start experimenting.
We looked at the main improvements we wanted to see:
- We wanted everyone to feel truly safe
- We wanted to open up the conversation, encouraging open dialogue and collective learning
- We wanted to move beyond isolated incidents and spot recurring patterns and trends
- We wanted our weekly meeting to feel valuable, to be a weekly highlight for engineers
We made small changes each week, which gave us the freedom to learn from what worked and what didn’t, and to continuously tweak the meeting structure and format. As always, we shipped to learn by trialling new approaches as we went, and sticking with the ones that worked.
Flex your facilitation muscles
Facilitation is a learned and practiced skill – hard to perfect but hugely rewarding when done right. We consulted Meg Bolger and Sam Killermann’s Unlocking the Magic of Facilitation: 11 Key Concepts You Didn’t Know You Didn’t Know for inspiration, and used the authors’ three desired outcomes as goals:
- Time flies
- Everyone stays engaged
- Everyone grows, even the facilitator
One of Bolger and Killermann’s top tips is to be upfront about what you want from your participants, so at the start of every incident review we reminded our audience why we were all there. As the weeks progressed we started to add more to this speech. We made it clear that this meeting was for them, and that we would all benefit from their engagement.
This all happened while the team was adapting to working from home, so we encouraged everyone to practice good video meeting etiquette. We asked for cameras to stay on as much as home office conditions allowed. We toyed with moderating through raised hands and questions brought to the chat, but found that folks were courteous enough to manage speaking without interrupting when they had a point to raise.
Let participants guide the conversation
The question-and-answer format was clearly not working. So how could we guide constructive conversations that brought about real learning? It has always been team practice to assess submitted reports before each review, to make sure they meet our standards for a good report. We decided to capitalize on this process and use the pre-review assessment as an opportunity to extract common themes across multiple incidents.
We set a goal to generate at least four strong talking points: two based on the particular incident under review and two that addressed common themes. At the start of every meeting we would launch a poll with each of our talking points and ask participants to vote for the topic they’d most like to discuss. We’d start with the topic that topped the poll and go from there.
Sometimes these topics would serve as jumping-off points, allowing us to expand and follow interesting trains of thought. In other cases we would simply progress through the top voted points. We kept an “other” option available to keep the floor open and give everyone a chance to bring their learnings to the table.
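The voting step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the tooling we used: the talking points, the votes, and the `build_agenda` helper are all hypothetical, and it assumes each participant casts a single vote.

```python
from collections import Counter

# Hypothetical talking points: two from the incident under review,
# two cross-incident themes, plus the open "Other" option.
TALKING_POINTS = [
    "Incident: alert fired late",
    "Incident: rollback took too long",
    "Theme: config changes without review",
    "Theme: unclear on-call handoffs",
    "Other",
]


def build_agenda(talking_points, votes):
    """Order talking points by vote count, highest first.

    Votes for unknown topics are ignored. Python's sort is stable,
    so tied topics keep the order in which they were listed.
    """
    counts = Counter(v for v in votes if v in talking_points)
    return sorted(talking_points, key=lambda point: -counts[point])


# Made-up votes, one per participant.
votes = [
    "Theme: unclear on-call handoffs",
    "Incident: alert fired late",
    "Theme: unclear on-call handoffs",
    "Other",
    "Incident: alert fired late",
    "Theme: unclear on-call handoffs",
]

agenda = build_agenda(TALKING_POINTS, votes)
for position, topic in enumerate(agenda, start=1):
    print(position, topic)
```

The facilitator starts with the first item on the agenda and works down the list, or follows a tangent when one proves more interesting.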
Feedback informed our approach
To make sure these techniques were working we started sending automated surveys to our incident review Slack channel after each review. The survey asked participants to anonymously agree or disagree with the following statements:
- This meeting was a good use of my time
- I think we discussed the right things
- This meeting felt safe and constructive
- I learned something new in this meeting
We included an open-text question asking how participants felt we could improve. The feedback was really encouraging; most participants agreed with the above statements and we started to see an average weekly score of 4.5 out of 5. Slowly but surely, attendance at reviews started to climb. By the end of the quarter, attendance had increased by 50% – a great return for a relatively small effort.
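As a rough illustration of how numbers like these can be derived, here is a small Python sketch. It assumes each statement is rated on a 1 to 5 agreement scale and that the weekly score is a plain average over all ratings; the function names and every figure in the example are hypothetical, not our actual survey data.

```python
from statistics import mean

# The four survey statements, in the order respondents rate them.
STATEMENTS = [
    "This meeting was a good use of my time",
    "I think we discussed the right things",
    "This meeting felt safe and constructive",
    "I learned something new in this meeting",
]


def weekly_score(responses):
    """Average every rating from every respondent for one week.

    Each response is a list of 1-5 ratings, one per statement.
    Returns None when nobody answered.
    """
    ratings = [rating for response in responses for rating in response]
    return round(mean(ratings), 2) if ratings else None


def attendance_change(before, after):
    """Percentage change in attendance between two head counts."""
    return (after - before) / before * 100


# Made-up anonymous responses from two participants.
responses = [
    [5, 4, 5, 4],
    [4, 5, 5, 4],
]

print(weekly_score(responses))
print(attendance_change(20, 30))
```

Averaging per statement instead of across all ratings would show which of the four goals is lagging, which is often more useful than a single number.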
Incident reviews are a mechanism of change in any company
The progress we’ve seen to date has been really promising. We still have our outstanding problems, like how to tie these learnings into actionable items on our roadmap, or how to keep attendance strong in particularly busy weeks. But the aim of our program is to build an honest, accountable, and constructive culture within teams that care about our customer experience. The way your organization processes failure is a mechanism of change – if that mechanism is broken, boring, or painful, then your organization can’t reach its full potential.