Learning by fixing — the value of triage engineer rotations
Main illustration: Rewina Beshue
Breaking things and fixing them again is one of the best ways to learn. I learned this lesson early, thanks to my younger sister and her Japanese robotic toy dog. Somehow, I convinced her to let me take apart her robodog so I could see how it works.
“I’ll put it back together. Don’t be such a baby!”
How wrong I was. It would probably have been easier to put a Volkswagen Beetle back together than this toy dog. There I was, sitting clueless on the floor, surrounded by plastic parts and electronics. My sister was crying and I was sweating, trying to fix everything before our parents returned home. In the end, just in time, the dog was put back together (albeit with some mysterious spare parts hidden in the bin).
Fixing things and building things are very different from one another
Still, I learned a lot that day. I learned that engineering is hard. I learned that breaking things feels bad. I learned that trying to fix things can be stressful. I learned that fixing things and building things are very different from one another. But above all, I learned that trying to fix things is actually a great way to learn.
Introducing triage engineer rotations
I often think of that incident because I’ve found many of those lessons resonate with the way we do things at Intercom, particularly in the way we separate the different processes of building and fixing.
Recently, Brian Scanlan wrote about how we developed an out-of-hours on-call team to deal with emergencies and ensure the best possible uptime for our product while avoiding burnout among engineers.
But we also have a way of optimizing our on-call process during the working week, so that engineers can focus on building rather than being distracted by fixing issues.
We introduced the idea of a triage engineer rotation. Every week, we nominate a triage engineer for the team, who shields teammates from distractions during working hours. Their teammates, in turn, are able to focus deeply on their goals. But the benefits go much further than fostering better focus.
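A weekly nomination like this can be as simple as a round-robin over the team roster. The sketch below is only an illustration, not how Intercom assigns the role: the roster names and the `triage_engineer_for` helper are hypothetical, and it assumes the rotation week runs Monday to Sunday.

```python
from datetime import date, timedelta

# Hypothetical roster; the names are placeholders, not from the article.
TEAM = ["alice", "bob", "carol", "dan"]


def week_index(day: date) -> int:
    """Number of whole Monday-to-Sunday weeks before the week containing `day`.

    Ordinal day 1 (0001-01-01) is a Monday in Python's proleptic Gregorian
    calendar, so integer division by 7 yields a counter that increases by
    exactly one every Monday, including across year boundaries.
    """
    return (day.toordinal() - 1) // 7


def triage_engineer_for(day: date, team: list[str] = TEAM) -> str:
    """Return the teammate on triage duty for the week containing `day`."""
    return team[week_index(day) % len(team)]


if __name__ == "__main__":
    # Print the next four weeks of the rotation, starting from this week.
    today = date.today()
    for offset in range(4):
        day = today + timedelta(weeks=offset)
        print(day.isoformat(), triage_engineer_for(day))
```

Keying the rotation on a continuous week counter rather than the calendar week number avoids a skipped or repeated turn at the turn of the year. A real rotation would also need to handle holidays and swaps, which this sketch ignores.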
Triage engineer mission and expectations
Your main mission as the triage engineer is to shield teammates from distractions. That means being the first to answer any messages about the team and the systems you own. You report the status of issues to the team at the morning stand-up and inform them of anything relevant.
The triage engineer should also manage high-priority issues, investigate new issues, and, if time permits, fix low-priority issues. If some issues need more planning before they can be resolved, the triage engineer will suggest them as next week’s tasks during a planning meeting.
Triaging issues
Triaging is nothing more than determining the priority of an emergency. The prerequisite for this process is that each team has a set of categories covering its area of responsibility. These can be created as labels or tags within your issue-tracking software. We use GitHub for issue tracking.
There are several steps we take while triaging an issue, and the workflow looks like this.
How to assess priority of issues
An important part of the process is assessing the priority of issues as they arise. In general, we follow these prioritization guidelines:
- P1
A primary workflow in Intercom is broken, or not working as expected. The relevant team should immediately work on this above all other commitments. A P1 should never be open without someone investigating it.
- P2
A specific feature or part of Intercom isn’t working as expected. However, it doesn’t prevent users from completing primary workflows. At the latest, this should be scheduled into the relevant engineering team’s plans for the following week.
- P3
Minor items where something is technically broken, i.e. not working as designed. Engineering teams should respond on a case-by-case basis, depending on their other priorities.
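The guidelines above boil down to two questions asked in order. Here is a minimal sketch of that decision, assuming the triage engineer has already judged whether a primary workflow or only a specific feature is affected. The `Issue` fields and the `assess_priority` helper are hypothetical names for illustration, not part of any Intercom or GitHub tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    P1 = "P1"  # primary workflow broken: work on it immediately
    P2 = "P2"  # feature broken: schedule by the following week at the latest
    P3 = "P3"  # minor breakage: handle case by case


@dataclass
class Issue:
    """Hypothetical issue record; the two flags are human judgments."""
    title: str
    breaks_primary_workflow: bool
    breaks_feature: bool


def assess_priority(issue: Issue) -> Priority:
    """Apply the P1/P2/P3 guidelines in order of severity."""
    if issue.breaks_primary_workflow:
        return Priority.P1
    if issue.breaks_feature:
        return Priority.P2
    return Priority.P3


if __name__ == "__main__":
    issue = Issue("Tooltip is misaligned", False, False)
    # The resulting value could be applied as a label in the issue tracker.
    print(issue.title, "->", assess_priority(issue).value)
```

The hard part is the judgment behind the two flags, not the branching, which is why this stays a human decision in practice.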
At the end of the week, organize a short issue hand-off. Use this meeting to tell the next on-call triage engineer about important issues left open and any problems that may be emerging. That’s it! You’re now ready to handle any issue flying your way.
Time for triage: Investigating issues
There’s a lot of prejudice among engineers against fixing boring old issues. Everyone wants to work on a fancy new project. However, it can actually be a lot of fun playing detective with an issue. Digging around and exploring the existing systems is a great opportunity to learn. Investigating issues will force you to actually read and understand other people’s code.
Engineers realize the value of not breaking things
That brings me back to a few of the big lessons I learned trying to fix my sister’s robodog. Fixing things is one of the best ways to learn. It’s also incredibly satisfying. But the triage rotation also gives engineers a real system of ownership and responsibility. If I hadn’t had to fix that robodog, I wouldn’t have learned exactly how bad it feels to break something.
By regularly working triage rotations, engineers realize the value of not breaking things, or more precisely, the value of building them to be stable and resilient. They get a broad context of what the team owns and how things work, and where weaknesses can occur.
It’s hard to decide that an engineer should focus exclusively on being the distraction shield for a week, but the dividend is double-sided: the other teammates can do deeply focused work without the distraction of issue alerts, while the insights you get from doing the on-call triage rotation are invaluable.
A lot of this sounds like common sense, but a surprising number of companies don’t manage to implement an approach like this. The cost isn’t just in unresolved issues, but also in constantly broken focus. That makes it harder not just to fix things, but to build them in the first place.