Shipping fast and safe: Building a culture of low-risk learning
Main illustration: Axel Kinnear
Here at Intercom, we believe in shipping as quickly as possible.
It breathes life into your engineering team, and teams across the company, as customer issues and requests are resolved quickly and efficiently. But here’s the caveat: you can’t just ship fast, you have to make sure your team is equipped to ship safely.
Why do we ship fast?
We’ve always shipped to learn at Intercom. We believe there’s only so much you can do with automated tests and pre-production environments. Production is the only place where your code, infrastructure, and customers come together to represent objective reality. It’s only in production that you can truly validate the performance and correctness of your code, and learn how your customers use your product. Shipping speed is the key to success; it allows high-performing teams to learn and iterate faster.
As an organization, we foster a culture of iterative delivery, and we continuously invest in our delivery pipeline to make deployments delightful and easy. We encourage engineers to ship the smallest possible changes (we call them cupcakes) as quickly as possible, and every new hire on our engineering team ships a feature within their first week. We deploy more than 50 times per day, and it generally takes less than 12 minutes to ship new code.
But, as always, with great power comes great responsibility. Every new change increases the risk of a failure. After all, most of the problems awaiting you in production are so-called “unknown unknowns” – things you can’t predict or plan for. Rather than creating artificial barriers to shipping, we minimize risk by building resilience – investing in observability, ensuring fast recovery, and reducing the blast radius of potential issues, while maintaining effective communication across the board.
Here is a set of techniques we employ at Intercom to enable product engineers to safely learn from the production environment.
Be available after shipping
Be available after you ship a risky change – it’s your responsibility to make sure your change gets out safely. Monitoring and alerting systems aren’t perfect, and it can take some time for customers to notice problems and report them to the support team. You have in-depth knowledge of the change and the context around it, so you will be able to spot problems even before automation kicks in.
We encourage engineers to observe and test the changes they ship in production using little tricks like Slack notifications upon deployment completion. Engineers learn how to assess and mitigate negative customer impact through regular lightweight on-call shifts during office hours.
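As a rough illustration of the deployment notification idea, here is a minimal Python sketch that posts a message to a Slack incoming webhook when a deploy finishes. The function names, the message wording, and the webhook URL are assumptions made for this example; this is not a description of Intercom's actual tooling.

```python
import json
import urllib.request


def build_deploy_message(author: str, commit_sha: str, summary: str) -> dict:
    """Build the JSON payload for a Slack incoming webhook (it expects a "text" field)."""
    short_sha = commit_sha[:7]
    return {
        "text": (
            f"Deploy finished: {summary} ({short_sha}) by {author}. "
            "Please verify it in production."
        )
    }


def notify_slack(webhook_url: str, message: dict) -> int:
    """POST the payload to the webhook and return the HTTP status code.

    webhook_url is hypothetical here; a real one comes from your Slack workspace.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A deploy script could call notify_slack as its last step, so the engineer who shipped the change gets a prompt to go and check it while the context is still fresh.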
Ship instrumentation first
We all deal with business-critical legacy systems from time to time. Even the simplest change to those systems could be risky because of your lack of context and the potential blast radius. So how do you start learning safely? One option is to ship instrumentation first – even a simple log line can save you hours of meticulous planning. Understanding what triggers the code path you are about to change helps to identify potential dependencies that may not be obvious from the code itself. Knowing the scale of the traffic in advance will help to define desired performance guarantees.
At Intercom, we invest in our auto-instrumentation, exposing high-quality tracing telemetry data out of the box. We encourage engineers to ship custom instrumentation even before the code itself is written – the data gathered is a valuable input into the tech planning process.
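To make "ship instrumentation first" concrete, here is a minimal Python sketch of wrapping a legacy function so that every call is counted and logged before any behavior is changed. The decorator, the label, and the recalculate_invoice function are hypothetical; a real setup would more likely emit metrics or traces than keep an in-memory counter.

```python
import functools
import logging
import time
from collections import Counter

logger = logging.getLogger("legacy_instrumentation")
call_counts = Counter()  # in-memory tally for the sketch; assume real metrics in practice


def instrumented(label):
    """Count and log each call to the wrapped function without changing its result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call succeeded or raised.
                elapsed_ms = (time.perf_counter() - started) * 1000
                call_counts[label] += 1
                logger.info("legacy path hit: label=%s duration_ms=%.2f", label, elapsed_ms)
        return wrapper
    return decorator


@instrumented("billing.recalculate_invoice")
def recalculate_invoice(amount_cents, tax_rate):
    # Hypothetical legacy function; its logic is left untouched.
    return round(amount_cents * (1 + tax_rate))
```

Shipping only the wrapper first tells you how often the path is hit and how long it takes, which is useful input before touching the logic itself.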
Use feature flags
Being able to turn something on or off for a subset of your customers is a superpower. It allows you to learn and iterate safely while minimizing the blast radius. At Intercom, we use feature flags: a simple mechanism to control which customers have access to a feature.
Knowing you can disable a change with a click of a button creates a safe haven for engineers to learn from production. Feature flags change the way engineers think about product development, from internal experiments to public betas. During an outage, feature flags can also be used to turn off non-critical components, helping the system to recover faster.
Avoid leaving too many feature flags hanging around for too long – they make it difficult to understand exactly how an application works. At Intercom, we’ve built an automated process to surface stale, globally enabled feature flags and notify the corresponding product teams.
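The ideas in this section can be sketched in a few lines of Python: a flag that is enabled per customer or globally, an off switch that overrides everything, and a check for flags that have been globally enabled for too long. The FeatureFlag class, its fields, and the staleness rule are all assumptions made for illustration, not Intercom's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional, Set


@dataclass
class FeatureFlag:
    name: str
    enabled_globally: bool = False
    enabled_customers: Set[str] = field(default_factory=set)
    killed: bool = False  # the "off button": overrides every other setting
    globally_enabled_at: Optional[datetime] = None

    def is_enabled_for(self, customer_id: str) -> bool:
        if self.killed:
            return False
        return self.enabled_globally or customer_id in self.enabled_customers

    def enable_globally(self, now: datetime) -> None:
        self.enabled_globally = True
        self.globally_enabled_at = now

    def is_stale(self, now: datetime, max_age: timedelta) -> bool:
        """A flag that has been on for everyone longer than max_age is a cleanup candidate."""
        return (
            self.enabled_globally
            and self.globally_enabled_at is not None
            and now - self.globally_enabled_at > max_age
        )


# Example usage with hypothetical names.
flag = FeatureFlag(name="new-inbox-search")
flag.enabled_customers.add("customer-42")
if flag.is_enabled_for("customer-42"):
    pass  # serve the new code path for this customer only
```

Keeping the off switch as a separate field means disabling a change during an incident does not lose the record of which customers had it enabled.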
Ship to a small subset of the traffic
While feature flags help to reduce the blast radius, sometimes it’s not easy to choose a representative sample of your customers to uncover all of the “unknown unknowns.” Consider shipping to a small, random percentage of your traffic to reveal the edge cases while keeping the impact relatively small.
Getting a random selection is key, and there are multiple ways to do it. At Intercom, after a pull request is reviewed and approved, you can ship the change to just a single “canary” instance running your code to discover any issues.
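One common way to pick a small, effectively random slice of traffic is to hash a stable key into a bucket and compare it against a percentage. The Python sketch below shows that approach; it is a different mechanism from the canary instance described above, and the in_rollout function and its parameters are hypothetical.

```python
import hashlib


def in_rollout(key: str, feature: str, percentage: float) -> bool:
    """Return True if this key falls inside the rollout percentage for the feature.

    The same key always gets the same answer, so a customer does not flip
    between old and new behavior from one request to the next.
    """
    # Salt with the feature name so different rollouts pick different subsets.
    digest = hashlib.sha256(f"{feature}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percentage * 100  # percentage=5 selects buckets 0..499


# Example usage: send roughly 5% of customers down the new code path.
if in_rollout("customer-42", "new-billing-engine", 5):
    pass  # new code path
```

Because the choice is deterministic, raising the percentage only adds customers to the rollout; those already included stay included.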
Ship the “read” path first
It is relatively easy to undo a change that does not affect the way data is written (“write” path). Our single-click rollback mechanism can put the previous deployment back into service straight away. In contrast, undoing a change that has altered the “write” path might result in a different type of outage – the previous version of the code may not recognize the data produced by the reverted change.
To stay on the safe side, we try to ship changes to the “read” path first to teach the system how to consume the data in both the old and new formats. Only then will it be safe to consider shipping a risky change to the “write” path.
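Here is a minimal Python sketch of that ordering, assuming a hypothetical record whose name is being split into two fields. The reader is shipped first and understands both formats; the writer keeps producing the old format until it is explicitly switched. The field names and the schema_version marker are invented for this example.

```python
from typing import Any, Dict


def read_display_name(record: Dict[str, Any]) -> str:
    """Step 1 (ship first): a reader that accepts both the old and the new format."""
    if record.get("schema_version", 1) >= 2:
        return f"{record['first_name']} {record['last_name']}".strip()
    return record["name"]  # old format: a single "name" field


def write_record(full_name: str, use_new_format: bool) -> Dict[str, Any]:
    """Step 2 (ship later): switch the writer once the new reader is deployed everywhere."""
    if use_new_format:
        first, _, last = full_name.partition(" ")
        return {"schema_version": 2, "first_name": first, "last_name": last}
    return {"name": full_name}


# Both formats stay readable, so rolling the writer back never strands data.
old_record = write_record("Ada Lovelace", use_new_format=False)
new_record = write_record("Ada Lovelace", use_new_format=True)
assert read_display_name(old_record) == read_display_name(new_record)
```

With the reader already in place, rolling back the writer is safe: records written in the new format during the rollout remain readable by the code that is still running.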
Document and share your plan and actions
When dealing with a risky multi-step manual operation (such as a change to the infrastructure), it’s always a good idea to share a document detailing the rollout and rollback plans with the team. We use GitHub Issues (we call them tracking issues here at Intercom) to share context and document every step we take during the rollout in comments. This documentation helps team members to catch up on the operation quickly, and builds an excellent library of learning material – we often refer to these issues long after an operation is completed.
Learn in a safe environment
These techniques create a safe environment where Intercom’s engineers can provide real value to our customers, learning as they go. We’re constantly revisiting and improving our deployment process to ensure we continue to ship as quickly and safely as possible. If you’d like to be part of our high-performing team, come and join us – we’re hiring!