Making the best of a bad situation: Lessons from an Intercom outage
Imagine starting your day with a page about elevated exceptions, diving into Datadog, and uncovering a lurking 32-bit integer limit in one of the most critical parts of your app’s data model.
Well, that’s what happened during a particularly chaotic, and unusual, outage for Intercom. What followed was a five-hour marathon incident response involving monkey patches, migrations, and feature flag gymnastics – but despite the stress, it was certainly educational.
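For a sense of what a limit like this means in practice, here is a minimal sketch. It assumes a signed 32-bit integer column; the post doesn't say which column or database type was involved, and the helper names are hypothetical, for illustration only.

```ruby
# Bounds of a signed 32-bit integer (assumption: the column was signed).
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1 # 2_147_483_647

# Hypothetical helper: does this value still fit in a signed 32-bit column?
def fits_in_int32?(value)
  value.between?(INT32_MIN, INT32_MAX)
end

# Hypothetical helper: fraction of the positive id range already used,
# given the largest id currently stored.
def int32_range_used(current_max_id)
  current_max_id.to_f / INT32_MAX
end

fits_in_int32?(2_147_483_647) # => true
fits_in_int32?(2_147_483_648) # => false
```

A check like `int32_range_used` could feed an alert well before the ceiling is reached, so the problem shows up as a warning rather than a page.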
There was inevitably lots to learn, and I share the key lessons in this talk at Rails World 2024.
You can hear the highlights (and lowlights) of that incident, including what went wrong, how we fixed it, and the lessons we learned to prevent it from happening again.
I also dive into some of the technical changes we made after the fact to make our Rails app more resilient.
If you’re curious about how to avoid similar pitfalls – or just enjoy tales of debugging under enormous pressure – check out the video.