Evolving Intercom’s database infrastructure

Director of Engineering, Intercom

October 14, 2024

Intercom is rolling out a major evolution of our database architecture moving to Vitess – a scalable, open-source MySQL clustering system – managed by PlanetScale.

For many years, Intercom has used Amazon Aurora MySQL as our default database. With the addition of our custom sharding solution for high scale data, Aurora MySQL has allowed us to scale our databases with relative ease. It has supported hundreds of terabytes (TB) of data, 2 million reads per second, and tens of thousands of writes per second. Aurora MySQL has served us well as the source of truth for the majority of Intercom’s most critical production data.

“We deeply understand the importance of reliability because we experience it firsthand”

For our customers, when Intercom is down, critical parts of their business are affected. They expect flawless uptime, and so do we, even accounting for unforeseen disruptions or planned database maintenance. Our own teams – including Customer Support, Sales, Product, Engineering, IT, and more – rely heavily on our platform every day. An outage doesn’t just impact our customers; it impacts us directly. We deeply understand the importance of reliability because we experience it firsthand.

In late 2023, as we reviewed our database architecture, several factors led us to seek improvements: enhancing the customer experience, addressing operational friction, and keeping pace with a shifting database landscape.

Our review surfaced these goals:

Eliminate downtime due to database maintenance and writer failovers.
Reduce the complexity and cognitive load of working with databases across engineering teams.
Streamline the migration process and improve the latency of running large-scale database table migrations.
Achieve straightforward, low-effort scaling of MySQL for the next decade.

We aim to build “boring” software and are committed to running less software, choosing to build on standard technologies, and outsource the undifferentiated heavy lifting. With this in mind, we decided earlier this year to move our database layer to Vitess managed by PlanetScale within our AWS production accounts.

Why Vitess?

Vitess is a MySQL-protocol compatible proxy and control plane for implementing horizontal sharding and cluster management on top of MySQL. Originally developed by YouTube and now used by companies such as Etsy, Shopify, Slack, and Square, Vitess combines MySQL features with the scalability of NoSQL databases. It offers built-in sharding capabilities that enable database growth without necessitating custom sharding logic in the application.

Vitess automates tasks that impact database performance, such as query rewriting and caching, and efficiently handles functions like failovers and backups using a topology server (a system that keeps track of all the nodes in the cluster) for server management. It addresses the lack of native sharding support in MySQL, facilitating live resharding with minimal downtime and maintaining up-to-date, consistent metadata about cluster configurations. Importantly, it also acts as a connection proxy layer which would eliminate the majority of database related incidents we’ve had in recent years. These features effectively provide unlimited MySQL scaling.

Why PlanetScale?

PlanetScale builds upon Vitess by offering a managed platform that provides an exceptional developer experience and handles the undifferentiated heavy lifting of managing the underlying infrastructure. Their expertise, which includes core Vitess team members, allows us to benefit from advanced features like advanced schema management, database branching, and automated performance optimization.

The details around scale and challenges below largely relate to our US hosted region – the infrastructure in our European and Australian regions is similar but at a smaller scale. PlanetScale will be rolled out to all regions.

Supporting high scale: 2011 to 2024

As Intercom scaled, we adapted our database strategies in three main ways:

Get a bigger box: In the very early days of Intercom, scaling our databases was straightforward – we simply upgraded to larger and more powerful database instances. This vertical scaling approach allowed us to handle increased load by leveraging AWS’s flexible and ever improving instance types. With a maintenance window, we could move to instances with more CPU, memory, and I/O capacity as our data and traffic grew. However, this strategy has its limits. There’s only so much capacity you can add before hitting the ceiling of what a single machine can handle, both in terms of hardware limitations and ability to perform certain operations such as database migrations.
Functional sharding: To move beyond the constraints of vertical scaling, from 2014 we started implementing functional sharding within our architecture. This involved splitting our monolithic database into multiple databases, each dedicated to specific functional areas of our application. For example, we separated our conversations table out into its own database. By distributing the load across dedicated databases, we reduced contention and improved performance for specific workloads. This has its drawbacks, cross-database queries became more complicated, and maintaining data consistency across different shards required additional coordination through multi-database transactions. As AWS introduced larger and more powerful database instances, this scaling strategy has remained relevant.
Move to RDS Aurora: Soon after AWS released RDS Aurora in 2015, we eagerly migrated to RDS Aurora from the original RDS MySQL offering. Aurora’s architecture decoupled storage from compute, and allowed us to easily scale-out using read-replicas, avoiding replication lag and other problems that existed in traditional MySQL implementations at the time.

Sharding per customer

As our customer base and data continued to expand significantly, we faced database scalability challenges that could no longer be addressed by vertical scaling or functional sharding. To overcome this, we implemented customer sharding by horizontally partitioning our data based on customer identifiers. This approach allowed us to distribute the load more evenly across multiple database clusters and scale horizontally further by adding new database clusters as needed. Effectively, each customer would have their own database for high scale data (e.g. conversations, comments, etc.).

“Our sharding solution enabled us to handle billions of data rows and millions of reads and writes per second without compromising performance”

Building our own sharding solution was a substantial undertaking which we completed in 2020. We dedicated a team to develop a tailored solution using technologies we were already familiar with. This enabled us to handle billions of data rows and millions of reads and writes per second without compromising performance. Thanks to this setup, we were now able to migrate large-scale tables that we hadn’t been able to touch for years, unlocking easier and faster feature development.

Managing this sharded environment introduced new complexities. For example, our application had to incorporate logic to route queries to the correct shard and simple migrations, for example adding a new table, would take days to complete. This was better than not being able to change these tables at all, but still not optimal.

What problems did we see in our current setup?

Connection management

Intercom operates a Ruby on Rails application with its primary datastore being MySQL. In the USA hosting region, where the vast majority of Intercom workspaces are hosted, we run 13 distinct AWS RDS Aurora MySQL clusters.

A problem of this architecture is connection management to MySQL databases. There are limits on the maximum number of connections that can be opened to any individual MySQL host, and on Amazon Aurora MySQL the limit is 16,000 connections. Intercom runs a monolithic Ruby on Rails application, with hundreds of distinct workloads running in the same application across thousands of instances, connecting to the same databases.

“The use of ProxySQL allows us to scale our application without running into connection limits of the RDS Aurora MySQL databases”

As each running Ruby on Rails process generally needs to connect to each database cluster, the connection limit is something we had to engineer a solution for. On most of the MySQL clusters, the read traffic is sent by the Ruby on Rails application to read-replicas, which spreads the connections out over a number of hosts, in addition to horizontally scaling the query load balancing across the read-replicas.

However, for write requests, we need to use a different approach, and in 2017 we rolled out ProxySQL to put in front of the primary writer nodes in each MySQL cluster. ProxySQL maintains a connection pool to each writer in the MySQL clusters and efficiently re-uses connections to serve write requests made by our Ruby on Rails application. The use of ProxySQL allows us to scale our application without running into connection limits of the RDS Aurora MySQL databases.

In the last year, we’ve experienced a number of outages related to our use of ProxySQL. These issues arose particularly when we attempted to upgrade to ProxySQL 2.x and utilize new features like its integration with RDS Aurora read replicas, which led to instability and outages.

Database maintenance

Maintenance windows are a necessary evil of most database architectures, and nobody loves them. For many of our customers, when Intercom is down, large parts of their business is down too. This is increasingly relevant as Intercom builds out features such as Fin AI bot, which can resolve large amounts of conversations for our customers.

Maintenance windows are something we’ve avoided unless absolutely necessary and when needed, run the majority of them at the weekend in order to reduce the impact for our customers. With AWS Aurora, any upgrades or planned instance failovers (for example, for increasing the size of a database instance) required maintenance windows with customer impact ranging from five to seventy minutes.

For instance, during our upgrade from Aurora 1 to Aurora 2, we conducted ten maintenance windows across our regions, each causing actual disruptions between twenty and seventy minutes.

We knew we needed to do better here, and remove the need for maintenance windows entirely.

Intercom’s database architecture 2024 and beyond – enter PlanetScale

While these methods have allowed us to scale with relative ease, the database landscape has changed dramatically. Compared to 2019, when we decided on our custom application sharding approach, there are now more options for building practically infinitely scalable databases appropriate for Intercom.

Embracing Vitess and PlanetScale

To address the limitations and complexities of our existing database architecture, we have embarked on a journey to adopt Vitess managed by PlanetScale. This transition represents a significant evolution in our approach to database management, aiming to enhance scalability, reduce operational overhead, and improve overall availability for our customers. We have already migrated several databases and have many more to transition in the coming months. The benefits we’re already seeing include:

Simplifying connection management

One of the immediate benefits of Vitess is its ability to act as a single connection proxy layer through its VTGate component. VTGate is a stateless proxy server that handles all incoming database queries from the application layer. It intelligently manages connection pooling and query routing, effectively multiplexing a large number of client connections over a smaller number of backend connections to the MySQL servers.

“VTGate allows us to scale our application seamlessly without worrying about connection constraints”

By centralizing connection management, VTGate eliminates the 16,000 connection limit per MySQL host that we previously faced with Aurora. This removes the need for ProxySQL in our architecture, reducing a massive source of complexity, and potential points of failure. VTGate also provides advanced query parsing and can route queries based on the sharding key or even handle scatter-gather queries across multiple shards when necessary. This allows us to scale our application seamlessly without worrying about connection constraints or overloading individual database instances.

Zero-downtime maintenance and failovers

Vitess offers advanced features like seamless failovers, which are critical for eliminating customer downtime during maintenance operations such as software upgrades and changing instance sizes. Its built-in failover mechanisms ensure that if a primary node goes down, a replica can take over almost instantaneously without impacting ongoing transactions. This aligns perfectly with our goal of providing flawless uptime and eliminates the need for extended maintenance windows that disrupt our customers’ operations. With the clusters we’ve already migrated, we can refresh the database instances without any noticeable impact on our customer-serving metrics.

Native Sharding Support

Perhaps the most significant advantage of Vitess is its native support for horizontal sharding. Unlike our previous custom sharding solution, Vitess abstracts the complexity of sharding away from the application layer. Our engineers no longer need to write custom logic to route queries to the correct shard; Vitess handles it automatically based on the sharding scheme we define.

“This reduction in cognitive load allows our teams to focus more on delivering new features and less on managing database intricacies”

In time, we will also be able to combine our functionally sharded databases into a single logical database thereby reducing the complexity we introduced to maintain data consistency across the databases. For example, currently, if a new comment is created, three individual databases must be kept in sync. This reduction in cognitive load allows our teams to focus more on delivering new features and less on managing database intricacies.

Streamlined migrations and scalability

Running large-scale database migrations has been a pain point due to the time and complexity involved. Migrations on our largest non-sharded tables can take months to complete. Vitess addresses this with its online schema change tools operating on sharded data, enabling us to perform migrations with minimal impact on performance. Additionally, scaling horizontally becomes a straightforward process. Need more capacity? Simply add new shards, and Vitess will manage the data distribution without requiring significant changes to the application.

Partnering with PlanetScale

By choosing PlanetScale to manage our Vitess deployment within our AWS production accounts, we leverage their expertise and the contributions of the Vitess core team members they employ. PlanetScale provides a developer-friendly experience and takes on the undifferentiated heavy lifting of managing the underlying infrastructure. This partnership ensures that we benefit from best-in-class database management practices while allowing us to remain focused on what we do best: building our AI-first customer service platform for our customers.

One of the standout features PlanetScale offers is its advanced schema management capabilities. PlanetScale enables non-blocking schema changes through a workflow that allows developers to create, test, and deploy schema modifications without impacting the production environment. This is facilitated by their concept of database branching, akin to version control systems like Git. Developers can spin up isolated database branches to experiment with changes, run tests, and then merge those changes back into the main branch seamlessly. This drastically reduces the risk associated with schema migrations and empowers our engineers to iterate faster, ultimately accelerating our product development cycles. Just like with Git, if a database schema change is pushed to production and an issue is discovered, it can be reverted easily.

“This new mechanism improved the latency of the previously expensive query by 90%”

PlanetScale also allows for net new mechanisms we can use to serve requests. For instance, we recently used materialized views to optimize the counting of open, closed, and snoozed conversations for teammates. This new mechanism improved the latency of the previously expensive query by 90%, leading to a faster teammate experience and reduced database load.

Additionally, PlanetScale provides automated index and query optimization tools. The platform can analyze query performance and suggest or automatically implement index improvements to enhance database efficiency. This proactive approach to optimization reduces the operational overhead typically associated with manual database tuning – everyone on the team can now operate like a world class database expert. These improvements ensure that our queries run efficiently and allow us to maintain high application performance, which translates to a smoother and more responsive experience for our customers.

Challenges faced during migration

Moving the databases that are responsible for Intercom’s most critical data is a major undertaking and it has not been without its challenges. Despite thorough planning and testing, we encountered several issues that provided valuable learning opportunities and ultimately strengthened our migration strategy as we move across more databases.

Latency spikes due to cold buffer pools

One of the initial hurdles was unexpected latency during the cutover of one of our core databases to PlanetScale. When we redirected traffic to the new Vitess cluster, we anticipated some initial latency as the database caches warmed up. However, the latency spikes were more significant and lasted longer than expected – particularly in one availability zone.

This was primarily due to cold buffer pools on the MySQL instances within Vitess. Since these instances had not served production traffic before, their caches were empty. As a result, queries that would typically be served from memory had to fetch data from disk, increasing response times. While we anticipated this problem, we expected only a few seconds of latency, however in reality it continued for twenty minutes and made the Inbox slow to respond to customer requests.

To mitigate this for subsequent migrations we’ve implemented read traffic mirroring to pre-warm the buffer pools before redirecting live traffic. By simulating traffic to load frequently accessed data into memory, we can reduce the initial latency spikes during future migrations.

Disk I/O saturation and resource limits

During periods of high load after the initial cutover and at traffic peak, we observed that some replica servers were experiencing disk I/O saturation. The replicas reached the maximum IOPS allowed by their attached storage volumes. This led to increased CPU utilization in the “iowait” state, further degrading performance.

“Scaling down by removing excess capacity is significantly faster and less disruptive than scaling up under pressure”

The root cause was that the replicas’ IOPS were under-provisioned for the workload they needed to handle. To resolve this, we initiated the scaling out of additional replicas. However, adding new replicas was time-consuming due to the size of our data – restoring backups to new instances and allowing them to catch up with replication took several hours. During this period, standard operations in the Inbox were 1.5 to 3x slower than usual, with Workload Management most affected – slowing to between 5x and 10x normal latencies.

Our takeaway from this is that we will significantly overscale all clusters as we move across load. Scaling down by removing excess capacity is significantly faster and less disruptive than scaling up under pressure.

Configuration changes and unexpected interactions

We also faced challenges when certain configuration changes interacted poorly with application behavior. For instance, increasing the transaction pool size and the maximum transaction duration seemed beneficial in isolation. However, combined with a surge of scheduled operations, for example bulk unsnoozing of conversations on the hour, these changes led to resource saturation. The database was flooded with long-running transactions, causing latency and errors impacting the Inbox.

The road ahead

Our migration to Vitess is more than just a technological upgrade; it’s a strategic move to future-proof our database architecture for the next decade and beyond. By embracing Vitess and partnering with PlanetScale, we’ve positioned ourselves to provide even greater reliability, scalability, and performance for our customers.

“The lessons we’ve learned and the mitigations we’ve implemented have set us up for success as we continue migrating our remaining infrastructure”

So far, we’ve successfully migrated our databases related to our AI infrastructure and one of our most critical databases powering the Inbox. These early migrations have validated our decision and provided invaluable insights. The lessons we’ve learned and the mitigations we’ve implemented have set us up for success as we continue migrating our remaining infrastructure.

Looking ahead, we’re excited about the possibilities that Vitess and PlanetScale open up for us. The native sharding capabilities will allow us to simplify our database architecture, reducing complexity and operational overhead. Our teams can focus more on delivering innovative features and less on managing database intricacies, ultimately enhancing the experience for our customers.

Update: We’ve made significant progress in our database migration journey. Learn more about the dramatic performance improvements, operational stability, and cost reductions we’ve achieved with PlanetScale Metal in our follow-up post: Evolving Intercom’s database infrastructure: Lessons and progress.

BFY_Blog ad_Vertical_Spring 25_Watch on demand