Personal Website of Levi Carter - Senior Software Developer with Marketpath from Noblesville, Indiana.

4 Ways to Scale a Production Database

As part of my regular responsibilities, I both actively and passively monitor resource utilization for a number of applications. Mostly this is so that I can identify and respond to infrastructure, availability, or other issues that might come up. A second purpose for this monitoring is to identify trends in resource utilization as the product and its adoption grow, so that we can stay ahead of the need for additional capacity.

It is difficult to overstate the importance of proactively identifying potential bottlenecks and addressing them BEFORE they become an issue.
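As a rough illustration of that kind of trend-watching, even a simple linear projection can estimate how long you have before a resource hits a chosen ceiling. The sample values and the 80% threshold below are made up for the example:

```python
# Illustrative sketch: project when average utilization (e.g. DTU %) will
# cross a capacity ceiling, assuming roughly linear growth. The numbers
# and the 80% threshold are hypothetical.

def weeks_until_limit(weekly_avg_util, limit=80.0):
    """Estimate weeks until utilization crosses `limit` (%), based on the
    average growth between the first and last weekly samples."""
    growth = (weekly_avg_util[-1] - weekly_avg_util[0]) / (len(weekly_avg_util) - 1)
    if growth <= 0:
        return None  # flat or shrinking usage: no projected bottleneck
    remaining = limit - weekly_avg_util[-1]
    return max(0.0, remaining / growth)

# Four weeks of samples growing ~2 points/week, currently at 46%:
print(weeks_until_limit([40.0, 42.0, 44.0, 46.0]))  # 17.0 weeks of headroom
```

Real monitoring data is rarely this tidy, but even a crude projection like this is enough to flag which resource will run out first.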

One of the primary resources used by the vast majority of applications is a database. In the event that current or anticipated needs exceed current database capacity, I have identified four strategies for scaling a database.

True, some applications use alternate data sources (event streams, hubs, APIs, etc.) or even none at all (pure IO), but those applications have completely different models for scaling and I am not focusing on them here. In practice, the scaling models in this article probably correlate directly to the models for scaling applications with alternate data sources.

Strategy 1: Optimize the current database

The goal of this strategy is to make the current database resources stretch farther by examining which resources are being over- or under-utilized and changing the data access and update models to optimize them. This may be as simple as creating new indexes or modifying a few queries to reduce CPU. It may involve utilizing bulk data operations to reduce IO. It may involve updating your database configuration (such as disabling SI/RCSI) to reduce Log IO. It may even involve restructuring your application to require fewer queries or eliminate unnecessary transactions. The specifics of how to optimize your current database will vary widely for each application, and you are not likely to know exactly how to optimize it until you have spent some time really digging in to understand your inefficiencies.
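The indexing piece of this can be illustrated with a small, self-contained sketch. It uses SQLite purely for portability (a production SQL Server or PostgreSQL database has its own tooling, such as missing-index DMVs or EXPLAIN ANALYZE), and the table and index names are made up for the example:

```python
import sqlite3

# Hypothetical workload: repeatedly aggregating one customer's orders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(10_000)],
)

def plan(query: str) -> str:
    # The 'detail' column of EXPLAIN QUERY PLAN says whether SQLite
    # scans the whole table or searches via an index.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

query = "SELECT SUM(total) FROM orders WHERE customer_id = 42"
p_before = plan(query)  # without an index: a full table scan
conn.execute("CREATE INDEX ix_orders_customer ON orders (customer_id)")
p_after = plan(query)   # with the index: a targeted index search
print(p_before)
print(p_after)
```

The same before/after comparison of query plans is exactly what you would do against your real database before committing to an index, since each new index also adds write and storage overhead.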

Cost:

  • Unknown Development Time

Risks:

  • Potential waste of development time with little to show for it.
  • Costly mistakes
  • Difficult to pin down the scope of changes - what starts out as a small tweak may balloon into a massive refactor.

Rewards:

  • Varies dramatically. The greatest rewards will come from databases that have a lot of room for improvement.
  • Optimistically, you may observe a significant performance gain from a modest amount of effort. In most cases you are still just buying time until you need to expand your capacity using another method, but the performance gain should also stack with other database scaling methods.
  • Pessimistically, you may have already spent a lot of effort making your database efficient and you may not see significant improvement unless you start making much larger changes to your architecture and/or database - which also come with increased risks and development costs.

Strategy 2: Scale the database resources up without further changes

If you are using a cloud provider for your database, you can probably scale up its resources relatively easily without having to make any other changes to your code. Your features and core database performance will probably remain exactly the same.

This is by far the fastest and easiest way to grow your capacity, but may also be the most expensive over time. Furthermore, you are likely to eventually reach a point where it is not feasible to continue growing using only this method.

If you maintain your own database it is a little (OK, a LOT) more complicated, but the same principles may apply. Exactly how depends on how you host and maintain your database, and hopefully you are already aware of the challenges associated with scaling it yourself.

Cost:

  • Semi-linear
  • Ex: Scaling up a database within the same service tier in Azure generally doubles the database resources for double the cost.

Risk:

  • The resources that your database requires as you grow may not be linear and you could run through resources faster than anticipated.
  • Keeping everything in one database increases the risk of contention and deadlocks affecting multiple (or even all) of your customers.
  • If this is the only way you plan to handle growth, you may eventually run out of room to scale up your database.
  • This only scales a single component of your application. You may need to scale other parts of your infrastructure as well.

Reward:

  • Maintain the current level of performance with low (or no) immediate risk.
  • May or may not scale linearly
  • Takes effect immediately - no development or release required.
  • Ex: In Azure, moving to the next pricing tier may serve roughly twice the number of customers. Optimistically, you may handle significantly more than twice as many customers due to economies of scale.

Strategy 3: Switch database models

Again, if you are using a cloud provider for your database, you can probably move to a different database model relatively easily - with little to no coding required. In Azure this may mean moving from a "Standard" to a "Premium" tier of service, or from a serverless to a provisioned database. There may be a bit more uncertainty associated with this move as you switch from a familiar to an unfamiliar model. The more similar the models, the more confidence you will likely have that the switch will work (e.g., Standard to Premium tiers).

For one of our applications we considered upgrading from a Standard to a Premium tier. For 3x the cost of an S3 database, we could upgrade to a P1 database. Unfortunately, most of what we would gain from that upgrade is unclear in the documentation - other than 25% more resources being granted to the database. At first glance, paying three times the cost for an extra 25% sounds like a terrible deal, but a closer examination reveals that the comparison is not that simple. According to the Azure documentation, the premium tier boasts a fraction of the latency, a minimum of 6 times the number of IO operations per "DTU" (resource unit), and better CPU speeds. That changes all of the calculations. Furthermore, the premium tier offers the "Read scale-out" feature, which may further improve your observed database performance.
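Using the rough multipliers above (assumptions pulled from this comparison, not official Azure figures), the back-of-the-envelope math looks like this:

```python
# Back-of-the-envelope S3 vs P1 comparison using the rough multipliers
# discussed above. These are illustrative assumptions, not official
# Azure numbers.
s3_dtus = 100        # S3 is documented as 100 DTUs
p1_dtus = 125        # roughly 25% more resources
iops_multiplier = 6  # premium claims at least 6x IO operations per DTU
cost_ratio = 3.0     # a P1 costs roughly 3x an S3

io_ratio = (p1_dtus / s3_dtus) * iops_multiplier
print(io_ratio)               # 7.5 -> at least 7.5x the IO throughput
print(io_ratio / cost_ratio)  # 2.5 -> at least 2.5x the IO per dollar
```

In other words, for an IO-bound workload the "3x cost for 25% more" framing is misleading: under these assumptions the premium tier delivers several times more IO per dollar, before even counting the latency and CPU improvements.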

Depending on your application architecture, you could even go crazy here and switch databases or database providers entirely. For example: you might consider moving your data from a SQL database into a NoSQL database or vice versa, moving from a cloud-managed database to a self-managed one, or switching to a different cloud provider for your database. Upgrading database models within a single provider is not too difficult, but switching database models entirely is extreme and should ONLY be considered under special circumstances.

Cost:

  • Typically higher than scaling up, and may grow either more quickly or more slowly after the initial change.
  • Ex: Moving from an S3 to a P1 database in Azure triples the cost.

Risk:

  • Potential unknowns when switching pricing models may initially make the cost and performance less predictable.
  • You may need some additional education and/or development to properly take advantage of the features of your new database model.
  • Keeping everything in one database increases the risk of contention and deadlocks affecting multiple (or even all) of your customers.
  • This only scales a single component of your application. You may need to scale other parts of your infrastructure as well.

Reward:

  • Varies dramatically based on the nature of the switch and must be examined for each case.
  • More and/or more desirable features available
  • In many cases you may be able to accomplish more with fewer resources when switching to a premium model
  • May have a better SLA or other improved business features.
  • As an example, moving from a Standard to a Premium database in Azure would result in lower latency, more IOPS, and higher availability via automatic replicas. Although it costs three times as much as the standard tier you could potentially improve performance while serving 3-8 times the existing load - all without making any significant code changes. There is also an additional opportunity to restructure your code to take advantage of the "Read scale-out" feature to further improve performance and scalability.

Strategy 4: Scale out instead of scale up

This may not be an option for everyone, but should at least be considered for most SaaS products. Instead of scaling up the number of resources for your database, consider adding new databases or even spinning up new instances of your entire application. Of course, you have to be ready and equipped to handle this, but there are a number of advantages once you do. In particular, this enables you to scale predictably and virtually indefinitely.

Each time you scale, your deployment and maintenance tasks will grow slightly as well, which must be accounted for. But a good automated deployment solution should be able to handle the increased load with relative ease.

This has the added advantage of isolating your customers from any potential negative side effects caused by other customers. You may even consider offering an "isolated instance" as a special add-on for customers who are interested in that level of service.
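The core of a scaled-out deployment is a routing layer that maps each tenant to its database instance. A minimal sketch of that idea (all names here are hypothetical; in practice the mapping would live in a catalog or shard-map store rather than a static dict):

```python
# Minimal sketch of tenant-to-instance routing for a scaled-out SaaS
# deployment. All tenant and instance names are hypothetical; a real
# system would back this with a catalog/shard-map store.

TENANT_MAP = {
    "acme": "db-instance-1",
    "globex": "db-instance-1",
    "initech": "db-instance-2",  # e.g. sold an "isolated instance" add-on
}

DEFAULT_INSTANCE = "db-instance-1"

def connection_target(tenant_id: str) -> str:
    """Resolve which database instance serves a given tenant."""
    return TENANT_MAP.get(tenant_id, DEFAULT_INSTANCE)

print(connection_target("initech"))  # routes to the isolated instance
```

Once every data access path goes through a lookup like this, adding capacity becomes a matter of provisioning a new instance and assigning new (or migrated) tenants to it.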

Cost:

  • Linear for PaaS costs
  • You will likely require additional development in order to scale out effectively (at least the first time).
  • Deployment and maintenance tasks will become slightly more complicated each time you scale out.

Risk:

  • Especially the first time you scale out, you run the risk of uncovering bugs related to managing multiple instances of your application.
  • Running manual queries against your database(s) may become more difficult due to the need to run them against multiple smaller databases instead of a single large database.

Reward:

  • Linear growth
  • Immediately and predictably increase capacity with only a very small number of unknowns.
  • Issues from one client can no longer bring down services to all others since they are spread out across multiple instances.
  • Ability to sell and market the resiliency of your services as proven by your isolated infrastructure (and hopefully all of the best practices that go along with that).

Other

It is worth noting that implementing one option does not preclude you from also implementing another later. For example: you could scale up the database for now while starting work on optimizing your code, then scale it back down once you release your database optimizations. Or you might decide both to switch to a premium database model and to scale out the number of live instances to quickly handle a dramatic increase in growth rate. That might seem like overkill right now, but if you experience exponential growth it is good to have options and a plan in place for how to handle it.

There are of course other options which may work better in different situations. One example might be combining multiple smaller databases into a single elastic pool with a higher number of resources - lowering cost while making more resources available to each smaller database. Another is switching data repository types entirely, doing away with the traditional (shared) database in favor of many small (dedicated) containerized databases. Just remember that the bigger your changes, the more work you will have to do in order to be confident in your solution.