Personal Website of Levi Carter - Senior Software Developer with Marketpath from Noblesville, Indiana.

Marketpath CMS Live Database Scaling Options

Marketpath CMS live sites are already load-balanced and auto-scaled. We constantly monitor our live resources and availability, so we are highly confident that we can effectively serve all of our customers (paying and non-paying). All of our live resources are capable of scaling manually, and almost all of the crucial resources are already set up to scale automatically when the need arises.

Rather than resting in our own self-confidence, however, we want to be proactive about identifying potential bottlenecks and addressing them BEFORE they become an issue - particularly as we grow and anticipate future growth.

One such bottleneck is our live databases. While our databases are capable of scaling manually with little to no downtime, they are not capable of auto-scaling. As such, we err on the side of maintaining databases that are larger than we need and monitoring them thoroughly so that we are able to handle spikes in traffic and other resource-consuming database operations.

As we grow, however, the spikes in our database utilization become more frequent and consistent and we are faced with the (good) decision of how we want to facilitate future growth.

After developing a thorough understanding of the components involved and examining the various options available to us, I have identified 4 practical ways that we can scale our database capability. These are outlined more generically in my article, 4 Ways to Scale a Production Database.

Summary

  1. Database optimization is a continual process and not a reactive one-time effort.
  2. We are always ready to scale our live databases up as needed but do not want to rely on database scaling alone to handle our needs.
  3. There are numerous benefits to a premium database tier but we have a lot of questions about the costs and benefits. Furthermore, the cost makes this option less attractive at our price point.
  4. We WILL scale out to new live instances as we grow; the only questions are when and how to balance that with other scaling options.

Option 1: Optimize the current database

The goal of this strategy is to make the current database resources stretch farther by examining which resources are being over- or under-utilized and changing the data access and update methods to optimize them.

In our case, we have already spent a lot of effort making the databases efficient. Since we are currently unwilling to attempt larger changes to our architecture and data structures, we are further limited in the types of database optimizations that may be considered.

I am sure that with enough analysis we could find additional efficiencies but they are likely to be minor efficiencies which may or may not be worth the effort they would take.

Material Cost: $0
Developer Time: Minimum 4 hours. Not more than 2 days.
Risk: low to moderate
Estimated Impact: minimal
Confidence: low

There is no reason why we could not optimize the current databases at the same time as implementing other scaling strategies.

Option 2: Scale the database resources up without further changes

The goal of this strategy is to quickly and easily scale up the resources allocated to the current database without having to make any other changes. This is by far the fastest and easiest way to grow and should always be considered when the short-term needs will exceed capacity.

While scaling a database up is comparatively fast, it does sometimes take time for the cloud provider to allocate extra resources to the database. And while it is rare, I have also seen a production database go offline for an extended period of time as a result of a botched scaling operation. That was not a fun experience.

In our case, we consistently maintain oversized databases so that we are prepared to handle spikes in traffic. Our average resource utilization is somewhere around 2%, although we do occasionally observe elevated utilization.

Material Cost: 2x the cost of the current database tier.
Developer Time: 2 minutes
Risk: very low
Estimated Impact: maintain the current level of performance for a minimum of 2x the number of current customers
Confidence: high

This is already our first-response strategy for dealing with high database utilization and should continue to be used as needed in combination with other strategies. However, we already have a lot of margin in our database, and while it is easy to scale up, there is also minor risk and moderate cost associated with scaling operations. Manually scaling the database up and down should not be a regular practice in an always-online application such as ours.

Option 3: Switch database models

The goal of this strategy is to use a more effective database model - either by switching database providers or by upgrading to a premium service tier.

In our case there are too many benefits tied to using our current database provider to consider switching providers, and it would take far too much development time to switch to a different type of database (likely with lower instead of higher performance). We can, however, upgrade to a premium service tier for approximately 3x the cost of the current tier.

This move would only add new features and resources, so the risk and uncertainty are limited to the actual vs anticipated performance of the higher tier. While we might anticipate that this move would allow us to serve two to three times the number of current customers, it is also possible that it would only give us a small bump in capacity and slightly faster queries. The other risk with making this switch is that if we later decide that we need to scale up again, the cost of continued scaling is significantly higher than if we had remained in our existing service tier.

Material Cost: 3x the cost of the current database tier.
Developer Time: 10 minutes
Risk: low
Estimated Impact: Minimum 25% increased capacity and slightly faster queries. Likely 2x increased capacity and possibly up to 6x.
Confidence: moderate
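
To make the trade-off between Option 2 and Option 3 easier to see, here is a rough sketch that divides each option's cost multiplier by its estimated capacity multiplier. The multipliers come from the estimates above; the names (OPTIONS, cost_per_capacity, summarize) are only illustrative, and the calculation ignores query speed and any premium-only features.

```python
# Cost and capacity multipliers relative to the current database tier.
# The numbers are the estimates from this article; everything else is illustrative.
OPTIONS = {
    "Option 2: scale up": {
        "cost": 2.0,
        "capacity": {"minimum": 2.0},
    },
    "Option 3: premium tier": {
        "cost": 3.0,
        "capacity": {"minimum": 1.25, "likely": 2.0, "best case": 6.0},
    },
}


def cost_per_capacity(cost_multiplier: float, capacity_multiplier: float) -> float:
    """Relative cost of one unit of capacity (1.0 means the same as today)."""
    return cost_multiplier / capacity_multiplier


def summarize(options: dict) -> dict:
    """Return the relative cost per unit of capacity for every scenario."""
    return {
        name: {
            scenario: round(cost_per_capacity(option["cost"], capacity), 2)
            for scenario, capacity in option["capacity"].items()
        }
        for name, option in options.items()
    }


if __name__ == "__main__":
    for name, scenarios in summarize(OPTIONS).items():
        for scenario, ratio in scenarios.items():
            print(f"{name} ({scenario}): {ratio}x cost per unit of capacity")
```

A ratio above 1.0 means each unit of capacity costs more than it does today. By this measure the premium tier only pays for itself on capacity alone if the real-world gain lands above 3x.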

After upgrading to the premium model, we would also be able to optimize our application to take advantage of some of the new premium features to further increase our capacity while lowering response times. Unfortunately the increased cost is a major deterrent, and while our customers would appreciate the faster response times it is unclear if they would be willing to take on the financial burden required to accomplish it (considering this is just one component of our application architecture).

Nonetheless, if we do not opt to scale out we will likely reach a point where this becomes necessary in order to maintain our current service level for our customers.

Option 4: Scale out instead of scale up

The goal of this strategy is to add new databases or new instances of your application instead of scaling up the resources for your existing database.

Scaling out to multiple live instances would come with numerous benefits such as predictable linear scaling and isolating customers from potential negative side effects caused by other customers.

In our case, this would involve not only adding a new database but also adding all of the other live infrastructure (networking, auto-scaled load-balanced VMs, storage, supporting applications, etc.). Thankfully, we have already written all of the application and deployment code to support adding new live instances.

Because we have already prepared for this, the only up-front effort required to add a new live instance would be updating some configuration files and deploying the new resources. Furthermore, we have already optimized our deployment pipeline so that we can push updates out to each live instance simultaneously rather than requiring each live instance to be updated one at a time, which saves a significant amount of time in our deployment pipeline.

One downside to scaling out is that it has only been minimally proven in our existing codebase, and there is a risk of running into issues related to managing multiple live instances (such as performance differences between instances, migrating sites between instances, managing DNS settings, handling errors in deployments when one instance succeeds and another fails, etc.). Furthermore, it is at times inconvenient for the development team to have sites in separate live instances, and deployments will take incrementally longer with each added instance despite our efforts to keep concurrent deployments fast.
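
The partial-failure case mentioned above (one instance succeeds and another fails) is easier to reason about with a small sketch. This is not our actual deployment code; it is a minimal illustration that assumes a hypothetical deploy callable and made-up instance names, pushes to every instance concurrently, and records the outcome per instance instead of stopping at the first error.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def deploy_all(
    instances: List[str], deploy: Callable[[str], None]
) -> Dict[str, Optional[str]]:
    """Run deploy(instance) for every instance concurrently.

    Returns a mapping of instance name to None on success,
    or to the error message if that deployment raised.
    """

    def attempt(instance: str) -> Optional[str]:
        try:
            deploy(instance)
            return None
        except Exception as exc:  # record the failure, keep the other deployments going
            return str(exc)

    with ThreadPoolExecutor(max_workers=max(1, len(instances))) as pool:
        outcomes = list(pool.map(attempt, instances))  # map preserves input order
    return dict(zip(instances, outcomes))


def failed_instances(results: Dict[str, Optional[str]]) -> List[str]:
    """Names of the instances whose deployment failed, sorted."""
    return sorted(name for name, error in results.items() if error is not None)


if __name__ == "__main__":
    # Hypothetical stand-in for a real deployment step.
    def fake_deploy(instance: str) -> None:
        if instance == "live-2":
            raise RuntimeError("deployment timed out")

    results = deploy_all(["live-1", "live-2", "live-3"], fake_deploy)
    print("failed:", failed_instances(results))
```

Collecting a result per instance makes it possible to retry or roll back only the instances that failed, rather than treating the whole deployment as a single success or failure.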

Material Cost: linear for the entire live architecture
Developer Time: 2 hours
Risk: moderate
Estimated Impact: Linear growth in capacity
Confidence: high

The ability to handle multiple live instances has always been built into the core architecture of our application and is a core component of our growth strategy. The question for us isn't so much if we will scale out but rather when we will scale out. Furthermore, we offer private instances to customers who want them and are willing to pay for them.

Conclusion

We are always open to optimizing our current databases when we uncover inefficiencies, but we do not anticipate that there are any significant gains to be had here. And while scaling our live databases up is always an option, it eventually reaches a point where the benefits of other scaling strategies start to become more significant.

We already have the capability to create more live instances as our growth and adoption increase. The ability to serve live websites from different instances has always been a part of our architecture and requires very little effort on our part to implement and maintain now.

Most of our unanswered questions center around the benefits and drawbacks of scaling to a premium database tier. For example: how many new customers could we support by going premium? Do we even WANT to add more customers to the same database, or would we prefer to split them into multiple instances for the security benefits? What kind of performance improvements would we realize? Would our customers be willing to pay more for the improved performance (probably not)?

Another unanswered question is when should we create new live public instances? After all, waiting for sites to slow down or stop responding is simply too late to make that call. The answer to that question is clearly related to whether or not we upgrade our live instances to premium database tiers, but even then we need a way to know when the time has come to add a new instance - either for security or for performance reasons.
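
One possible way to frame that decision is to track how often utilization spikes rather than waiting for sites to slow down. The sketch below is only an illustration of that idea: the function names, the 70% spike threshold, and the 10% allowed spike frequency are all assumptions chosen for the example, not values we have settled on.

```python
from typing import Sequence


def spike_fraction(samples: Sequence[float], spike_threshold: float) -> float:
    """Fraction of utilization samples (in percent) at or above the threshold."""
    if not samples:
        return 0.0
    spikes = sum(1 for sample in samples if sample >= spike_threshold)
    return spikes / len(samples)


def should_add_instance(
    samples: Sequence[float],
    spike_threshold: float = 70.0,     # assumed: what counts as a spike
    max_spike_fraction: float = 0.10,  # assumed: how frequent spikes may become
) -> bool:
    """True when spikes are frequent enough to plan a new live instance."""
    return spike_fraction(samples, spike_threshold) > max_spike_fraction


if __name__ == "__main__":
    # Hypothetical utilization samples: mostly idle, with two spikes.
    samples = [2, 3, 2, 85, 2, 90, 2, 2, 2, 2]
    print(should_add_instance(samples))
```

Because the check looks at how frequent spikes are over a window instead of at a single reading, an occasional burst does not trigger it, while the "more frequent and consistent" spikes described earlier would.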