Keep the databases top-notch

Written by wordpress May 9, 2023


Databases are often the most critical parts of applications. Prezi is no exception to this.
We store and read persistent data from many different databases while serving requests from our users.

There is often some hesitation when it comes to updating databases. Taking down databases for maintenance usually takes down the services that use them.

Early this year, the SRE team started to work on the maintenance task for our AWS RDS MySQL instances. We started looking at our options.

The classic approach is to define a maintenance window and perform the necessary actions within it. This works, but it is not very user-friendly. It is even worse if you operate at a global scale from a central infrastructure: which users do you want to hit with the downtime?

A benefit of this approach is that you create a clear state in the database before the action by disabling all write operations at the beginning. If the application can keep working with read-only access, it is also possible to offer a read-only instance during the maintenance window.

As we looked into the services that would be affected by our maintenance, we discovered that two services within the critical path of our user experience were affected. Declaring a maintenance window and shutting down the service would have been a major degradation of our offered service.

We ran tests on snapshots of those instances and concluded that we would need roughly 1 hour of downtime to finish.

We said: this is the way to go only if we find nothing else.

There are some strategies that can help to hide most of the downtime from your users. It is possible to access a database through a load balancer that is able to forward traffic only to a specific instance.

It is also possible to use a primary/secondary setup where you can promote an already patched instance to become the new primary instance.

Most of the time, the challenge in this approach is the synchronization between the instances. To name a few issues: transactions, foreign keys, and network latency between the different systems, which can make transactions tedious.

This design pattern must also be incorporated from the beginning. Changing the setup from connecting to a single instance to an advanced setup needs downtime too.

Another downside of this approach is that you have to pay for both instances.

In our case, this approach wasn't the way to go because most of our databases were not built with such a design.

AWS introduced blue/green deployments for RDS at re:Invent in late 2022. There is an overlay at the top of the RDS console that keeps coming up and tells you to try blue/green deployments.


Blue/green offers the possibility to create a database cluster on the fly from a primary instance (blue). The newly created instance (green) can be updated without affecting the primary instance. There is a logical sync from blue to green: under the hood, AWS RDS sets up MySQL replication that syncs contents from the blue environment to the green one.

The initial creation of the blue/green setup is transparent for the applications that use the database. So you can set up all the things without affecting your users.
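As a rough sketch of how such a setup could be requested from code rather than from the console, the snippet below builds the arguments for the RDS CreateBlueGreenDeployment API and hands them to a client object. It assumes a boto3 RDS client (boto3.client("rds")); the function names, the deployment name, and the ARN in the usage comment are made up for illustration and are not taken from our tooling.

```python
def build_blue_green_request(name, source_arn, target_engine_version,
                             target_parameter_group):
    """Build keyword arguments for the RDS CreateBlueGreenDeployment call."""
    if not target_parameter_group:
        # The parameter group is chosen here, when the deployment is created.
        raise ValueError("choose the parameter group for green up front")
    return {
        "BlueGreenDeploymentName": name,
        "Source": source_arn,  # ARN of the blue (current) database
        "TargetEngineVersion": target_engine_version,
        "TargetDBParameterGroupName": target_parameter_group,
    }


def create_blue_green(client, name, source_arn, target_engine_version,
                      target_parameter_group):
    """Create the deployment and return its identifier.

    `client` is assumed to behave like boto3.client("rds").
    """
    request = build_blue_green_request(
        name, source_arn, target_engine_version, target_parameter_group
    )
    response = client.create_blue_green_deployment(**request)
    return response["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]


# Hypothetical usage (all values are placeholders):
# import boto3
# deployment_id = create_blue_green(
#     boto3.client("rds"),
#     name="example-minor-upgrade",
#     source_arn="arn:aws:rds:us-east-1:123456789012:db:example-db",
#     target_engine_version="8.0.32",
#     target_parameter_group="example-mysql80",
# )
```

Keeping the request builder separate from the API call makes it possible to review or log the exact arguments before anything is created.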

Once ready and running, the connections can be switched over from blue to green with a short downtime of approximately 1 minute.
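A minimal sketch of the switchover step, again assuming a boto3 RDS client. The helper names are hypothetical. The guard that refuses to switch unless the deployment reports the AVAILABLE status is our own precaution for this example, not something the API requires you to code yourself.

```python
def current_status(client, deployment_id):
    """Return the status string of one blue/green deployment."""
    response = client.describe_blue_green_deployments(
        BlueGreenDeploymentIdentifier=deployment_id
    )
    return response["BlueGreenDeployments"][0]["Status"]


def switch_over(client, deployment_id, timeout_seconds=300):
    """Trigger the switchover, but only from a deployment that is ready.

    `client` is assumed to behave like boto3.client("rds").
    """
    status = current_status(client, deployment_id)
    if status != "AVAILABLE":
        raise RuntimeError(f"deployment not ready for switchover: {status}")
    response = client.switchover_blue_green_deployment(
        BlueGreenDeploymentIdentifier=deployment_id,
        # Upper bound in seconds for the switchover to complete.
        SwitchoverTimeout=timeout_seconds,
    )
    return response["BlueGreenDeployment"]["Status"]
```

The timeout gives the operation an upper bound, so a switchover that cannot complete in time does not leave the applications waiting indefinitely.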

After successfully switching traffic from the blue to the green database, the old instance is retained. It can serve as a fallback if the new instance is somehow broken. Once you are confident that nothing is broken, it can be removed. So you only have to run two databases side by side for the duration of the maintenance.

In the past 4 weeks, we used this approach for updating roughly 30 databases to the latest MySQL minor version. It worked like a charm. Even for our most critical databases, we were able to reduce the downtime to approximately 3 minutes.

Things we learned during that process:

There is no way back after you’ve triggered the switchover process. After moving the traffic over from blue to green, you’ll end up with two independent instances that are no longer in sync. The replication in between is gone.
Your old instance even has a new VPC endpoint. There is no way to sync those databases again.
This means that if you need the old instance again, a possible rollback is to redeploy the application with a config that points to the old instance.
Doing downtime-free database upgrades manually is a tedious, complex, and error-prone process. It involves a lot of clicking and a lot of waiting. It would be better to automate the steps.
Make sure you select the correct MySQL parameter group while setting up your blue/green deployment. You can't change this on the green instance afterward. Rebuilding the blue/green deployment because of this cost us some time.
Always take a fresh manual snapshot before any action. Even though we did not need it, it makes you more confident if you have a clear state to go back to.
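The snapshot advice above can be sketched as follows. This assumes a boto3 RDS client; the identifier scheme with the "pre-upgrade" marker and a UTC timestamp is only a naming convention invented for this example.

```python
from datetime import datetime, timezone


def snapshot_identifier(instance_id, now=None):
    """Derive a unique snapshot name from the instance name and a timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"{instance_id}-pre-upgrade-{now:%Y%m%d-%H%M%S}"


def take_manual_snapshot(client, instance_id, now=None):
    """Create a manual snapshot and block until it is available.

    `client` is assumed to behave like boto3.client("rds").
    """
    snapshot_id = snapshot_identifier(instance_id, now)
    client.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=instance_id,
    )
    # Wait for the snapshot to finish before touching the database.
    waiter = client.get_waiter("db_snapshot_available")
    waiter.wait(DBSnapshotIdentifier=snapshot_id)
    return snapshot_id
```

Putting a timestamp into the name keeps repeated runs from colliding and makes it obvious which snapshot belongs to which maintenance.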

The blue/green deployment makes database updates easy and very user-friendly.

The next step is building automation scripts around this, which prepare everything up to the point of the switchover. This will make the process less error-prone and it will require less engineer time.
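One building block of such automation could look like the sketch below: it polls a blue/green deployment until it is ready for switchover and then stops, leaving the switchover itself to an engineer. It assumes a boto3 RDS client; the function name, the polling interval, and the set of statuses treated as failures are assumptions made for this example.

```python
import time

# Statuses treated as "stop and investigate" in this sketch (an assumption).
FAILED_STATUSES = {"INVALID_CONFIGURATION", "SWITCHOVER_FAILED", "DELETING"}


def wait_until_ready(client, deployment_id, poll_seconds=30, max_polls=120,
                     sleep=time.sleep):
    """Poll until the deployment reports AVAILABLE, then return the status.

    `client` is assumed to behave like boto3.client("rds"). `sleep` is
    injectable so the loop can be exercised without real waiting.
    """
    for _ in range(max_polls):
        response = client.describe_blue_green_deployments(
            BlueGreenDeploymentIdentifier=deployment_id
        )
        status = response["BlueGreenDeployments"][0]["Status"]
        if status == "AVAILABLE":
            return status
        if status in FAILED_STATUSES:
            raise RuntimeError(f"deployment entered status {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"deployment {deployment_id} was not ready in time")
```

Stopping at the ready state keeps the irreversible step, the switchover, as a deliberate human decision while removing most of the clicking and waiting.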
