Jonty Behr  •  05 Feb 2019

Upgrading or Migrating RDS Databases With (Almost) No Downtime

On one of our projects, we've been running an AWS RDS database for the past few years. However, because RDS is relatively expensive, we usually purchase reserved instances to soften the blow a little bit. As renewal time came around, I found that the reserved-instance cost for our instance type (m3.large) was actually higher than for the latest generation (m5.large). A quick comparison also showed improved CPU performance (10 ECU vs 6.5 ECU), slightly more RAM (8 GB vs 7.5 GB) and much improved network performance. What's not to like?

Initial thoughts

Initially I thought that we would have to turn the production site off for a period of time. This would involve taking a database snapshot through RDS, launching a new instance based on the snapshot and then changing all the app configs to point to the new RDS instance. This would entail at least an hour of downtime (and possibly more), something that we really wanted to avoid as this is a relatively high-traffic site.

A better option

After some investigation, I found a much better way to achieve the migration with almost no downtime.

First, backup!

This was our first step: if anything went completely wrong, we would at least have a very recent backup. So I took an RDS snapshot and also ran a manual mysqldump, copying the output to an offsite location.
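
For reference, the manual dump was nothing fancy. A minimal sketch, assuming a MySQL-flavoured RDS instance, with the endpoint, user, database and bucket names all being placeholders:

    # Consistent dump (InnoDB) without holding table locks for the whole run
    mysqldump \
      --host <rds-endpoint> \
      --user <backup-user> --password \
      --single-transaction --routines --triggers \
      <database-name> | gzip > backup.sql.gz

    # Copy the dump off to a separate location
    aws s3 cp backup.sql.gz s3://<offsite-backup-bucket>/backup.sql.gz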

Create an RDS Replica

Using the RDS console, I created a replica from the master instance. RDS doesn't constrain the instance type you choose for the replica, so we set it to m5.large even though the master was an m3.large. I left the replica running for a few hours to ensure that it had fully caught up with the master.
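
I did this through the console, but the equivalent AWS CLI call looks roughly like this (the identifiers are ours, everything else is left at defaults):

    # Create a read replica of the current master on the newer instance class
    aws rds create-db-instance-read-replica \
      --source-db-instance-identifier app-master \
      --db-instance-identifier app-replica \
      --db-instance-class db.m5.large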

Promote the replica

As RDS replicas are read-only, it would not be possible to point our applications to the replica for write operations (inserts and updates). So we waited for a relatively quiet period (i.e. late at night!) and I checked the RDS monitoring to ensure that there was no replication lag.
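
I used the monitoring graphs in the console, but if you prefer the CLI, something along these lines should show the recent ReplicaLag metric (the timestamps are purely illustrative):

    # Maximum replica lag (in seconds) over a recent window, sampled per minute
    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name ReplicaLag \
      --dimensions Name=DBInstanceIdentifier,Value=app-replica \
      --statistics Maximum \
      --start-time 2019-02-04T22:00:00Z \
      --end-time 2019-02-04T22:15:00Z \
      --period 60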

After putting the site into maintenance mode, I promoted the replica to a standalone instance. At this point, I could simply have changed the app database config to point to the replica (app-replica) instead of the master (app-master). Truth be told, I was a bit worried that I'd miss a config setting somewhere, as we also have a few related micro-services connecting to the same database. Instead, I renamed the old master to app-master-old and then renamed the newly promoted instance to app-master. By doing this, I did not need to change any database connection settings anywhere.
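
Again, this was all done in the console, but as a rough CLI sketch the promotion and the two renames look like this:

    # Promote the replica so it accepts writes
    aws rds promote-read-replica --db-instance-identifier app-replica

    # Rename the old master out of the way...
    aws rds modify-db-instance \
      --db-instance-identifier app-master \
      --new-db-instance-identifier app-master-old \
      --apply-immediately

    # ...then give the promoted replica the old master's name
    aws rds modify-db-instance \
      --db-instance-identifier app-replica \
      --new-db-instance-identifier app-master \
      --apply-immediately

Renaming an instance also changes its endpoint (the instance identifier is part of the hostname), which is why the apps picked up the new master without any config changes; there can be a short wait while the new DNS entry propagates.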

Once that had all taken place, I took the sites out of maintenance mode. All in all, there was about 3 minutes of downtime, which I was really happy with.

An unexpected upside

One of the benefits of changing the instance names in this way is that the CloudWatch metrics for app-master were merged together, i.e. the old app-master and the new app-master are all shown on the same chart. Nothing ground-breaking, but nice nonetheless.

In the End

Yes, there are other ways of achieving the same thing, but most of them would have been more complicated and taken more time in planning and execution. We wanted this to be done and dusted with the minimum fuss, and I was really happy with the outcome.