Expected unvailability during Cloud SQL Postgres failovers and CPU/memory upgrades?

Question

I have some experience with AWS RDS MySQL multi-AZ (HA). I'm looking at GCP Cloud SQL Postgres HA for a new project.

I'm trying to figure how certain maintenance operations work but can't figure it out from the Cloud SQL docs.

How much unavailability does a failover cause?
How much unavailability does a CPU/memory upgrade cause?
After a failover, is it important to eventually "failback" to the original primary instance? Or can I leave it running on the standby instance indefinitely? (The Cloud SQL HA failover diagram make it seem like the two instances aren't totally symmetric.)

Just FYI, the answers for AWS RDS

Failover: usually under 70 seconds of unavailability before my application is able to issue queries again.

This is for planned failovers. (For unplanned failovers, it may take a little longer for RDS to detect that the primary instance is unresponsive before it actually initiates the failover.)
A lot of the failover lag is likely due to DNS. Using the AWS RDS Proxy service may reduce that time (they claim by ~80%). The Cloud SQL HA failover diagram shows both instances sharing a virtual IP, which might mean no DNS lag?

CPU/memory upgrade: I think AWS can accomplish this with a single failover worth of unavailability. It upgrades the standby instance (no unavailability), performs a failover, then upgrades the other instance.

On RDS, I think the two instances that are part of the HA set up are symmetric. So if you failover to the standby, it's fine to leave it that way. There's no need (as far as RDS is concerned) to failover back to the original.

Dondi Dondi · Accepted Answer · 2020-12-04T02:56:50

To answer your following questions:

As you mentioned, the duration of the unavailability would vary depending if it is a planned (manual) failover vs unplanned. It's best that you test and manually initiate the failover so you can see how long your instance would respond to it, usually it would take a minute or so. When it comes to unplanned failovers, it's pretty much covered in the docs that when failover occurs, any existing connections to the primary instance and read replicas are closed, and it will take approximately 2-3 minutes for connections to be reestablished.
To address this question, you need to understand the requirements for your instance to allow failover:

The primary instance must be in a normal operating state (not stopped, undergoing maintenance, or performing a long-running Cloud SQL instance operation such as a backup, import or export operation).

That means that failover doesn't count when upgrading your instance, changing your hardware specs (CPU/Memory) will incur downtime so you should plan ahead when making these changes.

To understand the importance of failback, here's an excerpt from this link:

High availability solutions continuously replicate data to a remote site or cloud. In the event that a primary system goes down, the remote, secondary system can be spun up and users are rerouted. This process is commonly referred to as “failover,” and it reduces downtime to seconds or minutes.

However, failover isn’t a permanent state. Once primary servers are up and running, data and applications must be restored so normal operations can resume. This process is known as failback, and it is very important from a DR testing standpoint. Here’s why: Not all replication technology is created equally when it comes to failback. In some cases, failing back to production servers can be painfully slow.

UPDATE 1: HA on Cloud SQL will provision specs for your standby instance similar to your primary, that's why you'll get billed double the price of a non-HA instance. Also, the importance of failback is not limited to any cloud providers. It is simply a good practice to make sure that all the operation returns to your primary instance instead of just leaving it on a standby instance. On that case, failback (on Cloud SQL to be specific) is really necessary to make sure that everything is back to normal after an outage.

UPDATE 2: If you don't failback, what could happen is that when there's an outage on the zone where your standby instance is running (you can't control what zone your standby instance will come from), you won't be able to do a failover as the operation will be blocked. (See the docs)

Unfortunately there's pretty much no option as the downtime is required whenever you change hardware. The procedure will require the instance to restart. Here's a link to see how long it would take.

Additional resources: https://severalnines.com/database-blog/achieving-mysql-failover-failback-google-cloud-platform-gcp

Expected unvailability during Cloud SQL Postgres failovers and CPU/memory upgrades?

1 Answers