12
votes

I'm considering using an Amazon RDS Multi-AZ deployment for a service in a production environment, and I've read the related documentation.

However, I have a question about the failover. In the FAQ of Amazon RDS, failover is described as follows:

Q: What happens during Multi-AZ failover and how long does it take?

Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB Instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer. Failover times are a function of the time it takes crash recovery to complete. Start-to-finish, failover typically completes within three minutes.
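For reference, the "connection retry at the application layer" that the FAQ recommends could look roughly like the sketch below. This is only an illustration: `connect_with_retry`, its parameters, and the `connect` callable are names made up here, not part of any AWS or database driver API. The assumption is that `connect` opens a fresh connection on every call, so the hostname is resolved again and can pick up the flipped CNAME.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, backing off between failures.

    connect is a hypothetical zero-argument callable that opens a new
    database connection and raises ConnectionError while the endpoint
    is unreachable (for example, in the middle of a failover).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
                sleep(min(max_delay, base_delay * 2 ** attempt))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

With the FAQ's "typically within three minutes" figure in mind, the attempt count and delays would need to be chosen so that the total retry window covers a failover of that length.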

From the above description, I guess there must be a monitoring service that detects failure of the primary instance and flips the CNAME.

My question is, which AZ is this monitoring service hosted in? There are 3 possibilities:

  1. Same AZ as the primary
  2. Same AZ as the standby
  3. Another AZ

Apparently 1 and 2 can't be the case, since neither could handle an entire AZ becoming unavailable. So, if 3 is the case, what if the AZ of the monitoring service goes down? Is there another service to monitor this monitoring service? It seems to be an endless chain of dominoes.

So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?

4 Answers

1
votes

So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?

I think that the "how" in this case is abstracted away from the user by design, given that RDS is a PaaS service. A great deal of a Multi-AZ deployment is hidden. However, the following are true:

  • You don't have any access to the secondary instance, unless a failover occurs
  • You are guaranteed that a secondary instance is located in a separate AZ from the primary

In his blog post, John Gemignani mentions the notion of an observer managing which RDS instance is active in the multi-AZ architecture. But to your point, what is the observer? And where is it observing from?

Here's my guess, based upon my experience with AWS:

The observer in an RDS Multi-AZ deployment is a highly available service that is deployed throughout every AZ in every region where RDS Multi-AZ is available, and it makes use of existing AWS platform services to monitor the health and state of all of the infrastructure that may affect an RDS instance. Some of the services that make up the observer may be part of the AWS platform itself, and otherwise hidden from the user.

I would be willing to bet that the same underlying services that make up CloudWatch Events are used in some capacity for the RDS Multi-AZ observer. In his blog post announcing CloudWatch Events, Jeff Barr describes the service this way:

You can think of CloudWatch Events as the central nervous system for your AWS environment. It is wired in to every nook and cranny of the supported services, and becomes aware of operational changes as they happen. Then, driven by your rules, it activates functions and sends messages (activating muscles, if you will) to respond to the environment, making changes, capturing state information, or taking corrective action.

Think of the observer the same way: it's a component of the AWS platform that provides a function that we, as users of the platform, do not need to think about. It's part of AWS's responsibility in the Shared Responsibility Model.

0
votes

Educated guess - the monitoring service runs in all the AZs and refers to a shared list of running instances (which is synchronously replicated across the AZs). As soon as the monitoring service in one AZ notices that another AZ is down, it flips the CNAMEs of all the running instances to an AZ which is currently up.
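To make the guess a bit more concrete, here is a toy sketch of how observers in several AZs could agree that an AZ is down by majority vote, which is also one way out of the "who monitors the monitor" loop in the question. Nothing here reflects how AWS actually implements it: the function names, the AZ labels, and the `reports` structure are all hypothetical.

```python
def az_is_down(target_az, reports, total_observers):
    """Return True if a majority of all observers cannot reach target_az.

    reports maps each observer's AZ to the set of AZs it currently
    cannot reach. Observers that are down themselves simply have no
    entry, and an AZ's own observer never votes on that AZ.
    """
    down_votes = sum(
        1
        for observer_az, unreachable in reports.items()
        if observer_az != target_az and target_az in unreachable
    )
    return down_votes > total_observers // 2


def pick_primary_az(primary_az, standby_az, reports, total_observers):
    """Fail over only if the primary's AZ is voted down and the standby's is not."""
    primary_down = az_is_down(primary_az, reports, total_observers)
    standby_down = az_is_down(standby_az, reports, total_observers)
    if primary_down and not standby_down:
        return standby_az  # this is where the CNAME flip would happen
    return primary_az


# Hypothetical three-AZ region where AZ "a" has gone dark.
reports = {"b": {"a"}, "c": {"a"}}
print(pick_primary_az("a", "b", reports, total_observers=3))
```

Because a decision needs a majority, losing any single AZ (including one hosting an observer) still leaves enough observers to act, and a network partition between just two AZs does not trigger a failover on its own.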

0
votes

We did not get to choose where the failover instance resides, but our primary is in US-West-2c and our secondary is in US-West-2b.

Using PostgreSQL, our data became corrupted because of a physical problem with the Amazon volume (as near as we could tell). We did not have a Multi-AZ setup at the time, so to recover we had to perform a point-in-time restore as close in time to the event as we could. Amazon support assured us that, had we gone ahead with Multi-AZ, they would have automatically failed over to the other AZ. This raises two questions: how could they have determined that, and would the data corruption have propagated to the other AZ?

Because of that disaster, we also added a read-only replica, which seems to make a lot more sense to me. We also use the RO replica for reads and other functions. My understanding from my Amazon rep is that one can think of the Multi-AZ setting as more like a RAID setup.

0
votes

From the docs, failover occurs if any of the following conditions is met:

  • Loss of availability in primary Availability Zone
  • Loss of network connectivity to primary
  • Compute unit failure on primary
  • Storage failure on primary

This implies that the monitoring is not located in the same AZ as the primary. Most likely, the read replica is using MySQL's replication status functions (https://dev.mysql.com/doc/refman/5.7/en/replication-administration-status.html) to monitor the status of the master, and takes action if the master becomes unreachable.

Of course, this raises the question: what happens if the replica's AZ fails? Amazon most likely has checks in the replica's failure detection to figure out whether the replica itself is failing or the primary is.
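One common way to build such a check is to consult a third party (a "witness") before acting. The sketch below shows that decision logic only; it is my assumption of how it could work, not something taken from the AWS docs, and `Action`, `decide`, and the three boolean inputs are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    STAY_STANDBY = "stay_standby"
    PROMOTE = "promote"


def decide(replica_sees_primary, replica_sees_witness, witness_sees_primary):
    """Decide whether the replica should promote itself.

    The three flags are the results of hypothetical health checks:
    replica -> primary, replica -> witness, and witness -> primary.
    """
    if replica_sees_primary:
        # Primary is healthy from the replica's point of view.
        return Action.STAY_STANDBY
    if not replica_sees_witness:
        # The replica can reach nobody, so it is probably the isolated one.
        return Action.STAY_STANDBY
    if witness_sees_primary:
        # Only the replica-to-primary link is broken; the primary is alive.
        return Action.STAY_STANDBY
    # Both the replica and the witness have lost the primary.
    return Action.PROMOTE


print(decide(False, True, False))
```

The key property is that the replica never promotes itself on its own observation alone, which avoids two primaries when only the replica's network is at fault.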