How is the Multi-AZ deployment of Amazon RDS realized?

Question

I'm considering using an Amazon RDS Multi-AZ deployment for a production service, and I've read the related documentation.

However, I have a question about the failover. In the FAQ of Amazon RDS, failover is described as follows:

Q: What happens during Multi-AZ failover and how long does it take?

Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB Instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer. Failover times are a function of the time it takes crash recovery to complete. Start-to-finish, failover typically completes within three minutes.

From the above description, I assume there must be a monitoring service that detects failure of the primary instance and performs the flip.

My question is: which AZ hosts this monitoring service? There are three possibilities:

1. The same AZ as the primary
2. The same AZ as the standby
3. Another AZ

Options 1 and 2 can't be the case, since neither could handle an entire AZ becoming unavailable. But if it's option 3, what happens when the monitoring service's own AZ goes down? Is there another service that monitors the monitoring service? It seems like an endless chain of dominoes.

So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?

cerberus · Accepted Answer · 2018-06-12T20:27:55

So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?

I think the "how" in this case is abstracted away from the user by design, given that RDS is a PaaS offering. A great deal of a Multi-AZ deployment is hidden, but the following are true:

- You don't have any access to the secondary instance unless a failover occurs
- You are guaranteed that the secondary instance is located in a separate AZ from the primary

In his blog post, John Gemignani mentions the notion of an observer managing which RDS instance is active in the multi-AZ architecture. But to your point, what is the observer? And where is it observing from?

Here's my guess, based upon my experience with AWS:

The observer in an RDS Multi-AZ deployment is a highly available service that is deployed across every AZ in every region where RDS Multi-AZ is available, and that uses existing AWS platform services to monitor the health and state of all the infrastructure that may affect an RDS instance. Some of the services that make up the observer may be part of the AWS platform itself, and otherwise hidden from the user.

I would be willing to bet that the same underlying services that make up CloudWatch Events are used in some capacity by the RDS Multi-AZ observer. In his blog post announcing CloudWatch Events, Jeff Barr describes the service this way:

You can think of CloudWatch Events as the central nervous system for your AWS environment. It is wired in to every nook and cranny of the supported services, and becomes aware of operational changes as they happen. Then, driven by your rules, it activates functions and sends messages (activating muscles, if you will) to respond to the environment, making changes, capturing state information, or taking corrective action.

Think of the observer the same way: it's a component of the AWS platform that provides a function that we, as users of the platform, do not need to think about. It falls under AWS's responsibility in the Shared Responsibility Model.
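What does remain on your side of that model is the advice in the FAQ you quoted: retry database connections at the application layer. Below is a minimal sketch of that idea in Python, using exponential backoff with jitter. The function name `connect_with_retry` and all of its parameters are my own illustration, not part of any AWS or driver API, and the default exception types are an assumption; substitute whatever your database driver raises when a connection fails.

```python
import random
import time


def connect_with_retry(connect, attempts=6, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, OSError), sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect  -- hypothetical zero-argument callable that opens a NEW
                connection each time it is called
    retry_on -- exception types treated as transient (an assumption;
                pass your driver's connection-error class here)
    sleep    -- injectable so the wait can be skipped in tests
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) capped at max_delay,
            # plus random jitter so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

With the defaults, the waits add up to at least 31 seconds across six attempts, so to ride out the "within three minutes" figure from the FAQ you would raise `attempts` or `max_delay`. Because the failover works by flipping the CNAME, `connect` should open a fresh connection by hostname on every call, so that the endpoint is resolved again rather than reusing a cached address.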
