Automatic recovery from an availability zone outage?

Question

Are there any tools or techniques available to automatically create new instances in a different availability zone in the event that an availability zone suffers an outage in Amazon Web Services/EC2?

I think I understand how to do automatic fail over in the event of an availability zone (AZ) outage, but what about automatic recovery (create new instances in a new AZ) from an outage? Is that possible?

Example scenario:

1. We have a three-instance cluster.
2. An ELB round-robins traffic to the cluster.
3. We can lose any one instance, but not two instances in the cluster, and still be fully functional.
4. Because of (3), each instance is in a different AZ. Call them AZs A, B and C.
5. The ELB health check is configured so that the ELB can ensure each instance is healthy.
6. Assume that one instance is lost due to an AZ outage in AZ A.

At this point the ELB will see that the lost instance is no longer responding to health checks and will stop routing traffic to that instance. All requests will go to the two remaining healthy instances. Failover is successful.
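To make the failover step concrete, here is a minimal Python sketch of round-robin routing that skips instances failing their health check. It is only a model of the behavior described above, not how the ELB is implemented; the instance IDs and the `route_requests` helper are hypothetical.

```python
from itertools import cycle


def route_requests(instances, healthy, n_requests):
    """Round-robin n_requests across instances, skipping unhealthy ones."""
    if not any(healthy[i] for i in instances):
        raise RuntimeError("no healthy instances to route to")
    targets = []
    ring = cycle(instances)
    while len(targets) < n_requests:
        candidate = next(ring)
        if healthy[candidate]:  # failed health check => no traffic
            targets.append(candidate)
    return targets


# Hypothetical instance IDs, one per AZ (A, B, C); AZ A has gone down.
instances = ["i-a", "i-b", "i-c"]
healthy = {"i-a": False, "i-b": True, "i-c": True}
print(route_requests(instances, healthy, 4))
```

All four requests land on the two surviving instances, which is the failover part; nothing in this model replaces the lost instance.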

Recovery is where I am not clear. Is there a way to automatically (i.e. no human intervention) replace the lost instance in a new AZ (e.g. AZ D)? This will avoid the AZ that had the outage (A) and not use an AZ that already has an instance in it (AZs B and C).

AutoScaling Groups?

AutoScaling Groups seem like a promising place to start, but I don't know if they can deal with this use case properly.

Questions:

1. In an AutoScaling Group there doesn't seem to be a way to specify that the new instances that replace dead/unhealthy instances should be created in a new AZ (e.g. create it in AZ D, not in AZ A). Is this really true?
2. In an AutoScaling Group there doesn't seem to be a way to tell the ELB to remove the failed AZ and automatically add a new AZ. Is that right?

Are these true shortcomings in AutoScaling Groups, or am I missing something?

If this can't be done with AutoScaling Groups, is there some other tool that will do this for me automatically?

In 2011 FourSquare, Reddit and others were caught by being reliant on a single availability zone (http://www.informationweek.com/cloud-computing/infrastructure/amazon-outage-multiple-zones-a-smart-str/240009598). It seems like since then tools would have come a long way. I have been surprised by the lack of automated recovery solutions. Is each company just rolling its own solution and/or doing the recovery manually? Or maybe they're just rolling the dice and hoping it doesn't happen again?

Update:

@Steffen Opel, thanks for the detailed explanation. Auto scaling groups are looking better, but I think there is still an issue with them when used with an ELB.

Suppose I create a single auto scaling group with a min, max & desired set to 3, spread across 4 AZs. Auto scaling would create 1 instance in 3 different AZs, with the 4th AZ left empty. How do I configure the ELB? If it forwards to all 4 AZs, that won't work because one AZ will always have zero instances and the ELB will still route traffic to it. This will result in HTTP 503s being returned when traffic goes to the empty AZ. I have experienced this myself in the past. Here is an example of what I saw before.

This seems to require manually updating the ELB's AZs to just those with instances running in them. This would need to happen every time auto scaling results in a different mix of AZs. Is that right, or am I missing something?
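The manual step described above amounts to a set difference between the AZs enabled on the ELB and the AZs that currently hold instances. A minimal Python sketch follows, assuming the zone names shown; `plan_elb_zone_changes` is a hypothetical helper that only computes the plan and makes no AWS calls.

```python
def plan_elb_zone_changes(elb_zones, instance_zones):
    """Return (to_enable, to_disable) so the ELB covers only AZs with instances."""
    occupied = set(instance_zones)  # AZs that currently hold at least one instance
    current = set(elb_zones)        # AZs currently enabled on the ELB
    return sorted(occupied - current), sorted(current - occupied)


# Assumed example: ELB enabled in 4 AZs, instances running in only 3 of them.
elb_zones = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]
instance_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # one entry per instance
to_enable, to_disable = plan_elb_zone_changes(elb_zones, instance_zones)
print(to_enable, to_disable)
```

Applying the plan would still mean calling the ELB API to enable and disable zones (in boto3, for instance, the classic ELB client offers `enable_availability_zones_for_load_balancer` and `disable_availability_zones_for_load_balancer`), and re-running it whenever Auto Scaling changes the mix of AZs.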

Steffen Opel · Accepted Answer · 2013-05-01T19:59:39

Is there a way to automatically (i.e. no human intervention) replace the lost instance in a new AZ (e.g. AZ D)?

Auto Scaling is indeed the appropriate service for your use case - to answer your respective questions:

In an AutoScaling Group there doesn't seem to be a way to specify that the new instances that replace dead/unhealthy instances should be created in a new AZ (e.g. create it in AZ D, not in AZ A). Is this really true? In an AutoScaling Group there doesn't seem to be a way to tell the ELB to remove the failed AZ and automatically add a new AZ. Is that right?

You don't have to specify any of that explicitly; it's implied in how Auto Scaling works (see Auto Scaling Concepts and Terminology). You simply configure an Auto Scaling group with a) the number of instances you want to run (by defining the minimum, maximum, and desired number of running EC2 instances the group must have) and b) which AZs are appropriate targets for your instances (usually/ideally all AZs available in your account within a region).
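As a rough illustration of those two settings, here is a Python sketch that builds the request for a fixed-size group. The group, launch configuration, and ELB names are hypothetical, and using the ELB health check with a 300-second grace period is an assumption of mine rather than something from the question. The helper only builds a dictionary; it does not call AWS.

```python
def build_asg_request(name, launch_config, zones, size, elb_name):
    """Build keyword arguments for creating a fixed-size Auto Scaling group."""
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        # a) how many instances to run: min = max = desired keeps the size fixed
        "MinSize": size,
        "MaxSize": size,
        "DesiredCapacity": size,
        # b) which AZs are acceptable targets for those instances
        "AvailabilityZones": list(zones),
        "LoadBalancerNames": [elb_name],
        # Assumption: replace instances that fail the ELB health check
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }


# Hypothetical names throughout.
request = build_asg_request(
    name="web-cluster",
    launch_config="web-cluster-lc",
    zones=["us-east-1a", "us-east-1b", "us-east-1c"],
    size=3,
    elb_name="web-elb",
)
print(request)
```

With boto3 installed and credentials configured, this dictionary could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`; note that there is no parameter for choosing a replacement AZ, because the group picks one from the configured list itself.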

Auto Scaling then takes care of a) starting the requested number of instances and b) balancing these instances across the configured AZs. An AZ outage is handled automatically, see Availability Zones and Regions:

Auto Scaling lets you take advantage of the safety and reliability of geographic redundancy by spanning Auto Scaling groups across multiple Availability Zones within a region. When one Availability Zone becomes unhealthy or unavailable, Auto Scaling launches new instances in an unaffected Availability Zone. When the unhealthy Availability Zone returns to a healthy state, Auto Scaling automatically redistributes the application instances evenly across all of the designated Availability Zones. [emphasis mine]

The subsequent section Instance Distribution and Balance Across Multiple Zones explains the algorithm further:

Auto Scaling attempts to distribute instances evenly between the Availability Zones that are enabled for your Auto Scaling group. Auto Scaling does this by attempting to launch new instances in the Availability Zone with the fewest instances. If the attempt fails, however, Auto Scaling will attempt to launch in other zones until it succeeds. [emphasis mine]
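The quoted rule can be modeled in a few lines of Python. This is a simplified sketch of the documented behavior (fewest instances first, then fall back to other zones), not Auto Scaling's actual implementation; the zone names, the `can_launch` callback, and the alphabetical tie-break are assumptions made for the example.

```python
def launch_instance(counts, can_launch):
    """Try the enabled AZ with the fewest instances first; fall back to others."""
    # Ties are broken alphabetically here, purely to keep the example deterministic.
    for zone in sorted(counts, key=lambda z: (counts[z], z)):
        if can_launch(zone):
            counts[zone] += 1
            return zone
    raise RuntimeError("launch failed in every enabled zone")


# The scenario from the question: the instance in AZ A was lost in an outage,
# B and C still have one instance each, and D is enabled but empty.
counts = {"a": 0, "b": 1, "c": 1, "d": 0}
down = {"a"}
zone = launch_instance(counts, lambda z: z not in down)
print(zone, counts)
```

The attempt in AZ A fails, so the replacement ends up in AZ D without anyone naming D explicitly.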

Please check the linked documentation for even more details and how edge cases are handled.

Update

Regarding your follow up question about the number of AZs being higher than the number of instances, I think you need to resort to a pragmatic approach:

You should simply select a number of AZs equal to or lower than the number of instances you want to run; in case of an AZ outage, Auto Scaling will happily balance your instances across the remaining healthy AZs, which means you'd be able to survive the outage of 2 out of 3 AZs in your example and still have all 3 instances running in the remaining AZ.
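A quick Python sketch of what that looks like for 3 instances in 3 AZs. The `spread` function is a hypothetical model of an even distribution over whichever zones are still healthy; which zone receives the extra instance is arbitrary in this model.

```python
def spread(desired, healthy_zones):
    """Distribute `desired` instances as evenly as possible over healthy zones."""
    if not healthy_zones:
        raise ValueError("no healthy zones left")
    base, extra = divmod(desired, len(healthy_zones))
    # The first `extra` zones (alphabetically) carry one instance more.
    return {
        zone: base + (1 if i < extra else 0)
        for i, zone in enumerate(sorted(healthy_zones))
    }


print(spread(3, ["a", "b", "c"]))  # all AZs healthy
print(spread(3, ["b", "c"]))       # AZ A is down
print(spread(3, ["c"]))            # AZs A and B are down
```

In every case the group still runs 3 instances and no enabled AZ is left empty, so the ELB never routes to a zone without an instance as long as that zone is healthy.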

Please note that while it might be tempting to use as many AZs as are available, new customers can access only three EC2 Availability Zones in US East (Northern Virginia) and two in US West (Northern California) anyway (see Global Infrastructure); i.e., only older accounts might actually have access to all 5 AZs in us-east-1, some have just 4, and newer ones 3 at most.

I consider this to be a legacy issue, i.e. AWS is apparently rotating older AZs out of operation. For example, even if you have access to all 5 AZs in us-east-1, some instance types might not be available in all of them (e.g. the new EC2 Second Generation Standard Instances m3.xlarge and m3.2xlarge are only available in 3 out of 5 AZs in one of the accounts I'm using).

Put another way, 2-3 AZs are considered to be a fairly good compromise for fault tolerance within a region. If anything, cross-region fault tolerance would probably be the next thing I'd be worried about.