Are there any tools or techniques available to automatically create new instances in a different availability zone in the event that an availability zone suffers an outage in Amazon Web Services/EC2?
I think I understand how to do automatic fail over in the event of an availability zone (AZ) outage, but what about automatic recovery (create new instances in a new AZ) from an outage? Is that possible?
Example scenario:
- We have a three-instance cluster.
- An ELB round-robins traffic to the cluster.
- We can lose any one instance, but not two instances in the cluster, and still be fully functional.
- Because of (3), each instance is in a different AZ. Call them AZs A, B and C.
- The ELB health check is configured so that the ELB can ensure each instance is healthy.
- Assume that one instance is lost due to an AZ outage in AZ A.
At this point the ELB will see that the lost instance is no longer responding to health checks and will stop routing traffic to that instance. All requests will go to the two remaining healthy instances. Failover is successful.
Recovery is where I am not clear. Is there a way to automatically (i.e. no human intervention) replace the lost instance in a new AZ (e.g. AZ D)? This will avoid the AZ that had the outage (A) and not use an AZ that already has an instance in it (AZs B and C).
AutoScaling Groups?
AutoScaling Groups seem like a promising place to start, but I don't know if they can deal with this use case properly.
Questions:
In an AutoScaling Group there doesn't seem to be a way to specify that the new instances that replace dead/unhealthy instances should be created in a new AZ (e.g. create it in AZ D, not in AZ A). Is this really true? In an AutoScaling Group there doesn't seem to be a way to tell the ELB to remove the failed AZ and automatically add a new AZ. Is that right?
Are these true shortcomings in AutoScaling Groups, or am I missing something?
If this can't be done with AutoScaling Groups, is there some other tool that will do this for me automatically?
In 2011 FourSquare, Reddit and others were caught by being reliant on a single availability zone (http://www.informationweek.com/cloud-computing/infrastructure/amazon-outage-multiple-zones-a-smart-str/240009598). It seems like since then tools would have come a long way. I have been surprised by the lack of automated recovery solutions. Is each company just rolling its own solution and/or doing the recovery manually? Or maybe they're just rolling the dice and hoping it doesn't happen again?
Update:
@Steffen Opel, thanks for the detailed explanation. Auto scaling groups are looking better, but I think there is still an issue with them when used with an ELB.
Suppose I create a single auto scaling group with a min, max & desired set to 3, spread across 4 AZs. Auto scaling would create 1 instance in 3 different AZs, with the 4th AZ left empty. How do I configure the ELB? If it forwards to all 4 AZs, that won't work because one AZ will always have zero instances and the ELB will still route traffic to it. This will result in HTTP 503s being returned when traffic goes to the empty AZ. I have experienced this myself in the past. Here is an example of what I saw before.
This seems to require manually updating the ELB's AZs to just those with instances running in them. This would need to happen every time auto scaling results in a different mix of AZs. Is that right, or am I missing something?