I have currently a cluster HA (with three multiple masters, one for every AZ) deployed on AWS through kops. Kops deploys a K8S cluster with a pod for etcd-events and a pod for etcd-server on every master node. Every one of this pods uses a mounted volume.
All works well, for example when a master dies, the autoscaling group creates another master node in the same AZ, that recovers its volume and joins itself to the cluster. The problem that I have is respect to a disaster, a failure of an AZ.
What happens if an AZ should have problems? I periodically take volume EBS snapshots, but if I create a new volume from a snapshot (with the right tags to be discovered and attached to the new instance) the new instance mounts the new volumes, but after that, it isn't able to join with the old cluster. My plan was to create a lambda function that was triggered by a CloudWatch event that creates a new master instance in one of the two safe AZ with the volume mounted from a snapshot of the old EBS volume. But this plan has errors because it seems that I am ignoring something about Raft, Etcd, and their behavior. (I say that because I have errors from the other master nodes, and the new node isn't able to join itself to the cluster).
Suggestions?
How do you recover theoretically the situation of a single AZ disaster and the situation when all the master died? I have the EBS snapshots. Is it sufficient to use them?