
I have a CoreOS cluster with 3 AWS EC2 instances. The cluster was set up using the CoreOS CloudFormation stack. After the cluster was up and running, I needed to update the Auto Scaling launch configuration so that the EC2 instances pick up an instance profile. I copied the existing launch configuration and updated the IAM role for the EC2 instances, then terminated the EC2 instances in the fleet, letting Auto Scaling fire up new ones. The new instances did assume their new roles; however, the cluster seems to have lost its machine information:

ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
   Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
  Drop-In: /run/systemd/system/etcd.service.d
       └─10-oem.conf, 20-cloudinit.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
  Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
 Main PID: 14124 (code=exited, status=1/FAILURE)

Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process  exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO      | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING   | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL  | fail joining the cluster via given peers after 3 retries
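
(For context, the launch-configuration swap described above was done roughly like this; only a sketch with the AWS CLI, and the resource names and AMI id here are placeholders rather than the ones from my stack:)

# copy of the old launch configuration, now with an instance profile attached
aws autoscaling create-launch-configuration \
  --launch-configuration-name coreos-with-profile \
  --image-id ami-xxxxxxxx \
  --instance-type m3.medium \
  --iam-instance-profile coreos-instance-profile \
  --user-data file://cloud-config.yml
# point the Auto Scaling group at the new launch configuration
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name coreos-autoscale-group \
  --launch-configuration-name coreos-with-profile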

The same discovery token was used from cloud-init. https://discovery.etcd.io/<cluster token> shows 6 machines: the 3 dead ones plus the 3 new ones. So it looks like the 3 new instances joined the cluster alright. The journalctl -u etcd.service logs show that etcd timed out on the dead instances and got connection refused from the new ones.

journalctl -u etcd.service shows:
...

Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO      | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)

etcdctl --debug ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error:  501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
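
(For reference, the discovery endpoint can also be queried directly; <cluster token> here is the same placeholder as above:)

curl -s https://discovery.etcd.io/<cluster token>

It returns JSON listing every peer that has registered under the token, which would explain why the 3 dead machines still show up next to the 3 new ones.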

Maybe this is not the right process for updating a cluster's configuration, but IF the cluster does need auto scaling for whatever reason (load-triggered, for example), will fleet still be able to function with dead instances and new instances mixed in the pool?

How can I recover from this situation without tearing the cluster down and rebuilding it?

Xueshan


2 Answers

1 vote

In this setup etcd will not retain a quorum of machines and can't operate successfully. The best approach to autoscaling would be to set up two groups of machines:

  1. A fixed number (1-9) of etcd machines that will always be up. These are set up with a discovery token or static networking like normal (a cloud-config sketch for this group follows the example below).
  2. Your autoscaling group, which doesn't start etcd, but instead configures fleet (and any other tool) to use the fixed etcd cluster. You can do this in cloud-config. Here's an example that also sets some fleet metadata so you can schedule jobs specifically to the autoscaled machines if desired:
#cloud-config
coreos:
  fleet:
    metadata: "role=autoscale"
    etcd_servers: "http://<etcd-1>:4001,http://<etcd-2>:4001,http://<etcd-3>:4001,http://<etcd-4>:4001,http://<etcd-5>:4001,http://<etcd-6>:4001"
  units:
    - name: fleet.service
      command: start

The validator wouldn't let me put in any 10.x IP addresses in my answer (wtf!?), so be sure to replace the <etcd-N> placeholders above with the addresses of your fixed etcd machines.
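
For completeness, here is a sketch of what the cloud-config for the fixed etcd group (item 1 above) might look like when bootstrapped with a discovery token; <token> is a placeholder and $private_ipv4 is the CoreOS substitution variable:

#cloud-config
coreos:
  etcd:
    # generate a fresh token with: curl https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start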

1 vote

You must have at least one machine always running with the discovery token. As soon as all of them go down, the heartbeat will fail and no new machine will be able to join; you will need a new token for the machines to form a cluster again.
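
For example, a fresh token can be requested from the public discovery service and dropped into the cloud-config before relaunching the machines (a quick sketch):

# returns a URL like https://discovery.etcd.io/<new token>
curl https://discovery.etcd.io/new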