Adding removed etcd member in Kubernetes master

Question

I was following Kelsey Hightower's kubernetes-the-hard-way repo and successfully created a cluster with 3 master nodes and 3 worker nodes. Here are the problems encountered when removing one of the etcd members and then adding it back, also with all the steps used:

3 master nodes:
10.240.0.10 controller-0
10.240.0.11 controller-1
10.240.0.12 controller-2

Step 1:

isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list   --endpoints=https://127.0.0.1:2379   --cacert=/etc/etcd/ca.pem   --cert=/etc/etcd/kubernetes.pem   --key=/etc/etcd/kubernetes-key.pem

Result:

b28b52253c9d447e, started, controller-2, https://10.240.0.12:2380, https://10.240.0.12:2379
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379

Step 2 (Remove etcd member of controller-2):

isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member remove b28b52253c9d447e   --endpoints=https://127.0.0.1:2379   --cacert=/etc/etcd/ca.pem   --cert=/etc/etcd/kubernetes.pem   --key=/etc/etcd/kubernetes-key.pem

Step 3 (Add the member back):

isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member add controller-2 --peer-urls=https://10.240.0.12:2380  --endpoints=https://127.0.0.1:2379   --cacert=/etc/etcd/ca.pem   --cert=/etc/etcd/kubernetes.pem   --key=/etc/etcd/kubernetes-key.pem

Result:

Member 66d450d03498eb5c added to cluster 3e7cc799faffb625 ETCD_NAME="controller-2" ETCD_INITIAL_CLUSTER="controller-2=https://10.240.0.12:2380,controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380" ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.240.0.12:2380" ETCD_INITIAL_CLUSTER_STATE="existing"

Step 4 (run member list command):

isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list   --endpoints=https://127.0.0.1:2379   --cacert=/etc/etcd/ca.pem   --cert=/etc/etcd/kubernetes.pem   --key=/etc/etcd/kubernetes-key.pem

Result:

66d450d03498eb5c, unstarted, , https://10.240.0.12:2380,
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379 ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379

Step 5 (Run the command to start etcd in controller-2):

isaac@controller-2:~$ sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:
2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=http://10.240.0.10:2380,controller-1=http://10.240.0.11:2380,controller-2=http://10.240.0.1
2:2380 --ca-file /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem

Result:

2019-06-09 13:10:14.958799 I | etcdmain: etcd Version: 3.3.9 2019-06-09 13:10:14.959022 I | etcdmain: Git SHA: fca8add78 2019-06-09 13:10:14.959106 I | etcdmain: Go Version: go1.10.3 2019-06-09 13:10:14.959177 I | etcdmain: Go OS/Arch: linux/amd64 2019-06-09 13:10:14.959237 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1 2019-06-09 13:10:14.959312 W | etcdmain: no data-dir provided, using default data-dir ./controller-2.etcd 2019-06-09 13:10:14.959435 N | etcdmain: the server is already initialized as member before, starting as etcd member... 2019-06-09 13:10:14.959575 C | etcdmain: cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented

Clearly, the etcd service did not start as expected, so I do the troubleshooting as below:

isaac@controller-2:~$ sudo systemctl status etcd

Result:

● etcd.service - etcd Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Sun 2019-06-09 13:06:55 UTC; 29min ago Docs: https://github.com/coreos Process: 1876 ExecStart=/usr/local/bin/etcd --name controller-2 --cert-file=/etc/etcd/kubernetes.pem --key-file=/etc/etcd/kubernetes-key.pem --peer-cert-file=/etc/etcd/kubernetes.pem --peer-key-file=/etc/etcd/kube Main PID: 1876 (code=exited, status=0/SUCCESS) Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer f98dc20bce6225a0 Jun 09 13:06:55 controller-2 etcd[1876]: stopping peer ffed16798470cab5... Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer) Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer) Jun 09 13:06:55 controller-2 etcd[1876]: stopped HTTP pipelining with peer ffed16798470cab5 Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream MsgApp v2 reader) Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream Message reader) Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer ffed16798470cab5 Jun 09 13:06:55 controller-2 etcd[1876]: failed to find member f98dc20bce6225a0 in cluster 3e7cc799faffb625 Jun 09 13:06:55 controller-2 etcd[1876]: forgot to set Type=notify in systemd service file?

I indeed tried to start the etcd member using different commands but seems the etcd of controller-2 still stuck at unstarted state. May I know the reason of that? Any pointers would be highly appreciated! Thanks.

Be sure to delete the existing state directory from any newly joined etcd member, as it should sync the current state of the cluster from its peers (that's what the initial-cluster and initial-cluster-state variables are for); also, you'll want to fix cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented as that's not a good message — mdaniel

Isaac Wong Isaac Wong · Accepted Answer · 2019-06-10T15:41:11

Turned out I solved the problem as follows (credit to Matthew):

Delete the etcd data directory with the following command:

rm -rf  /var/lib/etcd/*

To fix the message cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented, I revised the command to start the etcd as follows:

sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380,controller-2=https://10.240.0.12:2380 --peer-trusted-ca-file  /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem --peer-cert-file /etc/etcd/kubernetes.pem --peer-key-file /etc/etcd/kubernetes-key.pem --data-dir /var/lib/etcd

A few points to note here:

The newly added arguments --cert-file and --key-file presented the required key and certificate of controller2.
Argument --peer-trusted-ca-file is also presented so as to check if the x509 certificate presented by controller0 and controller1 are signed by a known CA. If this is not presented, error etcdserver: could not get cluster response from https://10.240.0.11:2380: Get https://10.240.0.11:2380/members: x509: certificate signed by unknown authority may be encountered.
The value presented for the argument --initial-cluster needs to be in-line with that shown in the systemd unit file.

Adding removed etcd member in Kubernetes master

2 Answers