0
votes

I was following Kelsey Hightower's kubernetes-the-hard-way repo and successfully created a cluster with 3 master nodes and 3 worker nodes. I then ran into a problem when removing one of the etcd members and adding it back. All the steps I used are listed below.

3 master nodes:
10.240.0.10 controller-0
10.240.0.11 controller-1
10.240.0.12 controller-2

Step 1:

isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list   --endpoints=https://127.0.0.1:2379   --cacert=/etc/etcd/ca.pem   --cert=/etc/etcd/kubernetes.pem   --key=/etc/etcd/kubernetes-key.pem

Result:

b28b52253c9d447e, started, controller-2, https://10.240.0.12:2380, https://10.240.0.12:2379
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379

Step 2 (Remove etcd member of controller-2):

isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member remove b28b52253c9d447e   --endpoints=https://127.0.0.1:2379   --cacert=/etc/etcd/ca.pem   --cert=/etc/etcd/kubernetes.pem   --key=/etc/etcd/kubernetes-key.pem
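
The member ID in Step 2 was copied by hand from the Step 1 listing. As a minimal sketch, assuming the default comma-separated member list output shown in Step 1, the ID could be looked up by member name instead. The function name member_id_by_name is made up for this illustration, and the sample input is pasted from Step 1 rather than read from a live cluster:

```shell
# Print the ID of the etcd member whose name equals $1.
# Expects `etcdctl member list` output (default format) on stdin.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sample input copied from the Step 1 result.
printf '%s\n' \
  'b28b52253c9d447e, started, controller-2, https://10.240.0.12:2380, https://10.240.0.12:2379' \
  'f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379' \
  'ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379' \
  | member_id_by_name controller-2
# prints: b28b52253c9d447e
```

On a live cluster, the etcdctl member list command from Step 1 would be piped into the function in place of the printf.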

Step 3 (Add the member back):

isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member add controller-2 --peer-urls=https://10.240.0.12:2380  --endpoints=https://127.0.0.1:2379   --cacert=/etc/etcd/ca.pem   --cert=/etc/etcd/kubernetes.pem   --key=/etc/etcd/kubernetes-key.pem

Result:

Member 66d450d03498eb5c added to cluster 3e7cc799faffb625

ETCD_NAME="controller-2"
ETCD_INITIAL_CLUSTER="controller-2=https://10.240.0.12:2380,controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.240.0.12:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
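
The ETCD_* values printed by member add correspond to etcd start-up flags of the same names. A small sketch of that mapping follows, with the values pasted from the output above. etcd_join_flags is a hypothetical helper name, and the TLS and data-dir flags that etcd also needs are deliberately left out:

```shell
# Values copied from the `member add` output.
ETCD_NAME="controller-2"
ETCD_INITIAL_CLUSTER="controller-2=https://10.240.0.12:2380,controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.240.0.12:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

# Print the equivalent etcd command-line flags, one per line.
etcd_join_flags() {
  printf '%s\n' \
    "--name=${ETCD_NAME}" \
    "--initial-cluster=${ETCD_INITIAL_CLUSTER}" \
    "--initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS}" \
    "--initial-cluster-state=${ETCD_INITIAL_CLUSTER_STATE}"
}

etcd_join_flags
```

Note that every peer URL in ETCD_INITIAL_CLUSTER uses https://, which is what the new member has to be started with as well.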

Step 4 (run member list command):

isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list   --endpoints=https://127.0.0.1:2379   --cacert=/etc/etcd/ca.pem   --cert=/etc/etcd/kubernetes.pem   --key=/etc/etcd/kubernetes-key.pem

Result:

66d450d03498eb5c, unstarted, , https://10.240.0.12:2380,
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379
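
A re-added member is listed as unstarted, with an empty name and no client URL, until an etcd process with the matching peer URL actually joins. A sketch for reading that state out of the listing, keyed on the peer URL because the name column is still empty; member_status_by_peer_url is an illustrative name, and the sample line is pasted from the Step 4 result:

```shell
# Print the status column of the member whose peer URL equals $1.
# Expects `etcdctl member list` output (default format) on stdin.
member_status_by_peer_url() {
  awk -F', *' -v url="$1" '$4 == url { print $2 }'
}

# Sample line copied from the Step 4 result.
printf '%s\n' '66d450d03498eb5c, unstarted, , https://10.240.0.12:2380,' \
  | member_status_by_peer_url https://10.240.0.12:2380
# prints: unstarted
```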

Step 5 (Run the command to start etcd in controller-2):

isaac@controller-2:~$ sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=http://10.240.0.10:2380,controller-1=http://10.240.0.11:2380,controller-2=http://10.240.0.12:2380 --ca-file /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem

Result:

2019-06-09 13:10:14.958799 I | etcdmain: etcd Version: 3.3.9
2019-06-09 13:10:14.959022 I | etcdmain: Git SHA: fca8add78
2019-06-09 13:10:14.959106 I | etcdmain: Go Version: go1.10.3
2019-06-09 13:10:14.959177 I | etcdmain: Go OS/Arch: linux/amd64
2019-06-09 13:10:14.959237 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2019-06-09 13:10:14.959312 W | etcdmain: no data-dir provided, using default data-dir ./controller-2.etcd
2019-06-09 13:10:14.959435 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-06-09 13:10:14.959575 C | etcdmain: cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented

Clearly, the etcd service did not start as expected, so I checked its status:

isaac@controller-2:~$ sudo systemctl status etcd

Result:

● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Sun 2019-06-09 13:06:55 UTC; 29min ago
     Docs: https://github.com/coreos
  Process: 1876 ExecStart=/usr/local/bin/etcd --name controller-2 --cert-file=/etc/etcd/kubernetes.pem --key-file=/etc/etcd/kubernetes-key.pem --peer-cert-file=/etc/etcd/kubernetes.pem --peer-key-file=/etc/etcd/kube
 Main PID: 1876 (code=exited, status=0/SUCCESS)

Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer f98dc20bce6225a0
Jun 09 13:06:55 controller-2 etcd[1876]: stopping peer ffed16798470cab5...
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped HTTP pipelining with peer ffed16798470cab5
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream MsgApp v2 reader)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream Message reader)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer ffed16798470cab5
Jun 09 13:06:55 controller-2 etcd[1876]: failed to find member f98dc20bce6225a0 in cluster 3e7cc799faffb625
Jun 09 13:06:55 controller-2 etcd[1876]: forgot to set Type=notify in systemd service file?

I tried starting the etcd member with several different commands, but etcd on controller-2 is still stuck in the unstarted state. What is the reason for that? Any pointers would be highly appreciated. Thanks!

2
Be sure to delete the existing state directory from any newly joined etcd member, as it should sync the current state of the cluster from its peers (that's what the initial-cluster and initial-cluster-state variables are for); also, you'll want to fix "cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented" as that's not a good message. - mdaniel
Thanks for the hints @MatthewLDaniel. - Isaac Wong

2 Answers

1
votes

It turned out that I could solve the problem as follows (credit to Matthew):

  1. Delete the etcd data directory with the following command:
rm -rf  /var/lib/etcd/*
  2. To fix the message "cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented", I revised the command to start etcd as follows:
sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380,controller-2=https://10.240.0.12:2380 --peer-trusted-ca-file  /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem --peer-cert-file /etc/etcd/kubernetes.pem --peer-key-file /etc/etcd/kubernetes-key.pem --data-dir /var/lib/etcd
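
Deleting the data directory cannot be undone. A more cautious sketch, which is not what was done above, moves the old contents aside so they can still be inspected or restored. It should only be run while etcd on that node is stopped. The helper name set_aside_data_dir and the .bak.<timestamp> naming are assumptions for illustration, and the demo runs against a throwaway temporary directory rather than /var/lib/etcd:

```shell
# Move everything inside data directory $1 into a timestamped sibling
# directory, leaving $1 itself in place but empty. Prints the backup path.
set_aside_data_dir() {
  dir=${1%/}
  backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
  mkdir -p "$backup" || return 1
  # find also picks up dotfiles, which a plain "$dir"/* glob would skip.
  find "$dir" -mindepth 1 -maxdepth 1 -exec mv {} "$backup"/ \;
  printf '%s\n' "$backup"
}

# Demo on a temporary directory that mimics an etcd data directory.
demo=$(mktemp -d)
mkdir -p "$demo/member/wal"
saved=$(set_aside_data_dir "$demo")
ls -A "$demo"     # prints nothing: the data directory is now empty
ls -A "$saved"    # prints: member
rm -rf "$demo" "$saved"
```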

A few points to note here:

  1. The newly added arguments --peer-cert-file and --peer-key-file present the key and certificate that controller-2 needs to listen with TLS on its peer URL (port 2380). The arguments --cert-file and --key-file, which were already in the first attempt, only cover the client URLs.
  2. The argument --peer-trusted-ca-file is also supplied so that etcd can check whether the x509 certificates presented by controller-0 and controller-1 are signed by a known CA. Without it, the following error may be encountered: etcdserver: could not get cluster response from https://10.240.0.11:2380: Get https://10.240.0.11:2380/members: x509: certificate signed by unknown authority
  3. The value of the argument --initial-cluster must match the one in the systemd unit file.
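
One visible difference between the first attempt in the question and the working command is that the first --initial-cluster value used http:// peer URLs while the cluster advertises https://. A rough sketch of a check for that, using plain text processing only; non_https_peers is a hypothetical helper name, and the sample value is the one from the first attempt:

```shell
# Print every entry of a comma-separated --initial-cluster value whose
# peer URL does not start with https://. Returns 1 if any were found.
non_https_peers() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F= '
    BEGIN { bad = 0 }
    $2 !~ /^https:\/\// { print; bad = 1 }
    END { exit bad }'
}

# --initial-cluster value from the first attempt in the question.
first_attempt='controller-0=http://10.240.0.10:2380,controller-1=http://10.240.0.11:2380,controller-2=http://10.240.0.12:2380'
non_https_peers "$first_attempt" || echo "non-https peer URLs found"
```

With the value from the working command the function prints nothing and returns 0.
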
0
votes

If you are re-adding a member on a cluster that was set up with kubeadm, an easier solution is the following:

rm -rf  /var/lib/etcd/*
kubeadm join phase control-plane-join etcd --control-plane