0 votes

Assume that there are some Pods from Deployments/StatefulSets/DaemonSets, etc. running on a Kubernetes node.

Then I restart the node directly, and afterwards start Docker and kubelet with the same parameters.

What would happen to those pods?

  1. Are they recreated with metadata saved locally by kubelet? Or with info retrieved from the api-server? Or recovered by the OCI runtime as if nothing happened?
  2. Is it that only stateless Pods (with no local data) can be recovered normally? If any of them has a local PV/dir, would it be reconnected normally?
  3. What if I don't restart the node for a long time? Would the api-server have other nodes create those Pods? What is the default timeout value, and how can I configure it?

As far as I know:

 apiserver
    ^
    |(sync)
    V
  kubelet
    ^
    |(sync)
    V
-------------
| CRI plugin |(like api)
| containerd |(like api-server)
|    runc    |(low-level binary which manages container)
| c' runtime |(container runtime where containers run)
-------------

When kubelet receives a PodSpec from the kube-api-server, it calls the CRI like a remote service; the steps are roughly:

  1. create the PodSandbox (a.k.a. the 'pause' container, always 'stopped')
  2. create container(s)
  3. run container(s)

So my guess is that when the node and Docker are restarted, steps 1 and 2 are already done and the containers are in a 'stopped' state; then, when kubelet is restarted, it pulls the latest info from the kube-api-server, finds that the container(s) are not 'running', and calls the CRI to run the container(s), after which everything is back to normal.
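To check this guess I plan to inspect the node right after the reboot, before and after kubelet comes back. These are plain Docker commands; the k8s_ name prefixes are how dockershim names the containers it creates, so the details may differ on other runtimes:

# right after the reboot, before kubelet starts: everything should be Exited
docker ps -a --format '{{.Names}}\t{{.Status}}'

# the sandbox ('pause') containers kubelet created before the reboot
docker ps -a --filter name=k8s_POD

# after kubelet is started again: the same Pods' containers should be running
# (whether they were restarted or recreated is exactly what I want to confirm)
docker ps --filter name=k8s_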

Please help me confirm.

Thank you in advance~

Comment from Debargha Roy (3 upvotes): Never used Kubernetes, but from the amount of exposure I have with Docker, your node should start normally since, as you say, it's a restart - essentially a terminate followed by an initiate operation. The condition should be the same for both stateful and stateless Pods. The answer to your 3rd question should also be yes, because that's what Kubernetes is for.

3 Answers

3 votes

Good questions. A few things first: a Pod is not pinned to a certain node. The nodes are mostly seen as a "server farm" that Kubernetes can use to run its workload. E.g. you give Kubernetes a set of nodes, and you also give it a set of e.g. Deployments - the desired state of the applications that should run on your servers. Kubernetes is responsible for scheduling these Pods and for keeping them running when something in the cluster changes.

Standalone Pods are not managed by anything, so if such a Pod crashes it is not recovered. You typically want to deploy your stateless apps as Deployments, which then create ReplicaSets that manage a set of Pods - e.g. 4 Pods - instances of your app.

Your desired state - a Deployment with e.g. replicas: 4 - is saved in the etcd database within the Kubernetes control plane.
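For example, a minimal Deployment with that desired state could look like this (all names and the image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4              # desired state: keep 4 Pods of this app running
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:1.0  # placeholder image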

Then a set of controllers for Deployments and ReplicaSets is responsible for keeping 4 replicas of your app alive. E.g. if a node becomes unresponsive (or dies), new Pods will be created on other nodes, provided they are managed by a ReplicaSet controller.

The kubelet receives the PodSpecs that are scheduled to its node, and then keeps those Pods alive with regular health checks.
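Those health checks are the probes you declare in the Pod spec (the kubelet also restarts crashed containers according to the Pod's restartPolicy even without probes). A sketch, where the image, path and port are placeholders:

containers:
- name: my-app
  image: my-app:1.0      # placeholder image
  livenessProbe:         # kubelet restarts the container when this fails
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
  readinessProbe:        # kubelet marks the Pod not Ready when this fails
    httpGet:
      path: /ready
      port: 8080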

Is it that only stateless Pods (with no local data) can be recovered normally?

Pods should be seen as ephemeral - i.e. they can disappear - but they are recovered by the controller that manages them, unless deployed as standalone Pods. So don't store local data within the Pod.

There are also StatefulSet Pods; those are meant for stateful workloads - but distributed stateful workloads, typically e.g. 3 Pods that use Raft to replicate data. The etcd database is an example of a distributed database that uses Raft.
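A minimal StatefulSet sketch (names, image and sizes are placeholders) - the points that differ from a Deployment are the stable Pod identity via the headless Service and the per-Pod volume claims:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db
spec:
  serviceName: my-db         # headless Service that gives the Pods stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
    spec:
      containers:
      - name: my-db
        image: my-db:1.0     # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/my-db
  volumeClaimTemplates:      # each Pod gets its own PersistentVolumeClaim
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi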

1 vote

The correct answer: it depends.

Imagine you've got a 3-node cluster, where you created a Deployment with 3 replicas and 3-5 standalone Pods. The Pods are created and scheduled to nodes.
Everything is up and running.

Let's assume that worker node node1 has 1 Deployment replica and 1 or more standalone Pods.

The general sequence of the node restart process is as follows:

  1. The node gets restarted, for ex. using sudo reboot
  2. After restart, the node starts all OS processes in the order specified by systemd dependencies
  3. When dockerd is started, it does nothing by itself. At this point all the previous containers are in the Exited state.
  4. When kubelet is started, it asks the cluster apiserver for the list of Pods whose node property equals its node name.
  5. After getting the reply from the apiserver, kubelet starts containers for all the Pods described in that reply, using the Docker CRI.
  6. When the pause container starts for each Pod in the list, it gets a new IP address configured by the CNI binary, which is deployed by the network addon DaemonSet's Pod.
  7. After the kube-proxy Pod is started on the node, it updates the iptables rules to implement the desired Kubernetes Services configuration, taking into account the new Pods' IP addresses.
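You can verify steps 5-7 from the control plane side, e.g. (node1 is the example node from above):

# Pods scheduled to node1, with their RESTARTS counts and (new) IP addresses
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node1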

Now things become a bit more complicated.

Depending on the apiserver, kube-controller-manager and kubelet configuration, they react to the fact that the node is not responding with some delay.

If the node restarts fast enough, kube-controller-manager doesn't evict the Pods and they all remain scheduled on the same node, increasing their RESTARTS count once their new containers become Ready.
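You can observe that delay directly; on clusters with taint-based evictions enabled (version-dependent) the node controller also taints the unresponsive node (node1 is the example node from above):

# watch the node condition flip to NotReady after the grace period
kubectl get nodes -w

# the NoExecute taints that trigger eviction once the Pods' tolerationSeconds expire
kubectl describe node node1 | grep -A3 Taints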

Example 1.

The cluster is created using kubeadm with the Flannel network addon, on an Ubuntu 18.04 VM in GCP.
Kubernetes version is v1.18.8
Docker version is 19.03.12

After the node is restarted, all Pods' containers are started on the node with new IP addresses. Pods keep their names and location.

If the node is stopped for a long time, the Pods on that node stay in the Running state, but connection attempts obviously time out.

If the node remains stopped, after approximately 5 minutes the Pods scheduled on that node are evicted by kube-controller-manager and terminated. If I start the node before that eviction, all Pods remain on the node.

In case of eviction, standalone Pods disappear forever; Deployments and similar controllers create the necessary number of Pods to replace the evicted ones, and kube-scheduler puts them on appropriate nodes. If a new Pod can't be scheduled on another node, e.g. due to a lack of required volumes, it remains in the Pending state until the scheduling requirements are satisfied.

On a cluster created using an Ubuntu 18.04 Vagrant box and the VirtualBox hypervisor, with a host-only adapter dedicated to Kubernetes networking, Pods on a stopped node remain in the Running state (but with Readiness: false) even after two hours, and are never evicted. After starting the node 2 hours later, all containers were restarted successfully.
This configuration behaves the same all the way from Kubernetes v1.7 to the latest v1.19.2.

Example 2.

The cluster is created in Google Cloud (GKE) with the default kubenet network addon.
Kubernetes version is 1.15.12-gke.20. Node OS is Container-Optimized OS (cos).

After the node is restarted (it takes around 15-20 seconds), all Pods are started on the node with new IP addresses. Pods keep their names and location (same as in Example 1).

If the node is stopped, after a short period of time (T1, around 30-60 seconds) all Pods on the node change status to Terminating. A couple of minutes later they disappear from the Pod list. Pods managed by a Deployment are rescheduled on other nodes with new names and IP addresses.

If the node pool is created with Ubuntu nodes, the apiserver terminates the Pods later: T1 is around 2-3 minutes.


The examples show that the situation after a worker node is restarted differs between clusters, so it's better to run the experiment on your specific cluster to check whether you get the expected results.

How to configure those timeouts:
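A sketch of the main knobs (the defaults shown are the common ones, but they are version-dependent, so check your cluster's components before relying on them):

# kubelet: how often the node posts its status to the apiserver
--node-status-update-frequency=10s

# kube-controller-manager: how long the node may stay silent before it is marked NotReady
--node-monitor-grace-period=40s

# kube-controller-manager: how long after NotReady before Pods are evicted
# (on clusters with taint-based evictions this is effectively replaced by the tolerations below)
--pod-eviction-timeout=5m0s

# kube-apiserver: default tolerationSeconds injected into every Pod for the
# not-ready/unreachable node taints
--default-not-ready-toleration-seconds=300
--default-unreachable-toleration-seconds=300

Individual Pods can override the last two with their own tolerations, e.g. to be evicted 30 seconds after their node becomes unreachable:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 30
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 30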

-1 votes

When the node is restarted and there are Pods scheduled on it that are managed by a Deployment or ReplicaSet, those controllers will take care of scheduling the desired number of replicas on another, healthy node. So if you have 2 replicas running on the restarted node, they will be terminated and scheduled on another node.

Before restarting a node you should use kubectl cordon to mark the node as unschedulable and give Kubernetes time to reschedule the Pods.
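A sketch of the usual sequence (node1 is the example node name used above); note that cordon only prevents new Pods from being scheduled there, while drain additionally evicts the Pods that are already running:

# stop new Pods from being scheduled onto the node
kubectl cordon node1

# evict the Pods already running there (DaemonSet Pods are skipped)
kubectl drain node1 --ignore-daemonsets

# ...reboot the node...

# make the node schedulable again afterwards
kubectl uncordon node1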

Standalone Pods (not managed by any controller) will not be rescheduled on another node; they will just be terminated.