By default Kubernetes uses cgroups to manage and monitor the "allocatable" memory on a node for pods. It is possible to configure the kubelet
to rely entirely on static reservations and the pod requests from your deployments though, so the method depends on your cluster deployment.
In either case, a node itself will track "memory pressure", which monitors the overall memory usage of a node. If a node is under memory pressure then no new pods will be scheduled there and existing pods will be evicted.
It's best to set sensible memory requests and limits for all workloads to help the scheduler as much as possible.
If a Kubernetes deployment does not configure cgroup memory monitoring, setting requests is a requirement for all workloads.
If the deployment is using cgroup memory monitoring, setting at least the requests gives the scheduler extra detail as to whether the pods to be scheduled will fit on a node.
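As an illustrative sketch (the pod name, image and values here are all hypothetical), a manifest that sets both a memory request and a limit looks like this:

```shell
# Write a minimal pod manifest with memory requests and limits set.
# All names and values below are illustrative.
cat <<'EOF' > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "128Mi"   # counted against the node's allocatable memory
      limits:
        memory: "256Mi"   # the container is OOM-killed above this
EOF
```

Apply it with kubectl apply -f pod.yaml; the scheduler then counts the 128Mi request against the node's allocatable memory.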
Capacity and Allocatable Resources
The Kubernetes Reserve Compute Resources docco has a good overview of how memory is viewed on a node.
Node Capacity
---------------------------
| kube-reserved |
|-------------------------|
| system-reserved |
|-------------------------|
| eviction-threshold |
|-------------------------|
| |
| allocatable |
| (available for pods) |
| |
| |
---------------------------
The default scheduler checks that a node isn't under memory pressure, then looks at the allocatable memory available on the node and whether the new pod's requests will fit within it.
The allocatable memory available is total-available-memory - kube-reserved - system-reserved - eviction-threshold - scheduled-pods.
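Plugging some hypothetical numbers into that formula (every value below is made up purely for illustration):

```shell
# Hypothetical node, all values in MiB.
capacity=8192            # total node memory
kube_reserved=512
system_reserved=512
eviction_threshold=100
scheduled_pods=2048      # sum of memory requests of pods already scheduled

allocatable=$((capacity - kube_reserved - system_reserved - eviction_threshold))
free_for_scheduling=$((allocatable - scheduled_pods))
echo "allocatable: ${allocatable}Mi"                # 7068Mi
echo "free for new pods: ${free_for_scheduling}Mi"  # 5020Mi
```

A new pod whose memory request fits within the remaining amount passes this scheduling check.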
Scheduled Pods
The value for scheduled-pods can be calculated via a dynamic cgroup, or statically via the pods' resource requests.
The kubelet --cgroups-per-qos option, which defaults to true, enables cgroup tracking of scheduled pods. The pods Kubernetes runs will be placed in a cgroup hierarchy under kubepods.
If --cgroups-per-qos=false then the allocatable memory will only be reduced by the resource requests of the pods scheduled on a node.
Eviction Threshold
The eviction-threshold is the level of free memory at which Kubernetes starts evicting pods. This defaults to 100Mi but can be set via the kubelet command line. This setting is tied to both the allocatable value for a node and also the memory pressure state of a node, covered in a later section.
System Reserved
The kubelet's system-reserved value can be configured as a static value (--system-reserved=) or monitored dynamically via a cgroup (--system-reserved-cgroup=).
This is for any system daemons running outside of Kubernetes (sshd, systemd etc). If you configure a cgroup, the processes all need to be placed in that cgroup.
Kube Reserved
The kubelet's kube-reserved value can be configured as a static value (via --kube-reserved=) or monitored dynamically via a cgroup (--kube-reserved-cgroup=).
This is for the Kubernetes components running outside of Kubernetes' own pods, usually the kubelet and a container runtime.
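As a hedged sketch, both reservations (plus the eviction threshold from earlier) can be set statically on the kubelet command line; the values here are illustrative only:

```shell
# Illustrative static reservations; with these flags,
# allocatable = capacity - 500Mi - 200Mi - 100Mi.
kubelet \
  --kube-reserved=memory=500Mi \
  --system-reserved=memory=200Mi \
  --eviction-hard='memory.available<100Mi'
```

Note the eviction-hard value is quoted so the shell doesn't treat < as a redirection.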
Capacity and Availability on a Node
Capacity is stored in the Node object.
$ kubectl get node node01 -o json | jq '.status.capacity'
{
"cpu": "2",
"ephemeral-storage": "61252420Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "4042284Ki",
"pods": "110"
}
The allocatable value can also be found on the Node; note that existing usage doesn't change this value. Only scheduling pods with resource requests will take away from the allocatable
value.
$ kubectl get node node01 -o json | jq '.status.allocatable'
{
"cpu": "2",
"ephemeral-storage": "56450230179",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "3939884Ki",
"pods": "110"
}
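The gap between the two values above can be checked by hand. On this node there appear to be no kube-reserved or system-reserved settings configured, so the difference works out to exactly the default 100Mi eviction threshold:

```shell
# Values in Ki taken from the two kubectl outputs above.
capacity=4042284
allocatable=3939884
reserved=$((capacity - allocatable))
echo "${reserved}Ki"           # 102400Ki
echo "$((reserved / 1024))Mi"  # 100Mi, the default eviction threshold
```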
Memory Usage and Pressure
A kube node can also have a "memory pressure" event. This check is done outside of the allocatable resource checks above and is more of a system-level catch-all. Memory pressure looks at the current root cgroup memory usage minus the inactive file cache/buffers, similar to the calculation free
does to exclude the file cache.
A node under memory pressure will not have pods scheduled, and will actively try and evict existing pods until the memory pressure state is resolved.
You can set the amount of memory the kubelet will keep available via the --eviction-hard=[memory.available<500Mi]
flag. The memory requests and usage for pods can help inform the eviction process.
kubectl top node will give you the existing memory stats for each node (if you have a metrics service running).
$ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
node01 141m 7% 865Mi 22%
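The MEMORY% column can be reproduced from the numbers above, assuming it reports usage relative to the allocatable value (865Mi against the 3939884Ki shown earlier):

```shell
# Usage and allocatable values from the commands above.
usage_mi=865
allocatable_ki=3939884
pct=$((usage_mi * 1024 * 100 / allocatable_ki))
echo "${pct}%"   # 22%, matching the MEMORY% column
```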
If you are not using cgroups-per-qos
and have a number of pods without resource requests, or a number of system daemons, then the cluster is likely to have problems scheduling on a memory-constrained system, as allocatable will be high but the actual available memory might be really low.
Memory Pressure Calculation
The Kubernetes Out Of Resource Handling docco includes a script which emulates the kubelet's memory monitoring process:
# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.
# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')
memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ]; then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi
memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))
echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"