8
votes

The Kubernetes docs on https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/ state:

The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.

Does Kubernetes consider the current state of the node when calculating capacity? To highlight what I mean, here is a concrete example:

Assume I have a node with 10Gi of RAM running 10 Pods, each with a 500Mi memory request and no limits. Say they are "bursting" and each Pod is actually using 1Gi of RAM. In this case the node is fully utilized (10 x 1Gi = 10Gi), but the resource requests total only 10 x 500Mi, about 5Gi. Would Kubernetes consider scheduling another pod on this node because only about 50% of the node's memory capacity has been requested, or would it take into account that 100% of the memory is currently in use and the node is at full capacity?

3 Answers

9
votes

By default, Kubernetes uses cgroups to manage and monitor the "allocatable" memory on a node for pods. It is possible, though, to configure the kubelet to rely entirely on the static reservations and the pod requests from your deployments, so the method depends on how your cluster is deployed.

In either case, a node itself will track "memory pressure", which monitors the existing overall memory usage of a node. If a node is under memory pressure then no new pods will be scheduled and existing pods will be evicted.

It's best to set sensible memory requests and limits for all workloads to help the scheduler as much as possible. If a Kubernetes deployment does not configure cgroup memory monitoring, setting requests is a requirement for all workloads. If the deployment is using cgroup memory monitoring, setting requests at least gives the scheduler extra detail as to whether the pods to be scheduled should fit on a node.

Capacity and Allocatable Resources

The Kubernetes Reserve Compute Resources docco has a good overview of how memory is viewed on a node.

      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------

The default scheduler checks that a node isn't under memory pressure, then looks at the allocatable memory available on the node and whether the new pod's requests will fit in it.

The allocatable memory available is the total-available-memory - kube-reserved - system-reserved - eviction-threshold - scheduled-pods.
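As a rough illustration of the formula above, the arithmetic can be sketched in shell. The helper names `remaining_allocatable_mi` and `request_fits` are hypothetical, and the sample figures (the 10Gi node with ten 500Mi requests from the question, no reservations, a 100Mi eviction threshold) are assumptions for illustration. This is not kubelet or scheduler code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: all values are plain integers in Mi.
# remaining = capacity - kube-reserved - system-reserved
#             - eviction-threshold - scheduled-pods
remaining_allocatable_mi() {
    local capacity=$1 kube_reserved=$2 system_reserved=$3
    local eviction_threshold=$4 scheduled=$5
    echo $(( capacity - kube_reserved - system_reserved - eviction_threshold - scheduled ))
}

# Hypothetical helper: succeeds when a new pod's request fits in the remainder.
request_fits() {
    local request=$1 remaining=$2
    [ "$request" -le "$remaining" ]
}

# Assumed figures: 10Gi node (10240Mi), no reservations, 100Mi eviction
# threshold, ten pods requesting 500Mi each (5000Mi scheduled).
remaining=$(remaining_allocatable_mi 10240 0 0 100 5000)
echo "remaining allocatable: ${remaining}Mi"

if request_fits 500 "$remaining"; then
    echo "a 500Mi request fits when only requests are counted"
fi
```

With these assumed numbers the remainder is 5140Mi, so another 500Mi request fits on paper regardless of how much memory the running pods are actually using.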

Scheduled Pods

The value for scheduled-pods can be calculated via a dynamic cgroup, or statically via the pods' resource requests.

The kubelet --cgroups-per-qos option, which defaults to true, enables cgroup tracking of scheduled pods. The pods kubernetes runs will be in

If --cgroups-per-qos=false then the allocatable memory will only be reduced by the resource requests of the pods scheduled on a node.

Eviction Threshold

The eviction-threshold is the level of free memory at which Kubernetes starts evicting pods. This defaults to 100Mi but can be set via the kubelet command line. This setting is tied to both the allocatable value for a node and the memory pressure state of a node, described under Memory Usage and Pressure below.

System Reserved

The kubelet's system-reserved value can be configured as a static value (--system-reserved=) or monitored dynamically via a cgroup (--system-reserved-cgroup=). This is for any system daemons running outside of Kubernetes (sshd, systemd, etc.). If you configure a cgroup, the processes all need to be placed in that cgroup.

Kube Reserved

The kubelet's kube-reserved value can be configured as a static value (via --kube-reserved=) or monitored dynamically via a cgroup (--kube-reserved-cgroup=). This is for any Kubernetes services running outside of Kubernetes, usually the kubelet and a container runtime.

Capacity and Availability on a Node

Capacity is stored in the Node object.

$ kubectl get node node01 -o json | jq '.status.capacity'
{
  "cpu": "2",
  "ephemeral-storage": "61252420Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "4042284Ki",
  "pods": "110"
}

The allocatable value can also be found on the Node. Note that existing usage doesn't change this value; only scheduling pods with resource requests takes away from the allocatable value.

$ kubectl get node node01 -o json | jq '.status.allocatable'
{
  "cpu": "2",
  "ephemeral-storage": "56450230179",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "3939884Ki",
  "pods": "110"
}
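
To see how the two outputs relate, the memory values can be subtracted directly. This is only a sketch over the figures printed above. The reading that the gap is the eviction threshold alone (with no kube-reserved or system-reserved set on this node) is an assumption, not something the output states.

```shell
#!/usr/bin/env bash
# Memory values copied from the capacity and allocatable outputs, in Ki.
capacity_ki=4042284
allocatable_ki=3939884

# Difference between what the node has and what pods may request.
reserved_ki=$(( capacity_ki - allocatable_ki ))
reserved_mi=$(( reserved_ki / 1024 ))

echo "capacity - allocatable = ${reserved_ki}Ki (${reserved_mi}Mi)"
```

The gap works out to 102400Ki, which is 100Mi, the same size as the default eviction threshold described in the Eviction Threshold section.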

Memory Usage and Pressure

A kube node can also have a "memory pressure" event. This check is done outside of the allocatable resource checks above and is more of a system-level catch-all. Memory pressure looks at the current root cgroup memory usage minus the inactive file cache/buffers, similar to the calculation free does to remove the file cache.

A node under memory pressure will not have pods scheduled, and will actively try to evict existing pods until the memory pressure state is resolved.

You can set the amount of available memory the kubelet will maintain via the --eviction-hard=[memory.available<500Mi] flag. The memory requests and usage of pods help inform the eviction process.
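
The threshold comparison itself is simple and can be sketched as follows. The helper name `under_memory_pressure` and the 4096Mi node size are made up for illustration; the 500Mi threshold mirrors the flag example above. The real kubelet derives the working set from cgroup statistics rather than taking it as an argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper, values in Mi: succeeds when available memory
# (capacity minus working set) has dropped below the eviction threshold.
under_memory_pressure() {
    local capacity=$1 working_set=$2 threshold=$3
    local available=$(( capacity - working_set ))
    [ "$available" -lt "$threshold" ]
}

# Assumed figures: a 4096Mi node with a 500Mi threshold.
for working_set in 865 3700; do
    if under_memory_pressure 4096 "$working_set" 500; then
        echo "working set ${working_set}Mi: below threshold, eviction would start"
    else
        echo "working set ${working_set}Mi: ok"
    fi
done
```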

kubectl top node will give you the existing memory stats for each node (if you have a metrics service running).

$ kubectl top node
NAME                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
node01               141m         7%     865Mi           22%       

If you were not using cgroups-per-qos and ran a number of pods without resource limits, or a number of system daemons, the cluster is likely to have problems scheduling on a memory-constrained system, because allocatable will be high while the memory actually available might be very low.

Memory Pressure calculation

The Kubernetes Out Of Resource Handling docco includes a script which emulates the kubelet's memory monitoring process:

#!/bin/bash
# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.
# It reads the cgroup v1 paths under /sys/fs/cgroup/memory.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"

1
vote

Definitely yes: Kubernetes considers memory usage during the pod scheduling process.

The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.

There are two key concepts in scheduling. First, the scheduler filters the nodes that are capable of running a given pod based on resource requests and other scheduling requirements. Second, the scheduler weighs the eligible nodes based on absolute and relative resource utilization of the nodes and other factors. The highest-weighted eligible node is selected for scheduling of the pod. You can find a good explanation of scheduling in Kubernetes here: kubernetes-scheduling.

A simple example: your pod normally uses 100 Mi of RAM but you run it with a 50 Mi request. If you have a node with 75 Mi free, the scheduler may choose to run the pod there. When the pod's memory consumption later grows to 100 Mi, this puts the node under pressure, at which point the kernel may choose to kill your process. So it is important to get both memory requests and memory limits right. You can read more about memory usage, requests and limits here: memory-resource.
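
The numbers in that example can be worked through in a few lines of shell. This is only an arithmetic sketch of the scenario; the variable names are made up and nothing here talks to a cluster.

```shell
#!/usr/bin/env bash
# Figures from the example, in Mi.
node_free=75
pod_request=50
pod_usage=100

# The placement decision compares the request, not the eventual usage.
if [ "$pod_request" -le "$node_free" ]; then
    echo "scheduled: request ${pod_request}Mi <= free ${node_free}Mi"
fi

# Once the pod grows to its real usage, the node is short by this much.
shortfall=$(( pod_usage - node_free ))
if [ "$shortfall" -gt 0 ]; then
    echo "later: usage ${pod_usage}Mi exceeds free memory by ${shortfall}Mi"
fi
```

The request passes the check, but the pod's real usage leaves the node 25 Mi short, which is where the memory pressure comes from.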

A container can exceed its memory request if the node has memory available. But a container is not allowed to use more than its memory limit. If a container allocates more memory than its limit, the container becomes a candidate for termination. If the container continues to consume memory beyond its limit, the container is terminated. If a terminated container can be restarted the kubelet restarts it, as with any other type of runtime failure.
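
For reference, a request and a limit are both set under a container's resources field in the Pod spec. The sketch below prints a minimal manifest from a shell function. The function name `pod_manifest`, the pod name, the image and the sizes are all placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a minimal Pod manifest with a memory request
# and limit. The pod name, image and sizes are placeholders.
pod_manifest() {
    local name=$1 request=$2 limit=$3
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "${request}"
      limits:
        memory: "${limit}"
EOF
}

pod_manifest memory-demo 500Mi 1Gi
```

On a real cluster the output could be piped to `kubectl apply -f -`. Here the container may use up to 1Gi, while only the 500Mi request counts toward the scheduling check.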

I hope this helps.

0
votes

Yes, Kubernetes will consider current memory usage when scheduling Pods (not just requests), so your new Pod wouldn't get scheduled on the full node. Of course, there are also a number of other factors.

(FWIW, when it comes to resources, a request helps the scheduler by declaring a baseline value, and a limit causes the Pod to be killed when its usage exceeds the limit, which helps with capacity planning/estimation.)