By default Kubernetes uses cgroups to manage and monitor the "allocatable" memory on a node for pods. It is possible to configure the kubelet
to rely entirely on static reservations and the pod requests from your deployments though, so the method depends on your cluster deployment.
In either case, a node itself will track "memory pressure", which monitors the overall memory usage of a node. If a node is under memory pressure then no new pods will be scheduled there and existing pods will be evicted.
It's best to set sensible memory requests and limits for all workloads to help the scheduler as much as possible.
If a Kubernetes deployment does not configure cgroup memory monitoring, setting requests is a requirement for all workloads.
If the deployment is using cgroup memory monitoring, setting at least the requests gives the scheduler extra detail as to whether the pods to be scheduled will fit on a node.
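As an illustrative sketch (the pod name, image and values here are all hypothetical), a manifest that sets both a memory request and a limit looks like this:

```shell
# Write a minimal pod manifest with memory requests and limits set.
# All names and values below are illustrative.
cat <<'EOF' > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "128Mi"   # counted against the node's allocatable memory
      limits:
        memory: "256Mi"   # the container is OOM-killed above this
EOF
```

Apply it with kubectl apply -f pod.yaml; the scheduler then counts the 128Mi request against the node's allocatable memory.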
Capacity and Allocatable Resources
The Kubernetes Reserve Compute Resources docco has a good overview of how memory is viewed on a node.
Node Capacity
---------------------------
| kube-reserved |
|-------------------------|
| system-reserved |
|-------------------------|
| eviction-threshold |
|-------------------------|
| |
| allocatable |
| (available for pods) |
| |
| |
---------------------------
The default scheduler checks that a node isn't under memory pressure, then looks at the allocatable memory available on the node and whether the new pod's requests will fit within it.
The allocatable memory available is total-available-memory - kube-reserved - system-reserved - eviction-threshold - scheduled-pods.
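Plugging some hypothetical numbers into that formula (every value below is made up purely for illustration):

```shell
# Hypothetical node, all values in MiB.
capacity=8192            # total node memory
kube_reserved=512
system_reserved=512
eviction_threshold=100
scheduled_pods=2048      # sum of memory requests of pods already scheduled

allocatable=$((capacity - kube_reserved - system_reserved - eviction_threshold))
free_for_scheduling=$((allocatable - scheduled_pods))
echo "allocatable: ${allocatable}Mi"                # 7068Mi
echo "free for new pods: ${free_for_scheduling}Mi"  # 5020Mi
```

A new pod whose memory request fits within the remaining amount passes this scheduling check.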
Scheduled Pods
The value for scheduled-pods can be calculated via a dynamic cgroup, or statically via the pods' resource requests.
The kubelet --cgroups-per-qos option, which defaults to true, enables cgroup tracking of scheduled pods. The pods Kubernetes runs will be placed in a cgroup hierarchy under kubepods.
If --cgroups-per-qos=false then the allocatable memory will only be reduced by the resource requests of the pods scheduled on a node.
Eviction Threshold
The eviction-threshold is the level of free memory at which Kubernetes starts evicting pods. This defaults to 100Mi but can be set via the kubelet command line. This setting is tied to both the allocatable value for a node and also the memory pressure state of a node, covered in a later section.
System Reserved
The kubelet's system-reserved value can be configured as a static value (--system-reserved=) or monitored dynamically via a cgroup (--system-reserved-cgroup=).
This is for any system daemons running outside of Kubernetes (sshd, systemd etc). If you configure a cgroup, the processes all need to be placed in that cgroup.
Kube Reserved
The kubelet's kube-reserved value can be configured as a static value (via --kube-reserved=) or monitored dynamically via a cgroup (--kube-reserved-cgroup=).
This is for the Kubernetes components running outside of Kubernetes' own pods, usually the kubelet and a container runtime.
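As a hedged sketch, both reservations (plus the eviction threshold from earlier) can be set statically on the kubelet command line; the values here are illustrative only:

```shell
# Illustrative static reservations; with these flags,
# allocatable = capacity - 500Mi - 200Mi - 100Mi.
kubelet \
  --kube-reserved=memory=500Mi \
  --system-reserved=memory=200Mi \
  --eviction-hard='memory.available<100Mi'
```

Note the eviction-hard value is quoted so the shell doesn't treat < as a redirection.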
Capacity and Availability on a Node
Capacity is stored in the Node object.
$ kubectl get node node01 -o json | jq '.status.capacity'
{
"cpu": "2",
"ephemeral-storage": "61252420Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "4042284Ki",
"pods": "110"
}
The allocatable value can also be found on the Node; note that existing usage doesn't change this value. Only scheduling pods with resource requests will take away from the allocatable
value.
$ kubectl get node node01 -o json | jq '.status.allocatable'
{
"cpu": "2",
"ephemeral-storage": "56450230179",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "3939884Ki",
"pods": "110"
}
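The gap between the two values above can be checked by hand. On this node there appear to be no kube-reserved or system-reserved settings configured, so the difference works out to exactly the default 100Mi eviction threshold:

```shell
# Values in Ki taken from the two kubectl outputs above.
capacity=4042284
allocatable=3939884
reserved=$((capacity - allocatable))
echo "${reserved}Ki"           # 102400Ki
echo "$((reserved / 1024))Mi"  # 100Mi, the default eviction threshold
```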
Memory Usage and Pressure
A kube node can also have a "memory pressure" event. This check is done outside of the allocatable resource checks above and is more of a system-level catch-all. Memory pressure looks at the current root cgroup memory usage minus the inactive file cache/buffers, similar to the calculation free
does to exclude the file cache.
A node under memory pressure will not have pods scheduled, and will actively try and evict existing pods until the memory pressure state is resolved.
You can set the amount of memory the kubelet will keep available via the --eviction-hard=[memory.available<500Mi]
flag. The memory requests and usage for pods can help inform the eviction process.
kubectl top node will give you the existing memory stats for each node (if you have a metrics service running).
$ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
node01 141m 7% 865Mi 22%
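The MEMORY% column can be reproduced from the numbers above, assuming it reports usage relative to the allocatable value (865Mi against the 3939884Ki shown earlier):

```shell
# Usage and allocatable values from the commands above.
usage_mi=865
allocatable_ki=3939884
pct=$((usage_mi * 1024 * 100 / allocatable_ki))
echo "${pct}%"   # 22%, matching the MEMORY% column
```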
If you are not using cgroups-per-qos
and have a number of pods without resource requests, or a number of system daemons, then the cluster is likely to have problems scheduling on a memory-constrained system, as allocatable will be high but the actual available memory might be really low.
Memory Pressure Calculation
The Kubernetes Out Of Resource Handling docco includes a script which emulates the kubelet's memory monitoring process:
# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.
# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')
memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ]; then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi
memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))
echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"