8 votes

I have many services. In a day, a few of them are busy for about ten hours, while most of the others are idle or use only a small amount of CPU.

In the past, I put all services on virtual machines with two CPUs each and scaled by CPU usage; at the busiest times there were two virtual machines, but most of the time there was only one.

| services          | instances | busy time in a day | CPU when busy (core/service) | CPU when idle (core/service) |
|-------------------|-----------|--------------------|------------------------------|------------------------------|
| busy services     | 2         | 8~12 hours         | 0.5~1                        | 0.1~0.5                      |
| busy services     | 2         | 8~12 hours         | 0.3~0.8                      | 0.1~0.3                      |
| inactive services | 30        | 0~1 hours          | 0.1~0.3                      | < 0.1                        |

Now I want to move them to Kubernetes, where each node has two CPUs, and use node autoscaling and HPA. To make node autoscaling work, I must set a CPU request for every service, and that is exactly where I ran into difficulty.

These are my settings:

| services          | instances | busy time  | requests CPU (per service) | total requests CPU |
|-------------------|-----------|------------|----------------------------|--------------------|
| busy services     | 2         | 8~12 hours | 300m                       | 600m               |
| busy services     | 2         | 8~12 hours | 300m                       | 600m               |
| inactive services | 30        | 0~1 hours  | 100m                       | 3000m              |

Note: The CPU request for the inactive services is set to 100m because they do not work well with less than 100m when they are busy.
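
For illustration, the container resources of each inactive service look roughly like this (the service name and image are placeholders; the value comes from the table above):

```yaml
# Placeholder container spec for one inactive service;
# 30 such services add up to 3000m of requested CPU.
containers:
- name: inactive-service                    # placeholder name
  image: example/inactive-service:latest    # placeholder image
  resources:
    requests:
      cpu: 100m
```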

With these settings, the number of nodes is always greater than three, which is too costly. I think the problem is that although these services need 100m of CPU to work properly, they are idle most of the time.

I really hope that all services can autoscale; I think this is the benefit of Kubernetes, that it can schedule pods more flexibly. Is my idea wrong? Shouldn't I set a CPU request for the inactive services?

Even if I ignore the inactive services, I find that Kubernetes often runs more than two nodes. If I have more busy services, the requested CPU will exceed 2000m even during off-peak hours. Is there any solution?

How about doing this: analyse the situation with kubectl top pods -A so you know how much each pod is actually consuming (1 core = 1000m). Then set the requests a bit higher than the kubectl top pods output, which in some cases will be lower than 100m, and set high limits. Pod resource usage will then vary within the range between requests (guaranteed) and limits (hard cap). Based on the output you know where you can give less than 100m, i.e. where a pod is not CPU intensive at certain times of the day. You may be able to overcommit limits if you have planned requests and limits well. – laimison
Setting smaller requests for the inactive services does improve things, but based on other answers I see a potential problem: when these services get busy, the nodes may not scale out. I don't have a better idea at the moment. – Fulo Lin

1 Answer

6 votes

> I put all services on virtual machines with two CPUs each and scaled by CPU usage; at the busiest times there were two virtual machines, but most of the time there was only one.

First, if you have any availability requirements, I would recommend always having at least two nodes. If you have only one node and it crashes (e.g. a hardware failure or kernel panic), it will take some minutes before this is detected and a few more minutes before a new node is up.

> The CPU request for the inactive services is set to 100m because they do not work well with less than 100m when they are busy.

> I think the problem is that although these services need 100m of CPU to work properly, they are idle most of the time.

The CPU request is a guaranteed, reserved amount of resources. Here you reserve too many resources for your almost-idle services. Set the CPU request lower, maybe as low as 20m or even 5m. But since these services will need more resources during busy periods, set a higher limit so that the container can "burst", and also use a Horizontal Pod Autoscaler for them. When you use the Horizontal Pod Autoscaler, more replicas are created and the traffic is load balanced across all of them. Also see Managing Resources for Containers.
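
As a rough sketch (the names and numbers are only placeholders; tune them to your own measurements, e.g. from kubectl top pods), the container spec for an almost-idle service could look like this:

```yaml
# Sketch of a container spec for an almost-idle service:
# a small guaranteed request, and a higher limit so it can burst when busy.
containers:
- name: inactive-service                    # placeholder name
  image: example/inactive-service:latest    # placeholder image
  resources:
    requests:
      cpu: 20m      # small reservation; 30 of these only add up to 600m
    limits:
      cpu: 500m     # allows bursting when the service gets busy
```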

The same applies to your "busy services": reserve fewer CPU resources and use Horizontal Pod Autoscaling more actively, so that the traffic is spread to more nodes during high load but can scale down and save cost when traffic is low.
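
A minimal Horizontal Pod Autoscaler for one of the busy services could look like the sketch below (the target name and thresholds are assumptions). Note that the utilization target is measured against the CPU request, so with low requests the HPA will scale out early:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: busy-service            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: busy-service          # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 4                # allows spreading to more nodes under load
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80  # percent of the CPU request, not of a full core
```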

> I really hope that all services can autoscale; I think this is the benefit of Kubernetes, that it can schedule pods more flexibly. Is my idea wrong?

Yes, I agree with you.

> Shouldn't I set a CPU request for the inactive services?

It is good practice to always set some value for requests and limits, at least in a production environment. Scheduling and autoscaling will not work well without resource requests.
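
If you want a safety net so that nothing runs without requests and limits, a LimitRange in the namespace can apply defaults. A minimal sketch, with names and values that are only an illustration:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-cpu            # placeholder name
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 20m                 # applied when a container specifies no request
    default:
      cpu: 500m                # applied when a container specifies no limit
```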

> If I have more busy services, the requested CPU will exceed 2000m even during off-peak hours. Is there any solution?

In general, try to use lower resource requests and use Horizontal Pod Autoscaling more actively. This is true for both your "busy services" and your "inactive services".

> I find that Kubernetes often runs more than two nodes.

Yes, there are two aspects to this.

First, if you only use two nodes, your environment is probably small, and the Kubernetes control plane probably consists of more nodes than that and makes up the majority of the cost. For very small environments Kubernetes may be expensive, and it could be more attractive to use e.g. a serverless alternative like Google Cloud Run.

Second, for availability. It is good to have at least two nodes in case of an abrupt crash (e.g. a hardware failure or kernel panic), so that your service is still available while the node autoscaler scales up a new node. The same is true for the number of replicas in a Deployment: if availability is important, use at least two replicas. When you e.g. drain a node for maintenance or a node upgrade, the pods will be evicted - but not created on a different node first. The control plane will detect that the Deployment (technically the ReplicaSet) has fewer than the desired number of replicas and create a new pod, but when that new Pod is created on a new node, the container image must first be pulled before the Pod is running. To avoid downtime during these events, use at least two replicas for your Deployment and Pod Topology Spread Constraints to make sure that those two replicas run on different nodes.
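
A minimal sketch of such a Deployment, assuming a label app: my-service (all names and values are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                    # placeholder name
spec:
  replicas: 2                         # at least two, for availability
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname   # spread replicas across nodes
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-service
      containers:
      - name: my-service
        image: example/my-service:latest      # placeholder image
        resources:
          requests:
            cpu: 20m
          limits:
            cpu: 500m
```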


Note: You might run into the same problem as described in "How to use K8S HPA and autoscaler when Pods normally need low CPU but periodically scale", and that should be mitigated by an upcoming Kubernetes feature: KEP - Trimaran: Real Load Aware Scheduling.