1 vote

From searching and from observing the behaviour of our cluster, it seems that when a HorizontalPodAutoscaler scales down, it picks the youngest pod, or at least one of the youngest pods.

Is there a way we can get it to scale down the oldest one?

What is the rationale for scaling down the newest one? We end up having many pods that live for a very short time, and a few that live for a very long time. (Those very long-lived pods end up using a lot of memory.)

Or perhaps we could get it to pick at random?

Picking the youngest pod causes another problem for us: we have a preferred affinity for nodes with a certain characteristic, and scale-down keeps removing the pods on the nodes we want them on, while leaving the old pods on the other nodes.
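For concreteness, the affinity is a preferred (soft) node affinity, roughly like the Go sketch below; the label key and value are placeholders, not our real ones.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// A preferred (soft) node affinity of the kind described above.
	// "example.com/node-pool" and "preferred-pool" are placeholder names.
	affinity := &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.PreferredSchedulingTerm{{
				Weight: 100,
				Preference: corev1.NodeSelectorTerm{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "example.com/node-pool",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"preferred-pool"},
					}},
				},
			}},
		},
	}
	fmt.Printf("%+v\n", affinity)
}
```

The scheduler prefers matching nodes when a new pod is added on scale-up, but, as described above, nothing in the scale-down path seems to take that preference into account.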

Well, I guess to the extent that the answer is "No", it does. But the last of the bullet points says that if two pods differ only by age, it removes the younger one. It seems like that rule should not be there, or it should be the other way around: removing the younger one undermines preferential affinities and flies in the face of what I think of as conventional wisdom. - nroose
And it certainly does not explain the rationale for that part of the priority calculation, which is part of my question. - nroose
There is no way to configure which pods will be deleted. As shown in LauriKoskeka's link, there are certain criteria for deleting a pod, but you cannot force your own rule. But if you are using affinity, the pods should end up running on the nodes you want. What is the problem with that? - Mr.KoopaKiller
The question is why it chooses the youngest one. What is unclear about that? The problem with affinity is that when it scales up and adds a new pod, it puts it on our preferred node pool, but then when it scales down, it removes that one, so if we are not already mostly on the preferred pool, we never get to be mostly on the preferred pool. That seems pretty clear to me too. I guess your mileage may vary. - nroose

1 Answer

0 votes

I believe this behaviour is not related to the HPA itself, but to the workload controller (for example the ReplicaSet controller) that actually deletes the pods.

As you can see here, there are several conditions that determine this ordering; a simplified sketch of a comparator applying them follows the quoted rules:

```go
// 1. If only one of the pods is assigned to a node, the pod that is not
//    assigned comes before the pod that is.
// 2. If the pods' phases differ, a pending pod comes before a pod whose phase
//    is unknown, and a pod whose phase is unknown comes before a running pod.
// 3. If exactly one of the pods is ready, the pod that is not ready comes
//    before the ready pod.
// 4. If the pods' ranks differ, the pod with greater rank comes before the pod
//    with lower rank.
// 5. If both pods are ready but have not been ready for the same amount of
//    time, the pod that has been ready for a shorter amount of time comes
//    before the pod that has been ready for longer.
// 6. If one pod has a container that has restarted more than any container in
//    the other pod, the pod with the container with more restarts comes
//    before the other pod.
// 7. If the pods' creation times differ, the pod that was created more recently
//    comes before the older pod.
```
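To make the ordering concrete, here is a minimal, simplified sketch of a comparator that applies those rules. It is not the real controller code: the struct and field names are made up, and rules 2 and 4 (phase and rank) are omitted for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// podInfo is a hypothetical, simplified stand-in for the fields the controller
// looks at when ordering pods for deletion; it is not the real Kubernetes type.
type podInfo struct {
	name       string
	assigned   bool // scheduled onto a node
	ready      bool
	readySince time.Time
	restarts   int
	created    time.Time
}

// deleteBefore reports whether a should be deleted before b, mirroring (in
// simplified form) rules 1, 3, 5, 6 and 7 quoted above: unassigned before
// assigned, not-ready before ready, ready-for-less-time before
// ready-for-longer, more restarts before fewer, and finally newer before older.
func deleteBefore(a, b podInfo) bool {
	if a.assigned != b.assigned {
		return !a.assigned // rule 1: unassigned pods go first
	}
	if a.ready != b.ready {
		return !a.ready // rule 3: not-ready pods go first
	}
	if a.ready && b.ready && !a.readySince.Equal(b.readySince) {
		return a.readySince.After(b.readySince) // rule 5: ready for a shorter time goes first
	}
	if a.restarts != b.restarts {
		return a.restarts > b.restarts // rule 6: more restarts goes first
	}
	return a.created.After(b.created) // rule 7: the younger pod goes first
}

func main() {
	now := time.Now()
	pods := []podInfo{
		{name: "old", assigned: true, ready: true, readySince: now.Add(-24 * time.Hour), created: now.Add(-24 * time.Hour)},
		{name: "new", assigned: true, ready: true, readySince: now.Add(-5 * time.Minute), created: now.Add(-5 * time.Minute)},
	}
	sort.Slice(pods, func(i, j int) bool { return deleteBefore(pods[i], pods[j]) })
	fmt.Println("scale-down victim:", pods[0].name) // prints "new"
}
```

Run against one day-old pod and one five-minute-old pod that are otherwise identical, the younger pod sorts first, which matches the behaviour you are seeing: the newest pod is the one that gets removed.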

You can see something related discussed here.

Also, it seems that work is in progress to implement this feature, as discussed here and here.
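For reference, and assuming those discussions track what later shipped as the per-pod deletion cost feature: recent Kubernetes versions (beta since 1.22) honour a controller.kubernetes.io/pod-deletion-cost annotation that the ReplicaSet controller consults before the age-based rules above, so you can mark long-lived pods as more expensive to delete. Below is a rough client-go sketch of setting it; the namespace, pod name and kubeconfig handling are placeholders.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (placeholder setup).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Mark a long-lived pod as more expensive to delete, so the controller
	// prefers to remove cheaper (e.g. newer, cost 0) pods on scale-down.
	patch := []byte(`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"100"}}}`)
	_, err = clientset.CoreV1().Pods("default").Patch(
		context.TODO(), "my-long-lived-pod", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("pod-deletion-cost annotation applied")
}
```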