I'm running a Kubernetes cluster in EKS, but for some reason the nodeSelector attribute on a deployment isn't always being followed.
Three workloads (one StatefulSet, two Deployments):
1 - Cassandra:
kind: StatefulSet
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  ...
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
        ...
      nodeSelector:
        layer: "backend"
2 - Kafka:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    service: kafka
  ...
spec:
  containers:
  - image: strimzi/kafka:0.11.3-kafka-2.1.0
    ...
  nodeSelector:
    layer: "backend"
  ...
3 - Zookeeper:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    service: zookeeper
  ...
spec:
  containers:
  - image: strimzi/kafka:0.11.3-kafka-2.1.0
    ...
  nodeSelector:
    layer: "backend"
  ...
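Since the manifests above are truncated, here is a minimal sketch of the layout I believe I'm using, with the elided parts filled in by placeholder values (names and labels below are illustrative only). The nodeSelector sits in the pod template's spec, as a sibling of containers:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kafka               # placeholder name
spec:
  replicas: 1
  template:
    metadata:
      labels:
        service: kafka
    spec:
      containers:
      - name: kafka
        image: strimzi/kafka:0.11.3-kafka-2.1.0
      nodeSelector:
        layer: "backend"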
Note - all three have the nodeSelector "layer=backend" in the pod spec, alongside the container definitions. I only have two "backend" nodes, yet when I look at the pods I see:
% kubectl get all -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE                                    NOMINATED NODE   READINESS GATES
pod/cassandra-0                  1/1     Running   0          9m32s   10.1.150.39    ip-...-27.us-west-2.compute.internal    <none>           <none>
pod/cassandra-1                  1/1     Running   0          7m56s   10.1.100.7     ip-...-252.us-west-2.compute.internal   <none>           <none>
pod/cassandra-2                  1/1     Running   0          6m46s   10.1.150.254   ip-...-27.us-west-2.compute.internal    <none>           <none>
pod/kafka-56dcd8665d-hfvz4       1/1     Running   0          9m32s   10.1.100.247   ip-...-252.us-west-2.compute.internal   <none>           <none>
pod/zookeeper-7f74f96f56-xwjjt   1/1     Running   0          9m32s   10.1.100.128   ip-...-154.us-west-2.compute.internal   <none>           <none>
They are placed on three different nodes - 27, 252 and 154. Looking at the "layer" label on each of those:
> kubectl describe node ip-...-27.us-west-2.compute.internal | grep layer
layer=backend
> kubectl describe node ip-...-252.us-west-2.compute.internal | grep layer
layer=backend
> kubectl describe node ip-...-154.us-west-2.compute.internal | grep layer
layer=perf
The 154 node has the label "perf", not "backend", so per my understanding of nodeSelector the zookeeper pod should never have been placed there. I've deleted everything (including the nodes themselves) and retried a few times; sometimes it's kafka that lands on the wrong node, sometimes zookeeper, but reliably something gets scheduled where it shouldn't be.
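If it helps anyone answering, the selector that actually landed on a pod and the node labels can be cross-checked like this (pod name copied from the listing above):

# Show the nodeSelector recorded on the scheduled pod
kubectl get pod zookeeper-7f74f96f56-xwjjt -o jsonpath='{.spec.nodeSelector}'

# List all nodes with the value of the "layer" label as an extra column
kubectl get nodes -L layer

# List only the nodes that actually carry layer=backend
kubectl get nodes -l layer=backend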
As near as I can tell, the nodes I do want have plenty of capacity, and even if they didn't, I would expect the pod to be left unschedulable (Pending) rather than have the nodeSelector silently ignored.
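For what it's worth, this is where I'd expect a scheduling failure to show up if no node matched the selector (pod name copied from the listing above):

# Look at the Events section of a pod for scheduling warnings
kubectl describe pod kafka-56dcd8665d-hfvz4

# Or filter cluster events down to scheduling failures
kubectl get events --field-selector reason=FailedScheduling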
What am I missing? Is nodeSelector not 100% reliable? Is there another way I can force pods to only be placed on nodes with specific labels?
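One alternative I've seen suggested (but have not tried yet) is a hard node-affinity rule in the same place as the nodeSelector, i.e. in the pod template's spec. Sketch only, with the same placeholder names as above:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: layer
            operator: In
            values:
            - backend
  containers:
  - name: kafka
    image: strimzi/kafka:0.11.3-kafka-2.1.0

The other option I'm aware of is taints and tolerations, but my understanding is that a plain nodeSelector should already be enough to keep these pods on the "backend" nodes.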