
I am trying to have Kubernetes create new pods on the most requested nodes instead of spreading pods across all available nodes. The rationale is that this simplifies scale-down scenarios and avoids relaunching applications when pods get moved because a node is killed during autoscaling.

The preferred strategy for scaling down is:

1) Never kill a node that still has a running pod.
2) Create new pods preferentially on the most requested nodes.
3) Pods self-destruct after job completion.

Over time this should leave nodes free once their tasks are completed, so scaling down is safe and I don't need to worry about the resilience of the running jobs.

For this, is there any way I can specify the NodeAffinity in the pod spec, something like:

    spec:
      affinity:
        nodeAffinity:
          RequiredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              nodeAffinityTerm: {MostRequestedPriority}

The above code has no effect. The documentation for nodeAffinity doesn't say whether I can use MostRequestedPriority in this context. MostRequestedPriority is an option in the Kubernetes scheduler's policy configuration, but I am trying to see whether I can put it directly in the pod spec instead of creating a new custom Kubernetes scheduler.
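For reference, a syntactically valid soft node affinity would look something like the block below (the node-pool label and high-utilization value are just placeholders, not labels from my cluster). As far as I can tell it can only match node labels, so it still doesn't let me express a MostRequestedPriority-style preference:

    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-pool          # placeholder label key
                operator: In
                values:
                - high-utilization      # placeholder label value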

1 Answer


Unfortunately there is no option to pass MostRequestedPriority to the nodeAffinity field. However, you can create a simple second scheduler to manage pod scheduling. The following configuration will be enough.

First, you have to create a ServiceAccount and a ClusterRoleBinding for this scheduler:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: own-scheduler
  namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: own-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: own-scheduler
  namespace: kube-system

Then create a ConfigMap with the desired policy, including MostRequestedPriority in the priorities section. Each entry in predicates can be modified to suit your needs; the predicates filter the nodes on which a pod can be placed. For example, the PodFitsResources predicate checks whether a node has enough available resources to meet a pod's specific resource requests:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    k8s-addon: scheduler.addons.k8s.io
  name: own-scheduler
  namespace: kube-system
data:
  policy.cfg: |-
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [
        {"name": "PodFitsHostPorts"},
        {"name": "PodFitsResources"},
        {"name": "NoDiskConflict"},
        {"name": "PodMatchNodeSelector"},
        {"name": "PodFitsHost"}
      ],
      "priorities": [
        {"name": "MostRequestedPriority", "weight": 1},
        {"name": "EqualPriorityMap", "weight": 1}
      ]
    }

Then wrap it up in a Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: own-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: own-scheduler
      containers:
      - command:
        - /usr/local/bin/kube-scheduler
        - --address=0.0.0.0
        - --leader-elect=false
        - --scheduler-name=own-scheduler
        - --policy-configmap=own-scheduler
        image: k8s.gcr.io/kube-scheduler:v1.15.4
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10251
          initialDelaySeconds: 15
        name: kube-second-scheduler
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10251
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts: []
      hostNetwork: false
      hostPID: false
      volumes: []
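
To have your workload pods actually use this scheduler, point them at it with schedulerName in the pod spec. A minimal sketch, where the pod name, container name and image are just placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: my-job-pod                 # placeholder name
spec:
  schedulerName: own-scheduler     # must match --scheduler-name above
  containers:
  - name: worker                   # placeholder container
    image: busybox                 # placeholder image
    command: ["sh", "-c", "echo done"]

Pods that don't set schedulerName: own-scheduler keep using the default scheduler, so both schedulers can run side by side.

Note that the Policy API used above was removed in later Kubernetes releases. On newer clusters the rough equivalent (a sketch, not tested with the v1.15 image from this answer) is a KubeSchedulerConfiguration passed to kube-scheduler via --config, using the MostAllocated scoring strategy of the NodeResourcesFit plugin:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: own-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated        # prefer the most requested nodes
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1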