
I'm scaling an ML prediction workload based on CPU and memory utilization. I use an HPA for pod-level scaling, with both CPU and memory metrics specified. When creating the Deployment, I also specified resource requests and limits for the containers. (I have pasted both the HPA configuration and the pod template configuration below for reference.)

I observed that although we specified resource requests and limits, when I check the memory and CPU consumed by each pod, only one pod is consuming all the CPU and memory while the others consume very little. My understanding is that all pods should consume roughly equal resources; otherwise we can't really say the workload is scaled, and it's effectively like running the code on a single machine without Kubernetes.

Note: I'm using the Python Kubernetes client to create the Deployment and Services, not YAML.

I have tried tweaking the limits and requests parameters and observed that, because this is an ML pipeline, memory and CPU consumption spike sharply at some point.

My HPA configuration:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  namespace: default
  name: #hpa_name
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: #deployment_name
  minReplicas: 1
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 80
    - type: Resource
      resource:
        name: memory
        targetAverageValue: 5Gi
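
Since the Deployment and Services are created with the Python client rather than YAML (see the note above), the same HPA can also be created programmatically. A minimal sketch, assuming a kubernetes Python client version that still ships the autoscaling/v2beta1 models; the HPA and Deployment names here are placeholders:

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

hpa = client.V2beta1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="ml-prediction-hpa", namespace="default"),
    spec=client.V2beta1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2beta1CrossVersionObjectReference(
            api_version="apps/v1beta1", kind="Deployment", name="ml-prediction"
        ),
        min_replicas=1,
        max_replicas=40,
        metrics=[
            client.V2beta1MetricSpec(
                type="Resource",
                resource=client.V2beta1ResourceMetricSource(
                    name="cpu", target_average_utilization=80
                ),
            ),
            client.V2beta1MetricSpec(
                type="Resource",
                resource=client.V2beta1ResourceMetricSource(
                    name="memory", target_average_value="5Gi"
                ),
            ),
        ],
    ),
)
client.AutoscalingV2beta1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)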

My pod template code:

container = client.V1Container(
    name="ml-prediction",  # placeholder name; not part of the original snippet
    image="<image>",       # image omitted in the original snippet
    ports=[client.V1ContainerPort(container_port=8080)],
    env=[client.V1EnvVar(name="ABC", value="12345")],  # env var values must be strings
    resources=client.V1ResourceRequirements(
        limits={"cpu": "2", "memory": "22Gi"},
        requests={"cpu": "1", "memory": "8Gi"},
    ),
)
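
For completeness, here is a minimal sketch of how a container like the one above is wrapped into a Deployment with the Python client; the name, labels and replica count are placeholders rather than values from our actual code:

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# Pod template and Deployment built around the container defined above.
pod_template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "ml-prediction"}),
    spec=client.V1PodSpec(containers=[container]),
)
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="ml-prediction", namespace="default"),
    spec=client.V1DeploymentSpec(
        replicas=1,  # the HPA scales this between minReplicas and maxReplicas
        selector=client.V1LabelSelector(match_labels={"app": "ml-prediction"}),
        template=pod_template,
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)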

Output of kubectl top pods:

NAME                                                              CPU(cores)   MEMORY(bytes)   
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-77c6ds   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7d5n4l   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7dq6c9   14236m       16721Mi         
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7f6nmh   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7fzdc4   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7gvqtj   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7h6ld7   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7j7gv4   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7kxlvh   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7nnn8x   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7pmtnj   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7qflkh   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7s26cj   1m           176Mi           
deploy-24b32e5dc388456f8f2263be39ffb5f7-de19236511504877-7st5lt   1m           176Mi           

From the above output it is clear that the third pod is utilizing most of the resources, while the others sit at a constant and very low memory and CPU consumption.

My expectation is that each pod should consume approximately equal resources, based on the requests and limits specified in the pod template: in this case between 1 and 2 CPUs and between 8 Gi and 22 Gi of memory (or less than the requested resources, but never beyond the defined limits).

Thanks in advance for any pointers/help/hints.

I'm not very familiar with the ML side of this, but I'm curious how each of these new pods gets work to do? They don't look like they're doing anything to share the load. Do you have a Service in front of this? Do they all check in with some sort of control plane or scheduler? – switchboard.op
We have exposed the service as a NodePort and access it via <NodePortIp>:<port>. – Anurag Srivastava
Do you submit work to the pods with many different calls to that NodePort? Those should be split among the Service endpoints. Or do you submit one big job in a single request? Also, how long does it take one of these pods to become ready? – switchboard.op
We submit one big job in a single request. Our application is gRPC-based, and gRPC uses HTTP/2, which keeps one long-lived TCP connection and multiplexes requests over it to minimize connection-management overhead. I guess this is the issue, but I'm not sure. – Anurag Srivastava
@switchboard.op, thanks. Yes, we found a way to distribute the load: we used Linkerd. We followed the solution described in kubernetes.io/blog/2018/11/07/… and it worked. – Anurag Srivastava

1 Answer


As per the RCA (root cause analysis) of this issue, we verified the behaviour by running ipvsadm -ln while processing a workload in our Kubernetes cluster, and found that only one TCP connection was made for the payload. This caused all the work to be concentrated in one pod even though other pods were available.

Our application is based on gRPC, and gRPC uses HTTP/2. HTTP/2 creates a single long-lived TCP connection and multiplexes requests over it to minimize TCP connection management overhead. Because of this, there was one long-lived TCP connection attached to a single pod: when memory and CPU spiked, the HPA scaled out the pods, but the load did not get distributed. We therefore needed a mechanism that goes one step beyond connection-level load balancing (the default in Kubernetes) to request-level load balancing.
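
To illustrate the point about multiplexing: with the Python gRPC client, one channel corresponds to one long-lived HTTP/2 connection, so every request sent through it lands on whichever pod that connection was routed to when it was opened. The service address and stub below are placeholders, not our actual code:

import grpc

# One grpc.Channel == one long-lived HTTP/2 (TCP) connection. kube-proxy
# balances at the connection level, so every RPC multiplexed over this
# channel ends up on the same backend pod.
channel = grpc.insecure_channel("ml-prediction.default.svc.cluster.local:8080")

# Hypothetical generated stub for illustration; every call reuses the single
# underlying connection and therefore hits the same pod.
# stub = predictor_pb2_grpc.PredictorStub(channel)
# response = stub.Predict(request)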

Fortunately, we found the solution below; we followed it and it worked for us.

https://kubernetes.io/blog/2018/11/07/grpc-load-balancing-on-kubernetes-without-tears/
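
For reference, one of the options described in the linked post is a headless Service (clusterIP: None), which lets a gRPC client resolve the individual pod IPs via DNS and balance requests itself; we went with Linkerd instead. A rough sketch with the Python client, with names, labels and ports as placeholders:

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# Headless Service: no virtual IP; DNS returns the individual pod IPs so the
# client (or a mesh sidecar) can balance requests across them.
headless_svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="ml-prediction-headless", namespace="default"),
    spec=client.V1ServiceSpec(
        cluster_ip="None",                  # marks the Service as headless
        selector={"app": "ml-prediction"},  # must match the Deployment's pod labels
        ports=[client.V1ServicePort(port=8080, target_port=8080)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=headless_svc)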