I managed to do this recently (just this week). I'll outline my solution and all the gotchas, in case that helps.
Starting with an AKS cluster, I installed the following components in order to harvest the GPU metrics:
- nvidia-device-plugin - to make GPU metrics collectable
- dcgm-exporter - a daemonset to reveal GPU metrics on each node
- kube-prometheus-stack - to harvest the GPU metrics and store them
- prometheus-adapter - to make harvested, stored metrics available to the k8s metrics server
The AKS cluster comes with a metrics server built in, so you don't need to worry about that. It is also possible to provision the cluster with the nvidia-device-plugin already applied, but currently not via Terraform (see: Is it possible to use aks custom headers with the azurerm_kubernetes_cluster resource?), which is how I was deploying my cluster.
To install all this stuff I used a script much like the following:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add gpu-helm-charts https://nvidia.github.io/gpu-monitoring-tools/helm-charts
helm repo update
echo "Installing the NVIDIA device plugin..."
helm install nvdp/nvidia-device-plugin \
--generate-name \
--set migStrategy=mixed \
--version=0.9.0
echo "Installing the Prometheus/Grafana stack..."
helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--values ./kube-prometheus-stack.values
prometheus_service=$(kubectl get svc -nprometheus -lapp=kube-prometheus-stack-prometheus -ojsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--namespace prometheus \
--set rbac.create=true,prometheus.url=http://${prometheus_service}.prometheus.svc.cluster.local,prometheus.port=9090
helm install gpu-helm-charts/dcgm-exporter \
--generate-name
Actually, I'm lying about the dcgm-exporter. I was experiencing a problem (my first "gotcha") where the dcgm-exporter was not responding to liveness requests in time and was consistently entering a CrashLoopBackOff state (https://github.com/NVIDIA/gpu-monitoring-tools/issues/120). To get around this, I created my own dcgm-exporter k8s config (by taking the details from https://github.com/NVIDIA/gpu-monitoring-tools and modifying them slightly) and applied it.
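The part I relaxed was the liveness probe on the daemonset's container. A minimal sketch of the kind of adjustment, assuming the stock manifest from gpu-monitoring-tools as the base (the timing values here are my own illustrative choices, not NVIDIA's defaults):

```yaml
# Excerpt from the dcgm-exporter DaemonSet container spec.
# Probe timings below are illustrative; loosen them until the pod
# stops failing liveness checks during DCGM start-up on GPU nodes.
livenessProbe:
  httpGet:
    path: /health
    port: 9400
  initialDelaySeconds: 30   # give DCGM time to initialise
  periodSeconds: 10
  timeoutSeconds: 5         # the stock 1s timeout was too tight for my nodes
  failureThreshold: 6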
In doing this I hit my second "gotcha": in the latest dcgm-exporter images some GPU metrics, such as DCGM_FI_DEV_GPU_UTIL, have been removed, largely because they are resource-intensive to collect (see https://github.com/NVIDIA/gpu-monitoring-tools/issues/143). If you want to re-enable them, make sure you run the dcgm-exporter with the arguments set as ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"], OR create your own image and supply your own metrics list, which is what I did by using this Dockerfile:
FROM nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04
RUN sed -i -e '/^# DCGM_FI_DEV_GPU_UTIL.*/s/^#\ //' /etc/dcgm-exporter/default-counters.csv
ENTRYPOINT ["/usr/local/dcgm/dcgm-exporter-entrypoint.sh"]
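As a sanity check, here is what that sed line does, demonstrated against a stand-in counters file (the file content below is made up; the real default-counters.csv has many more entries):

```shell
# Reproduce the Dockerfile's sed against a stand-in counters file:
# it strips the leading "# " from the DCGM_FI_DEV_GPU_UTIL line,
# re-enabling that metric.
printf '# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization.\n' > /tmp/counters.csv
sed -i -e '/^# DCGM_FI_DEV_GPU_UTIL.*/s/^#\ //' /tmp/counters.csv
cat /tmp/counters.csv   # -> DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization.
```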
Another thing you can see from the above script is that I used my own values file for the kube-prometheus-stack helm chart. I followed the instructions from NVIDIA's site (https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html), but found my third "gotcha" in the additionalScrapeConfigs section.
What I learned was that, in the final deployment, the HPA has to be in the same namespace as the Deployment it's scaling (identified by scaleTargetRef), otherwise it can't find it, as you probably already know. But just as importantly, the dcgm-exporter Service also has to be in the same namespace, otherwise the HPA can't find the metrics it needs to scale by.
So I changed the additionalScrapeConfigs to target the relevant namespace. I'm sure there's a way to use the relabel_configs section to keep dcgm-exporter in a different namespace and still have the HPA find the metrics, but I haven't had time to learn that voodoo yet.
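For concreteness, the relevant excerpt of my values file looked something like this (adapted from the scrape job in the NVIDIA docs above; my-namespace is a placeholder for wherever your HPA and Deployment live):

```yaml
# Excerpt from kube-prometheus-stack.values. The gpu-metrics scrape job
# is restricted to the namespace shared by the HPA, the Deployment,
# and the dcgm-exporter Service.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: gpu-metrics
      scrape_interval: 1s
      metrics_path: /metrics
      scheme: http
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - my-namespace   # placeholder: your HPA's namespace
```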
Once I had all of that, I could check that the DCGM metrics were being made available to the kube metrics server:
$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
In the resulting list you really want to see a services entry, like so:
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
"name": "services/DCGM_FI_DEV_GPU_UTIL",
"name": "pods/DCGM_FI_DEV_GPU_UTIL",
If you don't, it probably means that the dcgm-exporter deployment you used is missing the ServiceAccount component, and the HPA still won't work.
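In case it's useful, a minimal ServiceAccount plus Service for dcgm-exporter looks roughly like this (sketched from the gpu-monitoring-tools manifests; the labels and namespace are from my setup and may differ in yours):

```yaml
# Minimal ServiceAccount and Service for dcgm-exporter, in the same
# namespace as the HPA. The selector must match the labels on the
# dcgm-exporter daemonset pods.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: my-namespace   # placeholder: your HPA's namespace
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: my-namespace
  labels:
    app.kubernetes.io/name: dcgm-exporter
spec:
  selector:
    app.kubernetes.io/name: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
```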
Finally, I wrote my HPA something like this:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: my-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: X
  maxReplicas: Y
  ...
  metrics:
  - type: Object
    object:
      metricName: DCGM_FI_DEV_GPU_UTIL
      targetValue: 80
      target:
        kind: Service
        name: dcgm-exporter
and it all worked.
I hope this helps! I spent so long trying different methods shown on consultancy company blogs, Medium posts, etc., before discovering that the people who write these pieces have already made assumptions about your deployment which affect details you really need to know about (e.g. the namespacing issue).