I managed to do this recently (just this week). I'll outline my solution and all the gotchas, in case that helps.
Starting with an AKS cluster, I installed the following components in order to harvest the GPU metrics:
- nvidia-device-plugin - to make GPU metrics collectable
- dcgm-exporter - a daemonset to reveal GPU metrics on each node
- kube-prometheus-stack - to harvest the GPU metrics and store them
- prometheus-adapter - to make harvested, stored metrics available to the k8s metrics server
The AKS cluster comes with a metrics server built in, so you don't need to worry about that. It is also possible to provision the cluster with the nvidia-device-plugin already applied, but currently not via Terraform (see: Is it possible to use aks custom headers with the azurerm_kubernetes_cluster resource?), which is how I was deploying my cluster.
To install all this stuff I used a script much like the following:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add gpu-helm-charts https://nvidia.github.io/gpu-monitoring-tools/helm-charts
helm repo update
echo "Installing the NVIDIA device plugin..."
helm install nvdp/nvidia-device-plugin \
--generate-name \
--set migStrategy=mixed \
--version=0.9.0
echo "Installing the Prometheus/Grafana stack..."
helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--values ./kube-prometheus-stack.values
prometheus_service=$(kubectl get svc -nprometheus -lapp=kube-prometheus-stack-prometheus -ojsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--namespace prometheus \
--set rbac.create=true,prometheus.url=http://${prometheus_service}.prometheus.svc.cluster.local,prometheus.port=9090
helm install gpu-helm-charts/dcgm-exporter \
--generate-name
Actually, I'm lying about the dcgm-exporter. I was experiencing a problem (my first "gotcha") where the dcgm-exporter was not responding to liveness requests in time and was consistently entering a CrashLoopBackOff state (https://github.com/NVIDIA/gpu-monitoring-tools/issues/120). To get around this, I created my own dcgm-exporter k8s config (by taking the details from https://github.com/NVIDIA/gpu-monitoring-tools and modifying them slightly) and applied it.
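The part I relaxed was the liveness probe on the daemonset's container. A minimal sketch of the kind of adjustment, assuming the stock manifest from gpu-monitoring-tools as the base (the timing values here are my own illustrative choices, not NVIDIA's defaults):

```yaml
# Excerpt from the dcgm-exporter DaemonSet container spec.
# Probe timings below are illustrative; loosen them until the pod
# stops failing liveness checks during DCGM start-up on GPU nodes.
livenessProbe:
  httpGet:
    path: /health
    port: 9400
  initialDelaySeconds: 30   # give DCGM time to initialise
  periodSeconds: 10
  timeoutSeconds: 5         # the stock 1s timeout was too tight for my nodes
  failureThreshold: 6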
In doing this I hit my second "gotcha": in the latest dcgm-exporter images some GPU metrics, such as DCGM_FI_DEV_GPU_UTIL, have been removed, largely because they are resource-intensive to collect (see https://github.com/NVIDIA/gpu-monitoring-tools/issues/143). If you want to re-enable them, make sure you run the dcgm-exporter with the arguments set as ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"], OR create your own image and supply your own metrics list, which is what I did by using this Dockerfile:
FROM nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04
RUN sed -i -e '/^# DCGM_FI_DEV_GPU_UTIL.*/s/^#\ //' /etc/dcgm-exporter/default-counters.csv
ENTRYPOINT ["/usr/local/dcgm/dcgm-exporter-entrypoint.sh"]
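As a sanity check, here is what that sed line does, demonstrated against a stand-in counters file (the file content below is made up; the real default-counters.csv has many more entries):

```shell
# Reproduce the Dockerfile's sed against a stand-in counters file:
# it strips the leading "# " from the DCGM_FI_DEV_GPU_UTIL line,
# re-enabling that metric.
printf '# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization.\n' > /tmp/counters.csv
sed -i -e '/^# DCGM_FI_DEV_GPU_UTIL.*/s/^#\ //' /tmp/counters.csv
cat /tmp/counters.csv   # -> DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization.
```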
Another thing you can see from the above script is that I used my own values file for the kube-prometheus-stack helm chart. I followed the instructions from NVIDIA's site (https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html), but found my third "gotcha" in the additionalScrapeConfigs section.
What I learned was that, in the final deployment, the HPA has to be in the same namespace as the Deployment it's scaling (identified by scaleTargetRef), otherwise it can't find it, as you probably already know. But just as importantly, the dcgm-exporter Service also has to be in the same namespace, otherwise the HPA can't find the metrics it needs to scale by.
So I changed the additionalScrapeConfigs to target the relevant namespace. I'm sure there's a way to use the relabel_configs section to keep dcgm-exporter in a different namespace and still have the HPA find the metrics, but I haven't had time to learn that voodoo yet.
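For concreteness, the relevant excerpt of my values file looked something like this (adapted from the scrape job in the NVIDIA docs above; my-namespace is a placeholder for wherever your HPA and Deployment live):

```yaml
# Excerpt from kube-prometheus-stack.values. The gpu-metrics scrape job
# is restricted to the namespace shared by the HPA, the Deployment,
# and the dcgm-exporter Service.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: gpu-metrics
      scrape_interval: 1s
      metrics_path: /metrics
      scheme: http
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - my-namespace   # placeholder: your HPA's namespace
```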
Once I had all of that, I could check that the DCGM metrics were being made available to the kube metrics server:
$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
In the resulting list you really want to see a services entry, like so:
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
"name": "services/DCGM_FI_DEV_GPU_UTIL",
"name": "pods/DCGM_FI_DEV_GPU_UTIL",
If you don't, it probably means that the dcgm-exporter deployment you used is missing the ServiceAccount component, and the HPA still won't work.
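In case it's useful, a minimal ServiceAccount plus Service for dcgm-exporter looks roughly like this (sketched from the gpu-monitoring-tools manifests; the labels and namespace are from my setup and may differ in yours):

```yaml
# Minimal ServiceAccount and Service for dcgm-exporter, in the same
# namespace as the HPA. The selector must match the labels on the
# dcgm-exporter daemonset pods.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: my-namespace   # placeholder: your HPA's namespace
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: my-namespace
  labels:
    app.kubernetes.io/name: dcgm-exporter
spec:
  selector:
    app.kubernetes.io/name: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
```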
Finally, I wrote my HPA something like this:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: my-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: X
  maxReplicas: Y
  ...
  metrics:
  - type: Object
    object:
      metricName: DCGM_FI_DEV_GPU_UTIL
      targetValue: 80
      target:
        kind: Service
        name: dcgm-exporter
and it all worked.
I hope this helps! I spent so long trying different methods shown on consultancy company blogs, Medium posts, etc., before discovering that the people who write these pieces have already made assumptions about your deployment which affect details you really need to know about (e.g. the namespacing issue).