I'm deploying Helm charts with community.kubernetes.helm, and it mostly works well, but every now and then Helm can't reach the cluster and the connection is refused. It's not clear to me how best to configure retries/delay/until for this case. Here's an example (DNS/IP faked) showing that the problem is simply a failure to connect to the cluster:
fatal: [localhost]: FAILED! => {"changed": false, "command": "/usr/local/bin/helm --kubeconfig /var/opt/kubeconfig --namespace=gpu-operator list --output=yaml --filter gpu-operator", "msg": "Failure when executing Helm command. Exited 1.\nstdout: \nstderr: Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused\n", "stderr": "Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused\n", "stderr_lines": ["Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused"], "stdout": "", "stdout_lines": []}
In my experience, simply retrying the task works. I agree that it would be ideal to figure out why I can't connect to the service, but in the meantime I'd like to work around it with a catch-all "until" condition that retries the task until it succeeds, giving up after N tries and pausing M seconds between tries.
Here's an example of the ansible block:
- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      until: ???
      retries: 5
      delay: 3
  when: GPU_NODE is defined
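For reference, the shape I have in mind is roughly the sketch below. The variable name `helm_result` is just a placeholder of mine, and I'm assuming that a refused connection makes the module return a failed result, so that the standard `failed` test can be used in `until`:

```yaml
- name: deploy gpu operator
  community.kubernetes.helm:
    name: gpu-operator
    chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
    create_namespace: yes
    release_namespace: gpu-operator
    kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
  register: helm_result             # placeholder name, holds the module result
  until: helm_result is not failed  # retry while the helm call reports failure
  retries: 5                        # give up after 5 retries
  delay: 3                          # wait 3 seconds between tries
```

This would retry on any failure, not only on "connection refused", which is what I mean by catch-all. I haven't confirmed this is the recommended way to do it with this module.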
I would really appreciate any suggestions/pointers.