I'm deploying Helm charts with community.kubernetes.helm without issue, but every now and then the connection to the cluster is refused, and it's not clear how best to configure retries/until/delay around the task. Here's an example (DNS/IP faked) showing that the failure is as simple as not being able to connect to the cluster:

fatal: [localhost]: FAILED! => {"changed": false, "command": "/usr/local/bin/helm --kubeconfig /var/opt/kubeconfig --namespace=gpu-operator list --output=yaml --filter gpu-operator", "msg": "Failure when executing Helm command. Exited 1.\nstdout: \nstderr: Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused\n", "stderr": "Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused\n", "stderr_lines": ["Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused"], "stdout": "", "stdout_lines": []}

In my experience, a plain retry works. I agree it would be ideal to figure out why I can't connect to the service in the first place, but it would be even more practical to work around it with a catch-all "until" loop that reruns this block until it succeeds, or gives up after N tries while pausing N seconds between attempts.

Here's an example of the Ansible block:

- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      until: ??? 
      retries: 5
      delay: 3
  when: GPU_NODE is defined

I would really appreciate any suggestions/pointers.


1 Answer


I discovered that registering the output and then testing in until whether a field is defined gets Ansible to rerun the task. The key is knowing what successful output looks like. The helm module returns a status field when it runs correctly, so this is what you need to add:

  register: _gpu_result
  until: _gpu_result.status is defined
  ignore_errors: true
  retries: 5
  delay: 3

retries/delay are up to you.
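
For completeness, here's a rough sketch of how the task from the question might look with those lines folded in (variable names such as CHARTS_DIR, STATE_DIR, INSTANCE_NAME, and GPU_NODE are the ones from the question, and _gpu_result is just an arbitrary register name):

- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      register: _gpu_result
      # the helm module only returns a status field on a successful run,
      # so keep retrying until it shows up (5 tries, 3 seconds apart)
      until: _gpu_result.status is defined
      retries: 5
      delay: 3
      # lets the play continue even if every retry fails; drop this if you
      # would rather the play stop on a persistent connection failure
      ignore_errors: true
  when: GPU_NODE is defined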