What's the correct way to configure ansible tasks to make helm deployments fault tolerant of internet connection issues?

Question

I'm deploying Helm charts with community.kubernetes.helm without trouble, but I've run into conditions where the connection is refused, and it's not clear how best to configure retries/delay/until. Every now and then, helm can't communicate with the cluster. Here's an example (DNS/IP faked) showing that the issue is simply a failure to connect to the cluster:

fatal: [localhost]: FAILED! => {"changed": false, "command": "/usr/local/bin/helm --kubeconfig /var/opt/kubeconfig --namespace=gpu-operator list --output=yaml --filter gpu-operator", "msg": "Failure when executing Helm command. Exited 1.\nstdout: \nstderr: Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused\n", "stderr": "Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused\n", "stderr_lines": ["Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused"], "stdout": "", "stdout_lines": []}

In my experience, simply retrying works. I agree that it would be ideal to figure out why I can't connect to the service, but in the meantime I'd like to work around it with a catch-all "until" condition that reruns the task until it succeeds, or gives up after N tries, pausing N seconds between attempts.

Here's an example of the ansible block:

- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      until: ??? 
      retries: 5
      delay: 3
  when: GPU_NODE is defined

I would really appreciate any suggestions/pointers.

russellsimokins · Accepted Answer · 2021-03-04T18:47:26

I discovered that registering the task's output and then testing until part of it is defined gets Ansible to rerun the task. The key is learning what a successful output looks like. For helm, the module says it will define a status when it works correctly. So this is what you need to add to the task:

  register: _gpu_result
  until: _gpu_result.status is defined
  ignore_errors: true
  retries: 5
  delay: 3

The values for retries and delay are up to you.
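For reference, here is a sketch of how those lines might slot into the task from the question. It reuses the question's own variables (CHARTS_DIR, STATE_DIR, INSTANCE_NAME, GPU_NODE) and is untested against a real cluster. It leaves out ignore_errors on the assumption that the play should fail if the cluster is still unreachable after the last retry; keep that line if you would rather the play carry on regardless.

```yaml
- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      # Capture the module result so the until condition can inspect it
      register: _gpu_result
      # Rerun the task until the result contains a "status" key
      until: _gpu_result.status is defined
      # Give up after 5 retries, pausing 3 seconds between attempts
      retries: 5
      delay: 3
  when: GPU_NODE is defined
```

If you don't want to depend on a module-specific return value, a more generic condition such as `until: _gpu_result is succeeded` should behave similarly, since `succeeded` is a built-in Ansible test on registered results.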
