5 votes

We are using Ansible to provision multiple nodes as a cluster. The machines are instances created on a custom, AWS-like infrastructure. We have about a hundred tasks spread across different playbooks, and they are executed on each node.

The problem is that we are getting sporadic "host unreachable" errors, and playbook execution stops with the following failure:

TASK [common : install basic packages] *************************
fatal: [fqdn.for.a.node]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}

Output with -vvv:

TASK [common : install basic packages] *******************************
task path: /jenkins/workspace/Cluster-Deployment/91/roles/common/tasks/install-basic-packages.yml:1
<fqdn.for.a.node> ESTABLISH SSH CONNECTION FOR USER: root
<fqdn.for.a.node> SSH: EXEC ssh -C -q -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=600 -o ControlPath=/home/turkenh/.ansible/cp/ansible-ssh-%h-%p-%r fqdn.for.a.node '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1466523588.96-210828884892875 `" && echo ansible-tmp-1466523588.96-210828884892875="` echo $HOME/.ansible/tmp/ansible-tmp-1466523588.96-210828884892875 `" ) && sleep 0'"'"''
failed: [fqdn.for.a.node] (item=[u'unzip']) => {"item": ["unzip"], "msg": "Failed to connect to the host via ssh.", "unreachable": true}

Here is our ansible.cfg file:

[defaults]
forks = 50
sudo_flags=-i
nocows=1

# do not check host key while doing ssh
host_key_checking = False
# use openssh not paramiko
transport = ssh
private_key_file = id_rsa
remote_user = root

Please see our notes below:

  • When we try to ping that host with Ansible's ping module (not the ping shell command; see the ad-hoc command after this list) right after the failure, it throws the same error, but if we wait for about a minute or so, the ping succeeds.

  • What we can say about our custom AWS-based infrastructure is that there may be sporadic connection issues from time to time, which do not last longer than, say, 1-2 minutes.

  • We tried setting the timeout parameter to a large value (e.g. 600) in ansible.cfg, but it did not help.

  • We are provisioning Ubuntu, Red Hat and SUSE nodes, but no matter the OS, we hit this error on roughly 20% of runs.

  • It does not fail at the same or similar tasks in the playbook each time; it fails at random ones (sometimes in the setup module, sometimes in the package module, ...).

  • Our Ansible version is 2.1 (installed with pip); the workstation OS is Ubuntu 14.04.
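
For reference, the ad-hoc check mentioned in the first note looks roughly like this (hostname and key path are illustrative):

ansible fqdn.for.a.node -m ping -u root --private-key id_rsa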

So, what we need is a way to tell Ansible: if you see a node as unreachable, do not give up with a failure right away; wait for some time or retry n times before declaring it unreachable. How can we do this?

1
If this happens in the process of spinning up new servers, consider using wait_for. We use it after starting new cloud servers to wait for SSH to become available and then proceed with the tasks for these new servers. – Konstantin Suvorov
Actually, I already have a wait_for task which runs right after creating the AWS instances and waits until SSH is ready. I am encountering the issue at later steps, i.e. after installing some packages / starting some services, etc. And as I mentioned above, the failing task is not the same on different runs. But I may consider adding a pre_task to each role which waits until SSH is ready (see the sketch below), because the issue seems to happen between role transitions. Thank you! – turkenh
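
A minimal sketch of such a pre_tasks entry, assuming the control machine can reach the node directly on port 22 (the delay and timeout values are illustrative):

pre_tasks:
  # Runs on the control machine, so it does not itself need SSH to the flaky node.
  - name: Wait until SSH is reachable again
    local_action:
      module: wait_for
      host: "{{ inventory_hostname }}"
      port: 22
      delay: 5
      timeout: 120
    become: false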

1 Answer

6 votes

Formally answering your question: you can increase the number of SSH connection attempts in your inventory file with ansible_ssh_common_args="-o ConnectionAttempts=20". Specify it for the problem host, for a group of hosts, or for the all virtual group (e.g. in the group_vars/all.yml file).
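
For example, in group_vars/all.yml it could look like this (a sketch; 20 attempts is an arbitrary choice):

# group_vars/all.yml
ansible_ssh_common_args: "-o ConnectionAttempts=20"

ConnectionAttempts is a standard OpenSSH client option: ssh makes that many tries, one per second, before giving up, which should ride out connectivity blips of the kind you describe.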

There is also the ssh_args configuration option, but I prefer not to modify it, because it overwrites Ansible's default SSH arguments.
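
If you do decide to change ssh_args instead, you have to re-specify the stock options yourself; a sketch, assuming Ansible 2.1's defaults (the first three options below):

[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o ConnectionAttempts=20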