We are using Ansible to provision multiple nodes as a cluster. The machines are instances created on a custom, AWS-like infrastructure. We have about a hundred tasks spread across different playbooks, and they are executed on each node.
The problem is that we get sporadic host-unreachable errors, and playbook execution stops with the following failure:
TASK [common : install basic packages] *************************
fatal: [fqdn.for.a.node]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}
Output with -vvv:
TASK [common : install basic packages] *******************************
task path: /jenkins/workspace/Cluster-Deployment/91/roles/common/tasks/install-basic-packages.yml:1
<fqdn.for.a.node> ESTABLISH SSH CONNECTION FOR USER: root
<fqdn.for.a.node> SSH: EXEC ssh -C -q -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=600 -o ControlPath=/home/turkenh/.ansible/cp/ansible-ssh-%h-%p-%r fqdn.for.a.node '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1466523588.96-210828884892875 `" && echo ansible-tmp-1466523588.96-210828884892875="` echo $HOME/.ansible/tmp/ansible-tmp-1466523588.96-210828884892875 `" ) && sleep 0'"'"''
failed: [fqdn.for.a.node] (item=[u'unzip']) => {"item": ["unzip"], "msg": "Failed to connect to the host via ssh.", "unreachable": true}
Here is our ansible.cfg file:
[defaults]
forks = 50
sudo_flags=-i
nocows=1
# do not check host key while doing ssh
host_key_checking = False
# use openssh not paramiko
transport = ssh
private_key_file = id_rsa
remote_user = root
Please see our notes below:
When we try to ping that host with Ansible (using the ping module, not the ping shell command) right after the failure, it throws the same error, but if we wait for about a minute, the ping succeeds.
What we can say about our custom AWS-based infrastructure is that there may be sporadic connection issues from time to time, none of which last longer than about 1-2 minutes.
We tried setting the timeout parameter to a large value (e.g. 600) in ansible.cfg, but it did not help.
We provision Ubuntu, Red Hat and SUSE nodes, and regardless of the OS we hit this error with a probability of roughly 20%.
The failure does not occur at the same or similar tasks in our playbooks; it happens at random ones (sometimes in the setup module, sometimes in the package module, ...).
Our Ansible version is 2.1 (installed with pip), and the workstation OS is Ubuntu 14.04.
So what we need is a way to tell Ansible: if a node appears unreachable, do not give up with a failure right away; wait for some time or retry n times before reporting it as unreachable. How can we do this?
wait_for. We use it after starting new cloud servers to wait for SSH to become available, and then proceed with the tasks for these new servers. – Konstantin Suvorov
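For illustration, the wait_for approach from the comment might look roughly like the play below. This is only a sketch under assumptions: the SSH port (22), the delay and timeout values, and the placement at the start of a play are not taken from the question. It waits at one known point only, so on its own it would not catch a connection drop in the middle of a later task.

```yaml
# Sketch only: port, delay and timeout values are assumptions, not from the question.
- hosts: all
  gather_facts: no          # fact gathering itself needs SSH, so skip it here
  tasks:
    - name: wait until the SSH port on the node accepts connections
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22            # assumed SSH port
        delay: 5            # seconds to wait before the first check
        timeout: 300        # give up after 5 minutes
        state: started
      delegate_to: localhost   # run the check from the control machine, not the node
```

Because the task is delegated to localhost, it does not need a working SSH session to the node; it only checks that the port is open from the workstation.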