2
votes

We are using Amazon Elastic Container Service (ECS) to spin up a cluster with autoscaling groups. Until very recently this worked fine, and for the most part it still does, except that we can no longer connect to the underlying EC2 instances over SSH with our keypair. We get SSH "permission denied" errors. This started within the last few weeks, and we have changed nothing. By contrast, we can spin up an EC2 instance directly and SSH in with the same keypair without any problem.

What I have done to investigate:

  1. Drained the ECS cluster, detached the instance from it, and stopped it.
  2. Detached the instance's root volume and attached it to a different EC2 instance.
  3. Observed that /home/ec2-user/.ssh does not exist.
  4. Found the following error in the instance's /var/log/cloud-init.log:
Oct 30 23:23:09 cloud-init[3195]: handlers.py[DEBUG]: start: init-network/config-ssh: running config-ssh with frequency once-per-instance
Oct 30 23:23:09 cloud-init[3195]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-0e13e9da194d2624a/sem/config_ssh - wb: [644] 20 bytes
Oct 30 23:23:09 cloud-init[3195]: helpers.py[DEBUG]: Running config-ssh using lock (<FileLock using file '/var/lib/cloud/instances/i-0e13e9da194d2624a/sem/config_ssh'>)
Oct 30 23:23:09 cloud-init[3195]: util.py[WARNING]: Applying ssh credentials failed!
Oct 30 23:23:09 cloud-init[3195]: util.py[DEBUG]: Applying ssh credentials failed!
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh.py", line 184, in handle
    ssh_util.DISABLE_USER_OPTS)
AttributeError: 'module' object has no attribute 'DISABLE_USER_OPTS'
Oct 30 23:23:09 cloud-init[3195]: handlers.py[DEBUG]: finish: init-network/config-ssh: SUCCESS: config-ssh ran successfully
  5. Examined the Python source code for /usr/lib/python2.7/site-packages/cloudinit. It looks OK to me; I see the reference in config/cc_ssh.py to ssh_util.DISABLE_USER_OPTS and it looks like ssh_util.py does indeed contain DISABLE_USER_OPTS as a file-level variable. (But I am not a master Python programmer, so I might be missing something subtle.)
  6. Curiously, the compiled versions of ssh_util.py and cc_ssh.py date from October 16, which raises all sorts of red flags, because we had not seen any problems with ssh until recently. But I loaded uncompyle6 and decompiled those files, and the decompiled versions seem to be OK, too.
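
For anyone who wants to repeat the source check in step 5 without reading the file by eye, here is a minimal sketch that lists the names assigned at module level in a piece of Python source. The two source strings are made-up stand-ins for an older and a newer ssh_util.py, not the real files, and the helper name module_level_names is my own.

```python
import ast


def module_level_names(source):
    """Return the set of names assigned at the top level of Python source text."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign):
            targets = [node.target]
        else:
            continue
        for target in targets:
            if isinstance(target, ast.Name):
                names.add(target.id)
    return names


# Hypothetical stand-ins for two versions of ssh_util.py (not the real contents).
OLD_SSH_UTIL = 'DEF_SSHD_CFG = "/etc/ssh/sshd_config"\n'
NEW_SSH_UTIL = OLD_SSH_UTIL + 'DISABLE_USER_OPTS = "no-port-forwarding"\n'

for label, source in (("old", OLD_SSH_UTIL), ("new", NEW_SSH_UTIL)):
    found = "DISABLE_USER_OPTS" in module_level_names(source)
    print(label, "defines DISABLE_USER_OPTS:", found)
```

To run it against a file from the mounted volume, pass the result of open(path).read() instead of the stand-in strings. This only confirms what is on disk; it says nothing about what an already-running process has loaded.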

Looking at cloud-init, it's pretty clear that if the reference to ssh_util.DISABLE_USER_OPTS throws an exception, the .ssh directory won't be configured for ec2-user, so I understand what's happening.

What I don't understand is why this is happening. Has anyone else experienced issues with cloud-init on recently created EC2 instances under ECS, and found a workaround?

For reference, we are using AMI amzn2-ami-ecs-hvm-2.0.20190815-x86_64-ebs (ami-0b16d80945b1a9c7d) in us-east-1, and we certainly had not seen these issues as far back as August 15. I assume that some cloud-init change that the instance picks up via a yum update explains both the new behavior and the changed write dates of the compiled Python modules in cloud-init.

I should also add that the EC2 instance I spun up to mount the root volume of the ECS-created instance has subtly different cloud-init code. In particular, its cc_ssh.py module doesn't refer to ssh_util.DISABLE_USER_OPTS but rather to a local DISABLE_ROOT_OPTS variable. So this is all suspicious.


2 Answers

4
votes

I have diagnosed this problem in a specific AWS deployment on an Amazon Linux 2 AMI. The root cause is a yum update that upgrades cloud-init itself. The update is run from user_data, which cloud-init executes during EC2 instance startup.

The user_data associated with an ECS launch_configuration is executed by cloud-init. Our user_data initialization code included a "yum update". Amazon has released a new version of cloud-init, 18.5-2amzn2, which is not yet included in the AMI images (they ship cloud-init version 18.2-72-amzn2.07). The yum update therefore upgrades cloud-init to 18.5-2amzn2. Analysis of the Python code for the 18.5-2amzn2 version indicates that it includes a commit (https://github.com/number5/cloud-init/commit/757247f9ff2df57e792e29d8656ac415364e914d) which adds an attribute to ssh_util that is not present in the prior version. Ordinarily, yum would produce a consistent cloud-init installation, as I verified on a standalone EC2 instance. However, because the update happens inside cloud-init while it is already running, the results are inconsistent. The ssh_util module is apparently not updated for the running cloud-init, so it can't provide the "DISABLE_USER_OPTS" value that was added in that commit.
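
To illustrate the kind of inconsistency I mean, here is a minimal sketch that does not touch cloud-init at all. It uses two throwaway modules (demo_ssh_util and demo_cc_ssh, names I made up) in a temp directory: the helper is imported first, the files are then "upgraded" on disk, and a module loaded afterwards expects an attribute that the already-loaded helper does not have. Whether cloud-init hits exactly this import order is an assumption on my part.

```python
import importlib
import os
import shutil
import sys
import tempfile


def write_module(directory, name, source):
    """Write a module file and make sure the import system notices it."""
    with open(os.path.join(directory, name + ".py"), "w") as handle:
        handle.write(source)
    importlib.invalidate_caches()


def simulate_upgrade_while_running():
    """Return (error message before reload, value after reload)."""
    workdir = tempfile.mkdtemp()
    sys.path.insert(0, workdir)
    try:
        # "Old package": the helper lacks the new constant and gets imported.
        write_module(workdir, "demo_ssh_util", "DISABLE_ROOT_OPTS = 'old'\n")
        helper = importlib.import_module("demo_ssh_util")

        # "Upgrade" on disk while the process keeps running.
        write_module(
            workdir,
            "demo_ssh_util",
            "DISABLE_ROOT_OPTS = 'old'\nDISABLE_USER_OPTS = 'new'\n",
        )
        write_module(
            workdir,
            "demo_cc_ssh",
            "import demo_ssh_util\n\n"
            "def handle():\n"
            "    return demo_ssh_util.DISABLE_USER_OPTS\n",
        )

        # The new caller is read from disk, but the helper comes from sys.modules.
        caller = importlib.import_module("demo_cc_ssh")
        try:
            caller.handle()
            error = None
        except AttributeError as exc:
            error = str(exc)

        # Re-reading the helper from disk (as a fresh process would) fixes it.
        importlib.reload(helper)
        return error, caller.handle()
    finally:
        sys.path.remove(workdir)
        sys.modules.pop("demo_ssh_util", None)
        sys.modules.pop("demo_cc_ssh", None)
        shutil.rmtree(workdir, ignore_errors=True)


error, value = simulate_upgrade_while_running()
print("before reload:", error)
print("after reload:", value)
```

The point of the sketch is only that the files on disk can look perfectly consistent while a running interpreter still holds the old module, which would match what the question describes.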

2
votes

So, the problem was indeed the yum update command invoked from within cloud-init, which was updating cloud-init itself while it was in use.

I should point out that we were using Amazon EFS on our nodes, and were following the exact instructions that Amazon gives on their help page for using EFS with ECS, which include the yum update call in the user data script.
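
One possible workaround, which I have not verified on this AMI, is to keep the yum update but exclude cloud-init from it using yum's --exclude option. Below is a small sketch that renders such a user data script; build_user_data and its parameters are hypothetical names, and the extra commands are placeholders for whatever your own script does.

```python
def build_user_data(extra_commands=(), exclude="cloud-init*"):
    """Render a bash user data script whose yum update skips the given packages."""
    lines = [
        "#!/bin/bash",
        # Assumption: leaving cloud-init alone avoids upgrading it mid-run.
        "yum update -y --exclude='{}'".format(exclude),
    ]
    lines.extend(extra_commands)
    return "\n".join(lines) + "\n"


print(build_user_data(extra_commands=["echo 'rest of the setup goes here'"]))
```

Dropping the yum update from the user data altogether would avoid the problem as well, at the cost of not picking up other package updates at launch.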