2
votes

We've recently reconfigured our build process to run entirely in containers, and we're now looking to migrate away from on-premise build agents to using agents in an Azure Scale Set.

We want to avoid having to maintain our own VM images for the Azure Scale Set, and have opted to use the default Ubuntu 18.04 LTS image which is available in Azure.

This image does not include Docker, so we've configured the Azure Scale Set to use a cloud-config script which will install Docker when the VM first boots:

#cloud-config

apt:
  sources:
    docker.list:
      source: deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable
      keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88

packages:
  - docker-ce
  - docker-ce-cli

groups:
  - docker

This seems to work well, but sometimes the build jobs fail:

Starting: Initialize containers
/usr/bin/docker version --format '{{.Server.APIVersion}}'
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
'
##[error]Exit code 1 returned from process: file name '/usr/bin/docker', arguments 'version --format '{{.Server.APIVersion}}''.
Finishing: Initialize containers

It looks like either the cloud-init script failed, or the Azure DevOps agent started on the VM before the cloud-init script completed.

So far, I've seen the following scenarios:

  • Provisioning a new VM works fine, and the jobs run correctly
  • The first few jobs fail on a newly provisioned VM, and later jobs run correctly. (Perhaps because the cloud-init script ran in parallel with the Azure DevOps extension that deploys the agent to the VM, causing a race condition?)
  • All jobs fail, even after, say, 30 minutes. Sometimes reimaging the VM helps, sometimes it does not.

Does anyone have a similar setup? Does it work properly? If not, what are alternative ways to deploy Docker to the VMs before the VM runs a container job?

1
What do you mean by "sometimes"? Was the VM running when you ran the pipeline? Although this issue seems more related to the setup of your VM, could you share some details about your pipeline? – LoLance
"Sometimes", as in, roughly 1 out of every 3 jobs. This is a scale set agent, so there may not have been any VM at all when the job was triggered. But the pipeline did start on an agent (which may have been provisioned on demand), and the error message was generated by code running on an agent. The job itself is a container job, and the failure occurs in the 'Initialize containers' step. After that, the pipeline contains a bunch of shell scripts, which are never executed, because the container job did not start successfully (because Docker was somehow not installed correctly or not installed in time). – Frederik Carlier
The Initialize containers step fails because your VM was not started correctly, so Azure DevOps failed at Initialize containers with the error "Is the docker daemon running?". The cause of this issue seems to be on the Azure VM side, which I'm not familiar with... – LoLance
Yes, it looks like this is a bug/limitation of the auto-scaling agent: github.com/microsoft/azure-pipelines-agent/issues/2866, github.com/Azure/WALinuxAgent/issues/1938 – Frederik Carlier

1 Answer

5
votes

When you configure an Azure DevOps agent pool to use an Azure Scale Set to provision build machines, the Microsoft.Azure.DevOps.Pipelines.Agent/TeamServicesAgentLinux extension is automatically added to your scale set.

This extension is responsible for installing the Azure DevOps agent on your VMs and adding it to your agent pool.

The extension runs when the VM boots, at about the same time as the cloud-init script. This can cause race conditions.
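
One way to confirm this race on an affected VM is to check, while a job is failing, whether cloud-init has finished and whether the Docker daemon responds. The following is only a rough sketch: the function name check_boot_state is made up for illustration, and it assumes the `cloud-init status` subcommand is available on the image.

```shell
#!/bin/sh
# Hypothetical helper: reports whether cloud-init has finished and
# whether the Docker daemon answers on this machine.
check_boot_state() {
    # `cloud-init status` prints a line such as "status: running" or "status: done".
    if cloud-init status 2>/dev/null | grep -q 'status: done'; then
        echo "cloud-init: done"
    else
        echo "cloud-init: not finished (or failed)"
    fi

    # Same command the 'Initialize containers' step runs.
    if docker version --format '{{.Server.APIVersion}}' >/dev/null 2>&1; then
        echo "docker: reachable"
    else
        echo "docker: not reachable"
    fi
}

check_boot_state
```

If the agent is already picking up jobs while this still reports that cloud-init has not finished, the agent most likely started before the packages were installed.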

To work around this, add a bootcmd section to your cloud-config script which forces the walinuxagent service (which launches the Azure DevOps extension) to start only after the cloud-config script has completed, like this:

#cloud-config

bootcmd:
  - mkdir -p /etc/systemd/system/walinuxagent.service.d
  - echo "[Unit]\nAfter=cloud-final.service" > /etc/systemd/system/walinuxagent.service.d/override.conf
  - sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
  - systemctl daemon-reload

apt:
  sources:
    docker.list:
      source: deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable
      keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88

packages:
  - docker-ce
  - docker-ce-cli

groups:
  - docker

This allows you to create an Azure DevOps scale set agent pool which uses the standard Ubuntu 18.04 image and installs Docker on top of that image.
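
To check on a provisioned VM that the ordering override was actually written, something like the sketch below may help. The function name override_present and the DROPIN_DIR variable are made up for illustration; the path is the one used by the bootcmd above. The exact-line match also catches the case where the `\n` in the echo command was not expanded into a newline, which depends on the shell that runs bootcmd (assumed here to be dash, as on stock Ubuntu).

```shell
#!/bin/sh
# Drop-in directory written by the bootcmd section.
DROPIN_DIR="${DROPIN_DIR:-/etc/systemd/system/walinuxagent.service.d}"

# Hypothetical helper: succeeds if override.conf in the given directory
# contains the ordering directive on a line of its own.
override_present() {
    grep -qx 'After=cloud-final.service' "$1/override.conf" 2>/dev/null
}

if override_present "$DROPIN_DIR"; then
    echo "override present"
    # Show the ordering systemd resolved (only meaningful on the VM itself).
    systemctl show -p After walinuxagent.service 2>/dev/null || true
else
    echo "override missing"
fi
```

If the override is present, cloud-final.service should appear in the After= list printed by systemctl.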

See https://github.com/microsoft/azure-pipelines-agent/issues/2866 and https://github.com/Azure/WALinuxAgent/issues/1938#issuecomment-657293920 for more background.