I am encountering the following behavior with Azure Batch. I am using Batch Shipyard to start a pool of 500 low-priority nodes to process a list of 400,000 tasks. The pool size is managed using auto-scaling.

At first, the pool seems to be running just fine. The number of nodes increases to maximum capacity and the tasks complete as expected. However, after some time (once a sizable number of tasks have completed), I start to encounter 'start task failed' errors. The pool then quickly degrades until all nodes have failed with this same error.

This is the error I get in the stdout.txt file of one of the crashed nodes:

Login Succeeded
2020-03-04T09:09:07UTC - INFO - Docker registry logins completed.
2020-03-04T09:09:07UTC - WARNING - No Singularity registry servers found.
2020-03-04T09:13:37,840996225+00:00 - ERROR - Cascade Docker exited with non-zero exit code: 1

This seems to be an issue related to pulling the Docker image, although the same pull worked without issue on other nodes earlier in the run.

I am aware that this is not a lot of information to go on, but I am having trouble figuring out what information is relevant and what's not.

UPDATE

After updating to shipyard 3.9.1, this is the output in stdout.txt for one of the crashed nodes (start task failed):

2020-03-05T08:23:43,784166638+00:00 - DEBUG - Pulling Docker Image: mcr.microsoft.com/azure-batch/shipyard:3.9.1-cargo (fallback: 0)
2020-03-05T08:23:58,876629647+00:00 - ERROR - Error response from daemon: Get https://mcr.microsoft.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-03-05T08:23:58,878254953+00:00 - ERROR - No fallback registry specified, terminating
1

1 Answer

Please see the GitHub issue https://github.com/Azure/batch-shipyard/issues/340. You will likely need to upgrade your Batch Shipyard version and recreate your pool.
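
To check whether every failed node is hitting the same registry timeout, it can help to filter each node's start task stdout.txt down to its ERROR lines. Below is a minimal sketch in Python, assuming the `<timestamp> - <LEVEL> - <message>` line shape seen in the excerpts in the question. The helper names `extract_errors` and `is_registry_timeout` are hypothetical and are not part of Batch Shipyard.

```python
import re

# Assumed line shape, taken from the log excerpts in the question:
#   <timestamp> - <LEVEL> - <message>
LOG_LINE = re.compile(r"^(?P<ts>\S+) - (?P<level>[A-Z]+) - (?P<msg>.*)$")


def extract_errors(text):
    """Return (timestamp, message) pairs for every ERROR line in a stdout.txt dump."""
    errors = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            errors.append((match.group("ts"), match.group("msg")))
    return errors


def is_registry_timeout(message):
    """Heuristic check for the registry connection timeout shown in the question."""
    return (
        "Client.Timeout exceeded" in message
        or "request canceled while waiting for connection" in message
    )
```

Running `extract_errors` over the stdout.txt of several failed nodes and counting how many messages satisfy `is_registry_timeout` should show whether the failures share one cause (the image pull from the registry timing out) or whether some nodes fail for a different reason.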