I am struggling to get my Azure batch nodes to start within a Pool that is configured to use a virtual network. The virtual network has been configured with a service endpoint policy that has a "Microsoft.Storage" policy definition and it points at a single storage account. Without the service endpoints defined on the virtual network the Azure batch pool works as expected, but with it the following error occurs and the node never starts.
I have tried creating the Batch account in both Pool allocation modes. This did not seem to make a difference, the pool resizes successfully and then the nodes are stuck in "Starting" mode. In the "User Subscription" mode I found the start-up error because I can see the VM instance in my account:
VM has reported a failure when processing extension 'batchNodeExtension'. Error message: "Enable failed: processing file downloads failed: failed to download file[0]: failed to download file: unexpected status code: actual=403 expected=200" More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot
From what I can determine this is an Azure VM extension that is running to configure the VM for Azure Batch. My base image is Canonical, ubuntuserver, 18.04-lts (batch.node.ubuntu 18.04). I can see that the extensions is attempting to download from:
https://a52a7f3c745c443e8c2cac69.blob.core.windows.net/nodeagentpackage-version9-22-0-2/Ubuntu-18.04/batch_init-ubuntu-18.04-1.8.7.tar.gz (note I removed the SAS token from this URL for posting here)
there are 8 further files that are downloaded and it looks like this is configuring the Batch agent on the node.
The 403 error indicates that the node cannot connect to this storage account, which makes sense given the service endpoint policy. It does not include this storage account within it and this storage account is external to my Azure subscription. I thought that I might be able to add it to the service endpoint policy, but I have no way of determining what Azure subscription it is part of it. If I knew this I thought I could add it like:
Endpoint policy allows you to add specific Azure Storage accounts to allow list, using the resourceID format. You can restrict access to all storage accounts in a subscription E.g. /subscriptions/subscriptionId (from https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoint-policies-overview)
I tried adding security group rules using service tags for Azure storage, but this did not help. The node still cannot connect and this makes sense given the description of service endpoint policies.
The reason for my interest in this is the following post: [https://github.com/Azure/Batch/issues/66][1]
I am trying to minimise the bandwidth charges from my storage account by using service endpoints.
I have also tried to create my own VM, but I am not sure whether the "batchNodeExtension" script is run automatically for VMs that you're using with Batch.
I would really appreciate any pointers because I am running out of ideas to try!