1 vote

I have an Azure Batch pool with three blob storage containers mounted. It works; however, when the nodes boot for the first time they get the following error:

Mount configuration error

Looking at the logs, it seems the nodes have trouble installing the blobfuse package. This error message appears repeatedly:

2020-03-11T09:15:48,654579941+00:00 - INFO: Downloading: https://packages.microsoft.com/keys/microsoft.asc as microsoft.asc
2020-03-11T09:15:48,770319520+00:00 - INFO: Downloading: https://packages.microsoft.com/config/ubuntu/16.04/prod.list as /etc/apt/sources.list.d/microsoft-prod.list
Hit:1 http://azure.archive.ubuntu.com/ubuntu xenial InRelease
Hit:2 http://azure.archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:3 http://azure.archive.ubuntu.com/ubuntu xenial-backports InRelease
Hit:4 http://security.ubuntu.com/ubuntu xenial-security InRelease
Get:5 https://packages.microsoft.com/ubuntu/16.04/prod xenial InRelease [4,002 B]
Get:6 https://packages.microsoft.com/ubuntu/16.04/prod xenial/main amd64 Packages [124 kB]
Fetched 128 kB in 0s (383 kB/s)
Reading package lists...
E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
...

2020-03-11T09:16:53,361634408+00:00 - ERROR: Could not install packages (apt): blobfuse

The nodes then go into the Unusable state until I manually reboot them, which "fixes" the problem; after the reboot the node starts working on tasks.

The task should be running with elevated privileges:

UserIdentity = new UserIdentity(
    new AutoUserSpecification(
        elevationLevel: ElevationLevel.Admin,
        scope: AutoUserScope.Pool
    )
),

Update 1

I was not able to solve this problem, so I decided to work around it. Resizing or recreating the pool did not help. Instead, I installed blobfuse in the Docker image and mounted the blob storage containers in the task itself, as sketched below. This works just fine.
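For reference, here is a minimal sketch of that workaround, assuming blobfuse and a fuse_connection.cfg have been baked into the Docker image; the image name, paths, task id, and the run-workload command are all hypothetical placeholders:

// Sketch only: assumes the image already contains blobfuse and a
// /app/fuse_connection.cfg (hypothetical paths). The task mounts the
// blob container itself before running the actual workload.
var task = new CloudTask(
    "my-task",  // hypothetical task id
    "/bin/bash -c 'mkdir -p /mnt/input " +
    "&& blobfuse /mnt/input --tmp-path=/mnt/blobfusetmp --config-file=/app/fuse_connection.cfg " +
    "&& run-workload'")  // hypothetical workload command
{
    // FUSE mounts inside a container need extra privileges.
    ContainerSettings = new TaskContainerSettings("myregistry/myimage:latest", "--privileged"),
    UserIdentity = new UserIdentity(
        new AutoUserSpecification(
            elevationLevel: ElevationLevel.Admin,
            scope: AutoUserScope.Pool)),
};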

Do you need root access to install those packages? – Aravind
@Aravind Possibly. The task itself is running as admin, but this process runs on the pool itself when new nodes are joining it. I don't have much control over the process. All I have done is add the mount configurations, and I thought Azure was supposed to handle the rest. – niknoe
+1: This looks like an installation issue with the blobfuse package. Try resizing your pool down to zero and then scaling back up; the newly joining VMs should trigger a fresh install when they join the pool. This should help. – Tats_innit
Hiya @NiklasNoem, good approach, but keep in mind that the blobfuse driver has an existing bug where it hangs after 65 hours; details here: github.com/Azure/azure-storage-fuse/issues/329. Also, if Docker is not necessary, you can always do this via a script at the start-task level for normal Batch nodes. – Tats_innit
@Tats_innit Thank you for the tip. That's probably longer than any of the tasks will run, and the pool is auto-scaling so it will reconnect. Thanks again for the help. – niknoe

2 Answers

3 votes

Your approach looks good, and I'm glad the reboot fixed it; in this specific case that is the right fix, along with a resize.

Thanks for sharing the logs. This looks like a failure to install the blobfuse package.

The big giveaway is **ERROR: Could not install packages (apt): blobfuse**. Under the hood, the node expects blobfuse to be installed, but some other process is holding a long-running apt install in parallel. The cause of this error is detailed here: E: Could not get lock /var/lib/dpkg/lock-frontend - open.

2 Possible Solutions

  • As you mention, a reboot fixed it.
  • Another option is to resize the pool, or better: recreate the pool and try again. A sketch of automating this follows below.

Why the reboot or resize fixes it: in both cases the VM re-invokes the pool-join process on the Batch side with fresh state, which unblocks the lock scenario for blobfuse. At the Batch-node level you could also try some sort of back-off mechanism. I would also keep an eye on blobfuse in case something within it caused this: https://github.com/Azure/azure-storage-fuse
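If stepping in manually becomes tedious, something along these lines could automate the fix (a sketch using the Microsoft.Azure.Batch .NET SDK; batchClient and poolId are assumed to already exist):

using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Common;

// Sketch: reboot every node that came up Unusable (e.g. because the
// blobfuse install failed); on boot the node retries the mount.
foreach (ComputeNode node in batchClient.PoolOperations.ListComputeNodes(poolId))
{
    if (node.State == ComputeNodeState.Unusable)
    {
        node.Reboot();
    }
}

// Alternatively, resize the pool to zero and back up so that fresh VMs
// run the pool-join (and the blobfuse install) from scratch.
batchClient.PoolOperations.ResizePool(poolId, targetDedicatedComputeNodes: 0);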

Hope this helps.

1 vote

I had exactly the same issue, and since I'm creating dynamic pools and tasks, manually stepping in and rebooting was not an option for me.

My workaround was to link the Batch account to the storage account and then specify the resource files as part of the task. This makes the container's contents visible in the working directory of the task.

Here is my example in NodeJS; it should be transferable to your language of choice.

const task = {
    id: taskId,
    commandLine: "",
    containerSettings: taskContainerSettings,
    environmentSettings: [
        { name: "USER_ID", value: userId },
        { name: "RUN_ID", value: runId }
    ],
    // Pulls the linked auto-storage container into the task's working directory
    resourceFiles: [{ autoStorageContainerName: userId }]
};