
I am trying to parallelize a Python app using Azure Batch. The workflow that I have followed in the Python client-side script is:

1) Upload local files to an Azure Blob container using the blobxfer utility (input-container).

2) Start the Batch service to process the files in the input-container, after logging in with the service principal account via azure-cli.

3) Upload the files to the output-container through the Python app distributed across the nodes by Azure Batch.

I am experiencing a problem very similar to the one I read about here, but unfortunately no solution was given in that post: Nodes go into Unusable State

I will now give the relevant information so that one can reproduce this error:

The image that was used for Azure Batch is custom.

1) Ubuntu Server 18.04 LTS was chosen as the OS for the VM, and the following ports were opened: ssh, http, https. The rest of the settings were kept at their defaults in the Azure portal.

2) The following script was run once the server was available.

sudo apt-get install build-essential checkinstall -y
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev \
libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev -y
cd /usr/src
sudo wget https://www.python.org/ftp/python/3.6.6/Python-3.6.6.tgz
sudo tar xzf Python-3.6.6.tgz
cd Python-3.6.6
sudo ./configure --enable-optimizations
sudo make altinstall
sudo pip3.6 install --upgrade pip
sudo pip3.6 install pymupdf==1.13.20
sudo pip3.6 install tqdm==4.19.9
sudo pip3.6 install sentry-sdk==0.4.1
sudo pip3.6 install blobxfer==1.5.0
sudo pip3.6 install azure-cli==2.0.47

3) An image of this server was created using the process outlined in this link: Creating VM Image in Azure Linux. During deprovisioning, the user was not deleted: sudo waagent -deprovision

4) The Resource Id of the image was noted from the Azure portal. This is supplied as one of the parameters in the Python client-side script.

The following packages were installed on the client-side server where the Python script for Batch would run:

sudo pip3.6 install tqdm==4.19.9
sudo pip3.6 install sentry-sdk==0.4.1
sudo pip3.6 install blobxfer==1.5.0
sudo pip3.6 install azure-cli==2.0.47
sudo pip3.6 install pandas==0.22.0

The Resources used during Azure Batch were created in the following way:

1) A service principal account with contributor privileges was created using this cmd:

$ az ad sp create-for-rbac --name <SERVICE-PRINCIPAL-ACCOUNT>

2) The Resource Group, the Batch account, and the storage account associated with the Batch account were created in the following way:

$ az group create --name <RESOURCE-GROUP-NAME> --location eastus2
$ az storage account create --resource-group <RESOURCE-GROUP-NAME> --name <STORAGE-ACCOUNT-NAME> --location eastus2 --sku Standard_LRS
$ az batch account create --name <BATCH-ACCOUNT-NAME> --storage-account <STORAGE-ACCOUNT-NAME> --resource-group <RESOURCE-GROUP-NAME> --location eastus2

The client-side Python script which initiates the upload and processing: (Update 3)

import subprocess
import os
import time
import datetime
import tqdm
import pandas
import sys
import fitz
import parmap
import numpy as np
import sentry_sdk
import multiprocessing as mp


def batch_upload_local_to_azure_blob(azure_username,azure_password,azure_tenant,azure_storage_account,azure_storage_account_key,log_dir_path):
    try:
        subprocess.check_output(["az","login","--service-principal","--username",azure_username,"--password",azure_password,"--tenant",azure_tenant])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Invalid Azure Login Credentials")
        sys.exit("Invalid Azure Login Credentials")
    dir_flag=False
    while dir_flag==False:
        try:
            no_of_dir=input("Enter the number of directories to upload:")
            no_of_dir=int(no_of_dir)
            if no_of_dir<0:
                print("\nRetry:Enter an integer value")   
            else: 
                dir_flag=True
        except ValueError:
            print("\nRetry:Enter an integer value")
    dir_path_list=[]
    for dir in range(no_of_dir):
        path_exists=False
        while path_exists==False:
            dir_path=input("\nEnter the local absolute path of the directory no.{}:".format(dir+1))
            print("\n")
            dir_path=dir_path.replace('"',"")
            path_exists=os.path.isdir(dir_path)
            if path_exists==True:
                dir_path_list.append(dir_path)
            else:
                print("\nRetry:Enter a valid directory path")
    timestamp = time.time()
    timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
    input_azure_container="pdf-processing-input"+"-"+timestamp_humanreadable
    try:
        subprocess.check_output(["az","storage","container","create","--name",input_azure_container,"--account-name",azure_storage_account,"--auth-mode","login","--fail-on-exist"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Invalid Azure Storage Credentials.")
        sys.exit("Invalid Azure Storage Credentials.")
    log_file_path=os.path.join(log_dir_path,"upload-logs"+"-"+timestamp_humanreadable+".txt")
    dir_upload_success=[]
    dir_upload_failure=[]
    for dir in tqdm.tqdm(dir_path_list,desc="Uploading Directories"):
        try:
            subprocess.check_output(["blobxfer","upload","--remote-path",input_azure_container,"--storage-account",azure_storage_account,\
            "--enable-azure-storage-logger","--log-file",\
            log_file_path,"--storage-account-key",azure_storage_account_key,"--local-path",dir]) 
            dir_upload_success.append(dir)
        except subprocess.CalledProcessError:
            sentry_sdk.capture_message("Failed to upload directory: {}".format(dir))
            dir_upload_failure.append(dir)
    return(input_azure_container)

def query_azure_storage(azure_storage_container,azure_storage_account,azure_storage_account_key,blob_file_path):
    try:
        blob_list=subprocess.check_output(["az","storage","blob","list","--container-name",azure_storage_container,\
        "--account-key",azure_storage_account_key,"--account-name",azure_storage_account,"--auth-mode","login","--output","tsv"])
        blob_list=blob_list.decode("utf-8")
        with open(blob_file_path,"w") as f:
            f.write(blob_list)
        blob_df=pandas.read_csv(blob_file_path,sep="\t",header=None)
        blob_df=blob_df.iloc[:,3]
        blob_df=blob_df.to_frame(name="container_files")
        blob_df=blob_df.assign(container=azure_storage_container)
        return(blob_df)
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Invalid Azure Storage Credentials")
        sys.exit("Invalid Azure Storage Credentials.")

def analyze_files_for_tasks(data_split,azure_storage_container,azure_storage_account,azure_storage_account_key,download_folder):
    try:
        blob_df=data_split
        some_calculation_factor=2
        analyzed_azure_blob_df=pandas.DataFrame()
        analyzed_azure_blob_df=analyzed_azure_blob_df.assign(container="empty",container_files="empty",pages="empty",max_time="empty")
        for index,row in blob_df.iterrows():
            file_to_analyze=os.path.join(download_folder,row["container_files"])
            subprocess.check_output(["az","storage","blob","download","--container-name",azure_storage_container,"--file",file_to_analyze,"--name",row["container_files"],\
            "--account-name",azure_storage_account,"--auth-mode","key"])        #Why does login auth not work for this while we are multiprocessing
            doc=fitz.open(file_to_analyze)
            page_count=doc.pageCount
            analyzed_azure_blob_df=analyzed_azure_blob_df.append([{"container":azure_storage_container,"container_files":row["container_files"],"pages":page_count,"max_time":some_calculation_factor*page_count}])
            doc.close()
            os.remove(file_to_analyze)
        return(analyzed_azure_blob_df)
    except Exception as e:
        sentry_sdk.capture_exception(e)


def estimate_task_completion_time(azure_storage_container,azure_storage_account,azure_storage_account_key,azure_blob_df,azure_blob_downloads_file_path):
    try: 
        cores=mp.cpu_count()                                           #Number of CPU cores on your system
        partitions = cores-2  
        timestamp = time.time()
        timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
        file_download_location=os.path.join(azure_blob_downloads_file_path,"Blob_Download"+"-"+timestamp_humanreadable)
        os.mkdir(file_download_location)
        data_split = np.array_split(azure_blob_df,indices_or_sections=partitions,axis=0)
        analyzed_azure_blob_df=pandas.concat(parmap.map(analyze_files_for_tasks,data_split,azure_storage_container,azure_storage_account,azure_storage_account_key,file_download_location,\
        pm_pbar=True,pm_processes=partitions))
        analyzed_azure_blob_df=analyzed_azure_blob_df.reset_index(drop=True)
        return(analyzed_azure_blob_df)
    except Exception as e:
        sentry_sdk.capture_exception(e)
        sys.exit("Unable to Estimate Job Completion Status")

def azure_batch_create_pool(azure_storage_container,azure_resource_group,azure_batch_account,azure_batch_account_endpoint,azure_batch_account_key,vm_image_name,no_nodes,vm_compute_size,analyzed_azure_blob_df):
    timestamp = time.time()
    timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
    pool_id="pdf-processing"+"-"+timestamp_humanreadable
    try:
        subprocess.check_output(["az","batch","account","login","--name", azure_batch_account,"--resource-group",azure_resource_group])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to log into the Batch account")
        sys.exit("Unable to log into the Batch account")
    #Pool autoscaling formula would go in here
    try:
        subprocess.check_output(["az","batch","pool","create","--account-endpoint",azure_batch_account_endpoint, \
        "--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--id",pool_id,\
        "--node-agent-sku-id","batch.node.ubuntu 18.04",\
        "--image",vm_image_name,"--target-low-priority-nodes",str(no_nodes),"--vm-size",vm_compute_size])
        return(pool_id)
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to create a Pool corresponding to Container:{}".format(azure_storage_container))
        sys.exit("Unable to create a Pool corresponding to Container:{}".format(azure_storage_container))

def azure_batch_create_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info):
    timestamp = time.time()
    timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
    job_id="pdf-processing-job"+"-"+timestamp_humanreadable
    try:
    subprocess.check_output(["az","batch","job","create","--account-endpoint",azure_batch_account_endpoint,"--account-key",\
    azure_batch_account_key,"--account-name",azure_batch_account,"--id",job_id,"--pool-id",pool_info])
    return(job_id)
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to create a Job on the Pool :{}".format(pool_info))
        sys.exit("Unable to create a Job on the Pool :{}".format(pool_info))

def azure_batch_create_task(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info,job_info,azure_storage_account,azure_storage_account_key,azure_storage_container,analyzed_azure_blob_df):
    print("\n")
    for i in tqdm.tqdm(range(180),desc="Waiting for the Pool to Warm-up"):
        time.sleep(1)
    successful_task_list=[]
    unsuccessful_task_list=[]
    input_azure_container=azure_storage_container 
    output_azure_container= "pdf-processing-output"+"-"+input_azure_container.split("-input-")[-1]
    try:
        subprocess.check_output(["az","storage","container","create","--name",output_azure_container,"--account-name",azure_storage_account,"--auth-mode","login","--fail-on-exist"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to create an output container")
        sys.exit("Unable to create an output container")
    print("\n")
    pbar = tqdm.tqdm(total=analyzed_azure_blob_df.shape[0],desc="Creating and distributing Tasks")
    for index,row in analyzed_azure_blob_df.iterrows():
        try:
            task_info="mytask-"+str(index)
            subprocess.check_output(["az","batch","task","create","--task-id",task_info,"--job-id",job_info,"--command-line",\
            "python3 /home/avadhut/pdf_processing.py {} {} {}".format(input_azure_container,output_azure_container,row["container_files"])])
            pbar.update(1)
        except subprocess.CalledProcessError:
            sentry_sdk.capture_message("unable to create the Task: mytask-{}".format(i))
            pbar.update(1)
    pbar.close()

def wait_for_tasks_to_complete(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info,task_file_path,analyzed_azure_blob_df):
        try:
            print(analyzed_azure_blob_df)
            nrows_tasks_df=analyzed_azure_blob_df.shape[0]
            print("\n")
            pbar=tqdm.tqdm(total=nrows_tasks_df,desc="Waiting for task to complete")
            for index,row in analyzed_azure_blob_df.iterrows():
                task_list=subprocess.check_output(["az","batch","task","list","--job-id",job_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,\
                "--output","tsv"])
                task_list=task_list.decode("utf-8")
                with open(task_file_path,"w") as f:
                    f.write(task_list)
                task_df=pandas.read_csv(task_file_path,sep="\t",header=None)
                task_df=task_df.iloc[:,21]
                active_task_list=[]
                for x in task_df:
                    if x =="active":
                        active_task_list.append(x)
                if len(active_task_list)>0:
                    time.sleep(row["max_time"])  #This time can be changed in accordance with the time taken to complete each task
                    pbar.update(1)
                    continue
                else:
                    pbar.close()
                    return("success")
            pbar.close()
            return("failure")
        except subprocess.CalledProcessError:
            sentry_sdk.capture_message("Error in retrieving task status")

def azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info):
    try:
        subprocess.check_output(["az","batch","job","delete","--job-id",job_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--yes"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to delete Job-{}".format(job_info))

def azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info):
    try:
        subprocess.check_output(["az","batch","pool","delete","--pool-id",pool_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--yes"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to delete Pool--{}".format(pool_info))

if __name__=="__main__":
    print("\n")
    print("-"*40+"Azure Batch processing POC"+"-"*40)
    print("\n")

    #Credentials and initializations
    sentry_sdk.init(<SENTRY-CREDENTIALS>) #Sign up for a Sentry trial account
    azure_username=<AZURE-USERNAME>
    azure_password=<AZURE-PASSWORD>
    azure_tenant=<AZURE-TENANT>
    azure_resource_group=<RESOURCE-GROUP-NAME>
    azure_storage_account=<STORAGE-ACCOUNT-NAME>
    azure_storage_account_key=<STORAGE-KEY>
    azure_batch_account_endpoint=<BATCH-ENDPOINT>
    azure_batch_account_key=<BATCH-ACCOUNT-KEY>
    azure_batch_account=<BATCH-ACCOUNT-NAME>
    vm_image_name=<VM-IMAGE>
    vm_compute_size="Standard_A4_v2"
    no_nodes=2
    log_dir_path="/home/user/azure_batch_upload_logs/"
    azure_blob_downloads_file_path="/home/user/blob_downloads/"
    blob_file_path="/home/user/azure_batch_upload.tsv"
    task_file_path="/home/user/azure_task_list.tsv"


    input_azure_container=batch_upload_local_to_azure_blob(azure_username,azure_password,azure_tenant,azure_storage_account,azure_storage_account_key,log_dir_path)

    azure_blob_df=query_azure_storage(input_azure_container,azure_storage_account,azure_storage_account_key,blob_file_path)

    analyzed_azure_blob_df=estimate_task_completion_time(input_azure_container,azure_storage_account,azure_storage_account_key,azure_blob_df,azure_blob_downloads_file_path)

    pool_info=azure_batch_create_pool(input_azure_container,azure_resource_group,azure_batch_account,azure_batch_account_endpoint,azure_batch_account_key,vm_image_name,no_nodes,vm_compute_size,analyzed_azure_blob_df)

    job_info=azure_batch_create_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)

    azure_batch_create_task(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info,job_info,azure_storage_account,azure_storage_account_key,input_azure_container,analyzed_azure_blob_df)

    task_status=wait_for_tasks_to_complete(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info,task_file_path,analyzed_azure_blob_df)

    if task_status=="success":
        azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info)
        azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
        print("\n\n")
        sys.exit("Job Complete")
    else:
        azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info)
        azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
        print("\n\n")
        sys.exit("Job Unsuccessful")

cmd used to create the zip file:

zip pdf_process_1.zip pdf_processing.py

The Python app that was packaged in the zip file and uploaded to Batch through the client-side script:

(Update 3)

import os
import fitz
import subprocess
import argparse
import time
from tqdm import tqdm
import sentry_sdk
import sys
import datetime

def azure_active_directory_login(azure_username,azure_password,azure_tenant):
    try:
        azure_login_output=subprocess.check_output(["az","login","--service-principal","--username",azure_username,"--password",azure_password,"--tenant",azure_tenant])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Invalid Azure Login Credentials")
        sys.exit("Invalid Azure Login Credentials")

def download_from_azure_blob(azure_storage_account,azure_storage_account_key,input_azure_container,file_to_process,pdf_docs_path):
    file_to_download=os.path.join(input_azure_container,file_to_process)
    try:
        subprocess.check_output(["az","storage","blob","download","--container-name",input_azure_container,"--file",os.path.join(pdf_docs_path,file_to_process),"--name",file_to_process,"--account-key",azure_storage_account_key,\
        "--account-name",azure_storage_account,"--auth-mode","login"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("unable to download the pdf file")
        sys.exit("unable to download the pdf file")

def pdf_to_png(input_folder_path,output_folder_path):
    pdf_files=[x for x in os.listdir(input_folder_path) if x.endswith((".pdf",".PDF"))]
    pdf_files.sort()
    for pdf in tqdm(pdf_files,desc="pdf--->png"):
        doc=fitz.open(os.path.join(input_folder_path,pdf))
        page_count=doc.pageCount
        for f in range(page_count):
            page=doc.loadPage(f)
            pix = page.getPixmap()
            if pdf.endswith(".pdf"):
                png_filename=pdf.split(".pdf")[0]+"___"+"page---"+str(f)+".png"
                pix.writePNG(os.path.join(output_folder_path,png_filename))
            elif pdf.endswith(".PDF"):
                png_filename=pdf.split(".PDF")[0]+"___"+"page---"+str(f)+".png"
                pix.writePNG(os.path.join(output_folder_path,png_filename))


def upload_to_azure_blob(azure_storage_account,azure_storage_account_key,output_azure_container,png_docs_path):
    try:
        subprocess.check_output(["az","storage","blob","upload-batch","--destination",output_azure_container,"--source",png_docs_path,"--account-key",azure_storage_account_key,\
        "--account-name",azure_storage_account,"--auth-mode","login"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to upload file to the container")


if __name__=="__main__":
    #Credentials 
    sentry_sdk.init(<SENTRY-CREDENTIALS>)
    azure_username=<AZURE-USERNAME>
    azure_password=<AZURE-PASSWORD>
    azure_tenant=<AZURE-TENANT>
    azure_storage_account=<AZURE-STORAGE-NAME>
    azure_storage_account_key=<AZURE-STORAGE-KEY>
    try:
        parser = argparse.ArgumentParser()
        parser.add_argument("input_azure_container",type=str,help="Location to download files from")
        parser.add_argument("output_azure_container",type=str,help="Location to upload files to")
        parser.add_argument("file_to_process",type=str,help="file link in azure blob storage")
        args = parser.parse_args()
        timestamp = time.time()
        timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
        task_working_dir=os.getcwd()
        file_to_process=args.file_to_process
        input_azure_container=args.input_azure_container
        output_azure_container=args.output_azure_container
        pdf_docs_path=os.path.join(task_working_dir,"pdf_files"+"-"+timestamp_humanreadable)
        png_docs_path=os.path.join(task_working_dir,"png_files"+"-"+timestamp_humanreadable)
        os.mkdir(pdf_docs_path)
        os.mkdir(png_docs_path)
    except Exception as e:
        sentry_sdk.capture_exception(e)
        sys.exit("Unable to set up the task working directories")
    azure_active_directory_login(azure_username,azure_password,azure_tenant)
    download_from_azure_blob(azure_storage_account,azure_storage_account_key,input_azure_container,file_to_process,pdf_docs_path)
    pdf_to_png(pdf_docs_path,png_docs_path)
    upload_to_azure_blob(azure_storage_account,azure_storage_account_key,output_azure_container,png_docs_path)

Update 1: I have solved the error of the server nodes going into an unusable state. The way I solved this issue is:

1) I did not use the cmds mentioned above to set up a Python 3.6 env on Ubuntu, since Ubuntu 18.04 LTS comes with its own Python 3 environment. Initially I had googled "Install Python 3 on Ubuntu" and had gotten this Python 3.6 installation on Ubuntu link; I avoided that step completely during the server set-up. All I did was install these packages this time:

sudo apt-get install -y python3-pip
sudo -H pip3 install tqdm==4.19.9
sudo -H pip3 install sentry-sdk==0.4.1
sudo -H pip3 install blobxfer==1.5.0
sudo -H pip3 install pandas==0.22.0

The Azure CLI was installed on the machine using the cmds in this link: Install Azure CLI with apt

2) Created a snapshot of the OS disk, then created an image from this snapshot, and finally referenced this image in the client-side script.

I am now faced with another issue where the stderr.txt files on the node tell me that:

  python3: can't open file '$AZ_BATCH_APP_PACKAGE_pdfprocessingapp/pdf_processing.py': [Errno 2] No such file or directory

Logging in to the server with the random user, I see that the directory _azbatch is created, but there are no contents inside this directory.

(Screenshot: no directory structure is seen inside _azbatch.)

I know for certain that it is the command line in the azure_batch_create_task() function where things are going haywire, but I am not able to put my finger on it. I have done everything that the docs recommend: Install app packages to Azure Batch Compute Nodes. Please review my client-side Python script and let me know what I am doing wrong!

Edit 3: The problem looks very similar to the one described in this post: Unable to pass app path to Tasks
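
If that post is right, a likely cause (my assumption, not verified yet) is that Batch does not run the task command line under a shell, so $AZ_BATCH_APP_PACKAGE_pdfprocessingapp is never expanded on the node. Wrapping the command in /bin/bash -c should force the expansion. A sketch of how the call in azure_batch_create_task() would change, assuming the application package id really is pdfprocessingapp:

#Sketch: invoke a shell so that Batch environment variables such as
#$AZ_BATCH_APP_PACKAGE_pdfprocessingapp are expanded on the node
shell_command="/bin/bash -c 'python3 $AZ_BATCH_APP_PACKAGE_pdfprocessingapp/pdf_processing.py {} {} {}'".format(\
input_azure_container,output_azure_container,row["container_files"])
subprocess.check_output(["az","batch","task","create","--task-id",task_info,\
"--job-id",job_info,"--command-line",shell_command])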

Update 2:

I was able to overcome the file/directory-not-found error using a dirty hack which I am not particularly fond of: I placed the Python app in the home directory of the user that was used to create the VM, and all the directories required for processing were created in the working directory of the task.

I still would want to know how I would run the workflow by using the application package way to deploy it to the node.

Update 3

I have updated the client-side code and the Python app to reflect the latest changes. The significant parts remain the same.

I will comment on the points that @fparks has raised.

The original Python app that I intend to use in Azure Batch contains many modules, some config files, and a quite lengthy requirements.txt file for Python packages. Azure also recommends using a custom image in such cases. Downloading the Python modules per task is also a bit irrational in my case, since one task corresponds to one multipage PDF and my expected workload is 25k multipage PDFs. I used the CLI because the docs for the Python SDK were sparse and hard to follow. The nodes going into an unusable state has been solved. I do agree with you on the blobxfer error.


1 Answer


Answers and a few observations:

  1. It is unclear to me why you need a custom image. You can use a platform image, i.e., Canonical, UbuntuServer, 18.04-LTS, and then just install what you need as part of the start task. Python 3.6 can simply be installed via apt in 18.04. You may be prematurely optimizing your workflow by opting for a custom image when in fact using a platform image + start task may be faster and more stable.
  2. Your script is in Python, yet you are calling out to the Azure CLI. You may want to consider directly using the Azure Batch Python SDK instead (samples); a minimal pool-creation sketch covering this and point 1 follows this list.
  3. When nodes go unusable, you should first examine the node for errors: check whether the ComputeNodeError field is populated. Additionally, you can try to fetch the stdout.txt and stderr.txt files from the startup directory to diagnose what's going on. You can do both of these actions in the Azure Portal or via Batch Explorer. If that doesn't work, you can fetch the compute node service logs and file a support request. Typically, however, unusable means that your custom image was provisioned incorrectly, you have a virtual network with a misconfigured NSG, or you have an incorrect application package. A sketch of inspecting nodes via the SDK also follows this list.
  4. Your application package consists of a single Python file; use a resource file instead. Simply upload the script to Azure Storage blob and reference it in your task as a resource file using a SAS URL. See the --resource-files argument of az batch task create if using the CLI. Your command to invoke would then simply be python3 pdf_processing.py (assuming you keep the resource file downloading to the task working directory); the final sketch after this list shows this together with point 6.
  5. If you insist on using an application package, consider using a task application package instead. This will decouple your node startup issues potentially originating from bad application packages to debugging task executions instead.
  6. The blobxfer error is pretty clear: your locale is not set properly. The easy way to fix this is to set environment variables for the task. See the --environment-settings argument if using the CLI, and set the two environment variables LC_ALL=C.UTF-8 and LANG=C.UTF-8 as part of your task (included in the final sketch below).
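
For points 1 and 2, here is a minimal sketch of creating a pool from a platform image with a start task using the Azure Batch Python SDK (pip install azure-batch). The pool id and the pip package set are illustrative, and the imports and model names match the 4.x/5.x SDK releases current at the time of writing, so verify them against your installed version:

import azure.batch.batch_service_client as batch
import azure.batch.batch_auth as batch_auth
import azure.batch.models as batchmodels

credentials = batch_auth.SharedKeyCredentials(<BATCH-ACCOUNT-NAME>, <BATCH-ACCOUNT-KEY>)
batch_client = batch.BatchServiceClient(credentials, base_url=<BATCH-ENDPOINT>)

# Install per-node dependencies in a start task instead of baking a
# custom image; wait_for_success holds tasks until the start task ends.
start_task = batchmodels.StartTask(
    command_line="/bin/bash -c 'apt-get update && apt-get install -y python3-pip "
                 "&& pip3 install pymupdf==1.13.20 tqdm==4.19.9 blobxfer==1.5.0'",
    user_identity=batchmodels.UserIdentity(
        auto_user=batchmodels.AutoUserSpecification(
            elevation_level=batchmodels.ElevationLevel.admin)),
    wait_for_success=True)

pool = batchmodels.PoolAddParameter(
    id="pdf-processing-pool",                      # illustrative pool id
    vm_size="Standard_A4_v2",
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="Canonical", offer="UbuntuServer", sku="18.04-LTS"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
    target_low_priority_nodes=2,
    start_task=start_task)
batch_client.pool.add(pool)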
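
For point 3, a sketch (same client and assumptions as above) of listing node errors and pulling the start-task logs from a node's startup directory:

# Diagnose unusable nodes: print each node's state and errors, then
# fetch the start-task stderr from the node's startup directory.
for node in batch_client.compute_node.list("pdf-processing-pool"):
    print(node.id, node.state)
    for error in (node.errors or []):
        print("  node error:", error.code, error.message)
    stream = batch_client.file.get_from_compute_node(
        "pdf-processing-pool", node.id, "startup/stderr.txt")
    print(b"".join(stream).decode("utf-8", errors="replace"))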
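
For points 4 and 6, a sketch of adding a task that downloads pdf_processing.py as a resource file and sets the two locale variables. The SAS URL and the container/file arguments are placeholders, and blob_source was renamed http_url in later SDK releases:

# Task that pulls the script into the task working directory as a
# resource file and sets the locale for blobxfer.
task = batchmodels.TaskAddParameter(
    id="mytask-0",
    command_line="/bin/bash -c 'python3 pdf_processing.py "
                 "<INPUT-CONTAINER> <OUTPUT-CONTAINER> <FILE-TO-PROCESS>'",
    resource_files=[batchmodels.ResourceFile(
        blob_source="<SAS-URL-OF-pdf_processing.py>",
        file_path="pdf_processing.py")],
    environment_settings=[
        batchmodels.EnvironmentSetting(name="LC_ALL", value="C.UTF-8"),
        batchmodels.EnvironmentSetting(name="LANG", value="C.UTF-8")])
batch_client.task.add(job_id="pdf-processing-job", task=task)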