I'm writing a task manager for Azure Batch in Python. When I run the manager and add a job to the specified Azure Batch account, it does the following:

  1. check if the specified job id already exists
  2. if yes, delete the job
  3. create the job

Unfortunately, it fails between steps 2 and 3. Even though I issue the delete command for the specified job and check that no job with the same id remains in the Azure Batch account, I get a BatchErrorException like the following when I try to create the job again:

Exception encountered:
The specified job has been marked for deletion and is being garbage collected.

The code I use to delete the job is the following:

def deleteJob(self, jobId):

    print("Delete job [{}]".format(jobId))

    self.__batchClient.job.delete(jobId)

    # Wait until the job is deleted
    # 10 minutes timeout for the operation to succeed
    timeout = datetime.timedelta(minutes=10)
    timeout_expiration = datetime.datetime.now() + timeout 
    while True:

        try:
            # As long as we can retrieve data related to the job, it means it is still deleting
            self.__batchClient.job.get(jobId)
        except batchmodels.BatchErrorException:
            print("Job {jobId} deleted correctly.".format(
                jobId = jobId
                ))
            break

        time.sleep(2)

        if datetime.datetime.now() > timeout_expiration:
            raise RuntimeError("ERROR: couldn't delete job [{jobId}] within timeout period of {timeout}.".format(
                jobId = jobId
                , timeout = timeout
                ))

I tried to check the Azure SDK, but couldn't find a method that would tell me exactly when a job was completely deleted.

2 Answers

0
votes

Querying for existence of the job is the only way to determine if a job has been deleted from the system.

Alternatively, you can issue a delete job and then create a job with a different id, if you do not strictly need to reuse the same job id again. This will allow the job to delete asynchronously from your critical path.
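If reusing the id is not required, one way to get a fresh id on every run is to append a timestamp and a short random suffix to a base name. The following is only a sketch: `make_unique_job_id` is a hypothetical helper, not part of the Azure SDK, and the 64-character cap reflects my understanding of the Batch job id limit (letters, digits, hyphens and underscores only), so check it against the current service documentation.

```python
import datetime
import uuid

MAX_JOB_ID_LENGTH = 64  # assumed Batch limit for job ids


def make_unique_job_id(base_id):
    """Return base_id followed by a UTC timestamp and a random suffix."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]
    tail = "-{}-{}".format(stamp, suffix)
    # Trim the base name rather than the tail, so the unique part always survives
    return base_id[:MAX_JOB_ID_LENGTH - len(tail)] + tail
```

The old job can then be deleted with the usual delete call and left to be garbage collected in the background, while the new job is created right away under the generated id.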

0
votes

According to the exception message you provided, I think this happens because deleting a job takes some time, and you can't create a job with the same id while the deletion is still in progress.

I suggest adding a check to step 3, so that you confirm no job with the same id is found in the account before you create it.

Since you did not provide your job creation code, you could refer to the snippet below to create the job:

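import time

import azure.batch.batch_service_client as batch
import azure.batch.batch_auth as batchauth
import azure.batch.models as batchmodels

credentials = batchauth.SharedKeyCredentials(ACCOUNT_NAME,
                                             ACCOUNT_KEY)

batch_client = batch.BatchServiceClient(
    credentials,
    base_url=ACCOUNT_URL)


def createJob(jobId):

    # job.get raises BatchErrorException once the job can no longer be found,
    # so poll until that happens instead of testing its return value
    while True:
        try:
            batch_client.job.get(jobId)
        except batchmodels.BatchErrorException as err:
            if err.error.code != 'JobNotFound':
                raise
            break
        print('job still exists, cannot be created yet')
        time.sleep(2)

    # Create Job
    job = batchmodels.JobAddParameter(
        id=jobId,
        pool_info=batchmodels.PoolInformation(pool_id='mypool')
    )
    batch_client.job.add(job)
    print('create success')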
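Because the error in the question is raised by the create call itself, another option is to retry the creation for as long as the service reports that the old job is still being deleted. Below is a minimal sketch of that idea, kept independent of the SDK so it can be tried without a Batch account. `add_job_with_retry`, `add_job` and `is_being_deleted` are hypothetical names, and the 600 second timeout and 2 second pause simply mirror the values used in the question.

```python
import time


def add_job_with_retry(add_job, is_being_deleted, timeout_s=600, poll_s=2,
                       sleep=time.sleep, clock=time.monotonic):
    """Call add_job() until it succeeds or the timeout expires.

    add_job: zero-argument callable that creates the job.
    is_being_deleted: callable taking the raised exception and returning True
        when the error means the old job is still being deleted.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return add_job()
        except Exception as err:
            # Anything other than "still being deleted" is a real failure
            if not is_being_deleted(err):
                raise
            if clock() >= deadline:
                raise RuntimeError(
                    "job still being deleted after {} seconds".format(timeout_s))
        sleep(poll_s)
```

With the client from the snippet above, `add_job` could be `lambda: batch_client.job.add(job)`, and `is_being_deleted` could check that the exception is a `batchmodels.BatchErrorException` whose `error.code` equals `'JobBeingDeleted'`. That error code is my assumption based on the message quoted in the question, so verify it against the exception you actually receive.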

Hope it helps you.