0 votes

I'm working with Azure Data Factory v2, using a Batch Account pool with dedicated nodes to do processing. I'm finding that over time the Batch activity fails because there is no more space on the D:\ temp drive on the nodes. For each ADF job, a working directory is created on the node, and after the job completes I'm finding the files are not cleaned up. Has anybody else encountered this before, and what is the best solution to implement?

EDIT: There now seems to be a file retention setting in ADF that wasn't present when I posted the question. For anybody coming across the same issue in the future, that's a possible solution.


4 Answers

1 vote

Figured out a solution, and I'm posting it here to hopefully help the next person who comes along.

I found the Azure Python SDK for Batch and wrote a small script that iterates through all the pools and nodes on an account and deletes any files in the workitems directory that are older than one day.

import azure.batch as batch
from azure.batch.batch_auth import SharedKeyCredentials
from datetime import datetime

program_datetime = datetime.utcnow()

batch_account = 'batchaccount001'
batch_url = 'https://batchaccount001.westeurope.batch.azure.com'
batch_key = '<BatchKeyGoesHere>'
batch_credentials = SharedKeyCredentials(batch_account, batch_key)

# Create a Batch client with which to do pool, node, and file operations.
batch_client = batch.BatchServiceClient(credentials=batch_credentials,
                                        batch_url=batch_url)

# Iterate over every node in every pool on the account.
for pool in batch_client.pool.list():
    for node in batch_client.compute_node.list(pool.id):
        print(f'Pool = {pool.id}, Node = {node.id}')

        # Recursively list all files on the node.
        files = batch_client.file.list_from_compute_node(pool.id,
                                                         node.id,
                                                         recursive=True)

        for file in files:
            # Skip directories; they do not have a last_modified property.
            if not file.is_directory:
                file_datetime = file.properties.last_modified.replace(tzinfo=None)
                file_age_in_seconds = (program_datetime - file_datetime).total_seconds()
                # Delete anything in the workitems directory older than a day.
                if file_age_in_seconds > 86400 and file.name.startswith('workitems'):
                    print(f'{file_age_in_seconds} : {file.name}')
                    batch_client.file.delete_from_compute_node(pool.id, node.id, file.name)

1 vote

I'm an engineer with Azure Data Factory. ADF uses an Azure Batch SDK earlier than 2018-12-01.8.0, so Batch tasks created through ADF default to the infinite retention period mentioned in the other answers. We're rolling out a fix that defaults the retention period for Batch tasks created through ADF to 30 days going forward, and we're also introducing a property, retentionTimeInDays, in the typeProperties of the custom activity, which customers can set in their ADF pipelines to override this default. Once this has rolled out, the documentation at https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity#custom-activity will be updated with more details. Thank you for your patience.

0 votes

Cleanup of a task's files happens either when the task is deleted or when the task's retention time elapses (https://docs.microsoft.com/en-us/rest/api/batchservice/task/add#taskconstraints). Either of these should solve the issue you are having.

Note: The default retention time has been decreased from infinite to 7 days in the latest REST API (2018-12-01.8.0) to allow task cleanup by default. Tasks created with versions prior to this will not have this new default.
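
If you create the Batch tasks yourself rather than through ADF, you can set the retention time explicitly via the task's constraints. A minimal sketch using the Azure Batch Python SDK, assuming hypothetical account, job, and task names:

from datetime import timedelta

import azure.batch as batch
import azure.batch.models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

batch_credentials = SharedKeyCredentials('batchaccount001', '<BatchKeyGoesHere>')
batch_client = batch.BatchServiceClient(credentials=batch_credentials,
                                        batch_url='https://batchaccount001.westeurope.batch.azure.com')

# Keep the task's files on the node for 7 days after it completes,
# rather than relying on the API version's default.
task = batchmodels.TaskAddParameter(
    id='example-task',                 # hypothetical task id
    command_line='cmd /c echo hello',  # hypothetical command
    constraints=batchmodels.TaskConstraints(retention_time=timedelta(days=7)))

batch_client.task.add(job_id='example-job', task=task)  # hypothetical job id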

-1 votes

You can use the retentionTimeInDays config in typeProperties when deploying via an ARM template.

Please note that retentionTimeInDays should be provided as a Double, not a String.
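
For illustration, a trimmed custom activity definition with this setting might look like the following; the activity and linked service names are placeholders, and the important detail is that the value is a bare number rather than a quoted string:

{
    "name": "MyCustomActivity",
    "type": "Custom",
    "linkedServiceName": {
        "referenceName": "AzureBatchLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "command": "helloworld.exe",
        "retentionTimeInDays": 30
    }
}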