When using control flow, you can use a Get Metadata activity to retrieve the list of files in a blob storage container and pass that list to a ForEach activity. With the Sequential flag set to false, the ForEach processes the files concurrently (in parallel), up to the max batch count, running the activities defined inside the loop against each file.
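For reference, the control-flow pattern I'm describing looks roughly like this as a pipeline fragment (activity, dataset names, and the batch count of 20 are placeholders, not taken from a real pipeline):

```json
{
  "activities": [
    {
      "name": "GetFileList",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "BlobFolderDataset", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
      }
    },
    {
      "name": "ProcessEachFile",
      "type": "ForEach",
      "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "isSequential": false,
        "batchCount": 20,
        "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
        "activities": []
      }
    }
  ]
}
```

The per-file activities would go inside the ForEach's `activities` array; `@item().name` would give each file name.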
However, the following Microsoft article on data flows (https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern) indicates:
A mapping data flow will execute better when the Source transformation iterates over multiple files instead of looping via the For Each activity. We recommend using wildcards or file lists in your source transformation. The Data Flow process will execute faster by allowing the looping to occur inside the Spark cluster. For more information, see Wildcarding in Source Transformation.
For example, if you have a list of data files from July 2019 that you wish to process in a folder in Blob Storage, below is a wildcard you can use in your Source transformation.
DateFiles/*_201907*.txt
By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
Based on this guidance, I have configured a Data Flow activity whose source points at a blob directory, so it processes every file in that directory with no control-flow loops. However, I don't see any option within the data flow to process files concurrently. What I do see is an Optimize tab where you can set the partitioning option.
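Concretely, the source is configured along these lines in data flow script (the stream name and wildcard path are illustrative; I'm assuming the wildcard-path option described in the quoted article):

```
source(allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['DateFiles/*_201907*.txt']) ~> FileSource
```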
Is that partitioning option only for splitting a single large file across multiple threads, or does it also control how many files are processed concurrently from the directory the source points at?
Or is the documentation assuming the ForEach loop is set to "Sequential"? (I can see why the claim would hold in that case, but I have a hard time believing it if the data flow is otherwise processing one file at a time.)