3
votes

I have set up a Data Factory pipeline that gets a list of files in Azure Data Lake Storage Gen2 and then iterates over each file using a ForEach loop.

I'm using a Get Metadata activity to produce the list of files, and the argument it's outputting is 'Child Items'.

I want to make sure the list (child items) is always sorted in name order. My question is: what is the default sorting method for child items, or can I sort this manually?

Thanks

        {
            "name": "GetMetadata",
            "description": "",
            "type": "GetMetadata",
            "dependsOn": [
                {
                    "activity": "Execute Previous Activity",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "policy": {
                "timeout": "7.00:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            },
            "userProperties": [],
            "typeProperties": {
                "dataset": {
                    "referenceName": "Folder",
                    "type": "DatasetReference"
                },
                "fieldList": [
                    "childItems"
                ]
            }
        },
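For reference, the childItems field in the activity output is an array of name/type objects, roughly like this (the file names here are illustrative):

```json
{
    "childItems": [
        { "name": "file_b.csv", "type": "File" },
        { "name": "file_a.csv", "type": "File" }
    ]
}
```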

3 Answers

4
votes

I've implemented the following solution to overcome the problem with Get Metadata's default sorting order, without the use of Azure Functions:

ADF pipeline design

  1. Get a list of items from the blob storage
  2. Apply custom filtering (out of scope in your question context - just skip)
  3. Apply a Lookup activity that receives the JSON output of the Get Metadata activity from step 1, parses it using a T-SQL stored procedure, and returns it back as a sorted table representation of the input JSON (sorted in a descending manner)
  4. The ForEach activity then iterates through the list from top to bottom, starting with the most recent date folders and moving to the oldest ones
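The Lookup activity in step 3 can be sketched like this (the activity, dataset, and parameter names are assumptions; the exact source type depends on your linked service, and the sort order parameter is left at its default):

```json
{
    "name": "SortMetadata",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderStoredProcedureName": "Tech.spSortBlobMetadata",
            "storedProcedureParameters": {
                "Json": {
                    "type": "String",
                    "value": {
                        "value": "@string(activity('GetMetadata').output.childItems)",
                        "type": "Expression"
                    }
                }
            }
        },
        "dataset": {
            "referenceName": "SortDatabase",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
}
```

The string() expression serializes the childItems array into the JSON text the stored procedure expects, and firstRowOnly is set to false so the Lookup returns the whole sorted list.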

Below you can find:

Configuration of the lookup activity from 3.

T-SQL stored procedure that transforms the output of the Get Metadata activity into an input for the ForEach activity.

ALTER PROCEDURE Tech.spSortBlobMetadata 
     @Json      NVARCHAR(MAX)
   , @SortOrder VARCHAR(5)    = 'DESC'
   , @Debug     INT           = 0
AS

/***************************************************************************
     EXEC Tech.spSortBlobMetadata 
       '[{"name":"dt=2020-06-17","type":"Folder"},{"name":"dt=2020-06-18"}]'
       , 'DESC'
       , 1
***************************************************************************/

    BEGIN

        -- Allow only the two valid sort directions, since @SortOrder is
        -- concatenated into the dynamic SQL below
        IF @SortOrder NOT IN ('ASC', 'DESC')
            SET @SortOrder = 'DESC'

        DECLARE 
             @sqlTransform NVARCHAR(MAX) = 'SELECT *
FROM OPENJSON(@Json) WITH(name NVARCHAR(200) ''$.name'', type NVARCHAR(50) ''$.type'')
ORDER BY name ' + @SortOrder

        IF @Debug = 0
            BEGIN
                -- Execute the dynamic query, passing the JSON in as a parameter
                EXEC sp_executesql 
                   @sqlTransform
                 , N'@Json NVARCHAR(MAX)'
                 , @Json = @Json
            END
        ELSE
            BEGIN
                -- Debug mode: return the generated SQL instead of running it
                SELECT @sqlTransform
            END

    END
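The ForEach activity from step 4 then consumes the Lookup's output. A sketch of its configuration (activity names are assumptions, and the inner activities are omitted):

```json
{
    "name": "ForEachFolder",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": true,
        "items": {
            "value": "@activity('SortMetadata').output.value",
            "type": "Expression"
        }
    }
}
```

isSequential is set to true so the folders are processed strictly in the sorted order returned by the stored procedure.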
1
votes

You should probably try to refactor your process to take advantage of one of the real strengths of Azure Data Factory (ADF): the ability to process things in parallel. What if you did a DELETE based on the file / date / period instead of a TRUNCATE?

I did get a sequential process to work using a Lookup to a database, a query with an ORDER BY clause to sort the output, and a ForEach loop running in sequential mode, but this is counter to the strengths of ADF:
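As a sketch of that pattern (the table, column, and dataset names here are assumptions; the source type depends on your linked service):

```json
{
    "name": "LookupFileList",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT FileName FROM dbo.FilesToProcess ORDER BY FileName ASC"
        },
        "dataset": {
            "referenceName": "ControlDatabase",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
}
```

The ForEach then iterates @activity('LookupFileList').output.value with Sequential checked, so rows are processed in the order the query returns them.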

ADF pattern

0
votes

Unfortunately, there is no way to sort the order of childItems. I find this requirement quite strange - what is the scenario in which you need your files sorted?