
Within my Azure Data Lake directory ingest, I have timestamp-named subdirectories:

ingest/1510395023/file.tsv
ingest/1510425023/file.tsv
ingest/1510546210/file.tsv
...

In an Azure Data Factory v2 copy activity (Data Lake to SQL Server), how can I filter the Data Lake directories to those greater than a specified timestamp?

Assume I will record the already-copied timestamps in a SQL Server table, and on the next pipeline run I only want to copy the new directories incrementally, based on a Lookup activity.
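
For context, I am thinking of something along these lines for the watermark table and the Lookup activity's query; the table and column names (dbo.IngestWatermark, LastTimestamp) are just placeholders:

-- Hypothetical watermark table holding the last directory timestamp already copied.
CREATE TABLE dbo.IngestWatermark
(
    LastTimestamp BIGINT NOT NULL
);

-- Seed with 0 so the first run picks up all historical directories.
INSERT INTO dbo.IngestWatermark (LastTimestamp) VALUES (0);

-- Query for the Lookup activity: returns a single row with the current watermark.
SELECT MAX(LastTimestamp) AS LastTimestamp
FROM dbo.IngestWatermark;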

Data Factory documentation references logical functions, but I don't understand how to implement them in this scenario.

Alternatively, if a logical comparison filter is available on the file name rather than on the directories, that would be helpful as well.

Note: I want to load historical data as well, so filtering only for timestamps greater than or equal to now is insufficient.


1 Answer


I assume you have a Data Lake Analytics account.

With a U-SQL activity you can consolidate the targeted files into a single output file, which you can then move to SQL Server with a regular Copy activity.

You can accomplish this by extracting a file set and filtering it on its virtual column.

Let @timestamp (a string) be the input parameter, obtained via the Lookup activity, holding the latest timestamp that has already been processed:

// The latest timestamp that has already been processed; Data Factory supplies
// this as a script parameter (e.g. from the Lookup activity's output).
DECLARE EXTERNAL @timestamp string = "0";

// Extract across all directories; {Timestamp} is a virtual column populated
// from the directory name in the path.
@data =
    EXTRACT
        [Col1] string,
        [Col2] string,
        [Timestamp] string
    FROM "ingest/{Timestamp}/file.tsv"
    USING Extractors.Tsv(skipFirstNRows : 1 /* remove if the files have no header row */);

// Keep only rows from directories newer than the supplied watermark.
@delta =
    SELECT [Col1], [Col2]
    FROM @data
    WHERE long.Parse([Timestamp]) > long.Parse(@timestamp);


OUTPUT @delta
TO "delta/file.tsv"
USING Outputters.Tsv(outputHeader : true);

From there you can copy "delta/file.tsv" to your database.
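
After a successful copy you would also advance the watermark so the next run only picks up newer directories, for example with a Stored Procedure activity at the end of the pipeline. A minimal sketch, reusing the placeholder dbo.IngestWatermark table from the question:

-- Illustrative only: records the highest directory timestamp that was just copied.
-- @NewTimestamp would be supplied by the pipeline.
CREATE PROCEDURE dbo.usp_UpdateIngestWatermark
    @NewTimestamp BIGINT
AS
BEGIN
    UPDATE dbo.IngestWatermark
    SET LastTimestamp = @NewTimestamp
    WHERE @NewTimestamp > LastTimestamp;
END;

In the pipeline itself, the Lookup activity's output can be fed to the U-SQL script parameter with an expression such as @activity('LookupWatermark').output.firstRow.LastTimestamp (the activity name here is illustrative).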