
I have a U-SQL script that uses a file pattern to find files in Azure Data Lake and extract some data from them:

DECLARE @input_file string = @"\data\{*}\{*}\{*}.avro";
@data = EXTRACT 
        Column1 string,
        Column2 double
FROM @input_file
USING new MyExtractors.AvroExtractor();

The file pattern is:

data/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}

Problem: the custom extractor runs very slowly. I have many files in the Data Lake, and it takes 15 hours to process them and costs US$600 per run. Too slow and too expensive.

I only need to extract fresh data, from files that are no more than 90 days old. How can I filter out the older files using the file pattern, the file's modified date, or any other technique?

Hey Andrei, see here for how to work folder names with a date structure into a virtual column. Let me know if you need me to work up an example. – wBob

@wBob thanks, I did see this. The problem is that I will only have the filename after the extraction step. My problem is the extraction itself; I want to optimize the fileset before processing. – Andrei

2 Answers

1 vote

You can leverage the Get Metadata activity in Azure Data Factory to retrieve the lastModifiedTime of your files.

Reference doc: Get Metadata activity in Azure Data Factory

There is also a relevant post about incremental copying:

Azure data factory | incremental data load from SFTP to Blob
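
For illustration, here is a minimal sketch of what that activity could look like in pipeline JSON; the dataset name AdlsAvroFiles is hypothetical and would point at your Data Lake folder:

{
    "name": "GetFileMetadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "AdlsAvroFiles",
            "type": "DatasetReference"
        },
        "fieldList": [ "itemName", "lastModified" ]
    }
}

The activity output exposes lastModified, so a downstream If Condition could keep only recent files with an expression along the lines of @greaterOrEquals(ticks(activity('GetFileMetadata').output.lastModified), ticks(adddays(utcnow(), -90))).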

1 vote

You could use the .AddDays method of DateTime.Now, although whether or not this actually filters out all of your files is (I think) dependent on your custom extractor, e.g.

//DECLARE @input_file string = @"\data\{*}\{*}\{*}.csv";
// Each {label} in the pattern becomes a virtual column and appears in the EXTRACT schema below.
DECLARE @input_file string = @"\data\{Namespace}-{EventHub}-{PartitionId}\{xdate:yyyy}-{xdate:MM}-{xdate:dd}\{Hour}-{Minute}-{Second}.csv";


// data/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}
@input =
    EXTRACT Column1 string,
            Column2 double,
            xdate DateTime,
            Namespace string,
            EventHub string,
            PartitionId string,
            Hour int,
            Minute int,
            Second int
    FROM @input_file
    USING Extractors.Csv();
//USING new MyExtractors.AvroExtractor();


// The predicate on the xdate virtual column can be applied during file
// selection (partition elimination), so files older than 90 days are not read.
@output =
    SELECT Column1,
           Column2
    FROM @input
    WHERE xdate > DateTime.Now.AddDays(-90);


OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();

In my simple tests with .Csv this worked to reduce the input from 4 streams to 3, but as mentioned I'm not sure whether this will work with your custom extractor.

[Screenshot: local results]
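
If your AvroExtractor accepts filesets (your original {*} pattern suggests it does), the same idea should carry over to the .avro files; this is an untested sketch, not something I have run against Avro:

DECLARE @input_file string = @"\data\{Namespace}-{EventHub}-{PartitionId}\{xdate:yyyy}-{xdate:MM}-{xdate:dd}\{Hour}-{Minute}-{Second}.avro";

@data =
    EXTRACT Column1 string,
            Column2 double,
            xdate DateTime,
            Namespace string,
            EventHub string,
            PartitionId string,
            Hour int,
            Minute int,
            Second int
    FROM @input_file
    USING new MyExtractors.AvroExtractor();

// Keep only the last 90 days; the predicate on the virtual column
// should again be applied during file selection.
@fresh =
    SELECT Column1,
           Column2
    FROM @data
    WHERE xdate > DateTime.Now.AddDays(-90);

OUTPUT @fresh
TO "/output/fresh.csv"
USING Outputters.Csv();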