
I have a U-SQL script that uses a file pattern to find files in Azure Data Lake and extract some data from them:

DECLARE @input_file string = @"\data\{*}\{*}\{*}.avro";
@data = EXTRACT 
        Column1 string,
        Column2 double
FROM @input_file
USING new MyExtractors.AvroExtractor();

The file pattern is:

data/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}

Problem: the custom extractor runs very slowly. I have many files in the Data Lake, and it takes 15 hours to process them and costs US$600 per run. Too slow and too expensive.

I only need to extract fresh data, from files that are no more than 90 days old. How can I filter out the older files using the file pattern, the file's modified date, or any other technique?

Hey Andrei, see here for how to work folder names with a date structure into a virtual column. Let me know if you need me to work up an example. – wBob

@wBob thanks, I did see this. The problem is that I will only have the filename after the extraction step. My problem is the extraction itself; I want to optimize the fileset before processing. – Andrei

2 Answers

1 vote

You can leverage the Get Metadata activity in Azure Data Factory to retrieve the lastModifiedTime of your files.

Reference doc: Get Metadata activity in Azure Data Factory

There is also a relevant post about incremental copying:

Azure data factory | incremental data load from SFTP to Blob
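
For illustration, here is a minimal sketch of what that activity could look like in pipeline JSON; the dataset name AdlsAvroFiles is hypothetical and would point at your Data Lake folder:

{
    "name": "GetFileMetadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "AdlsAvroFiles",
            "type": "DatasetReference"
        },
        "fieldList": [ "itemName", "lastModified" ]
    }
}

The activity output exposes lastModified, so a downstream If Condition could keep only recent files with an expression along the lines of @greaterOrEquals(ticks(activity('GetFileMetadata').output.lastModified), ticks(adddays(utcnow(), -90))).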

1 vote

You could use the .AddDays method of DateTime.Now, although whether or not this actually filters out all of your files is (I think) dependent on your custom extractor, e.g.

//DECLARE @input_file string = @"\data\{*}\{*}\{*}.csv";
// Each {label} in the pattern becomes a virtual column and appears in the EXTRACT schema below.
DECLARE @input_file string = @"\data\{Namespace}-{EventHub}-{PartitionId}\{xdate:yyyy}-{xdate:MM}-{xdate:dd}\{Hour}-{Minute}-{Second}.csv";


// data/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}
@input =
    EXTRACT Column1 string,
            Column2 double,
            xdate DateTime,
            Namespace string,
            EventHub string,
            PartitionId string,
            Hour int,
            Minute int,
            Second int
    FROM @input_file
    USING Extractors.Csv();
//USING new MyExtractors.AvroExtractor();


// The predicate on the xdate virtual column can be applied during file
// selection (partition elimination), so files older than 90 days are not read.
@output =
    SELECT Column1,
           Column2
    FROM @input
    WHERE xdate > DateTime.Now.AddDays(-90);


OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();

In my simple tests with .Csv this worked to reduce the input from 4 streams to 3, but as mentioned I'm not sure whether this will work with your custom extractor.

[Screenshot: local results]
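
If your AvroExtractor accepts filesets (your original {*} pattern suggests it does), the same idea should carry over to the .avro files; this is an untested sketch, not something I have run against Avro:

DECLARE @input_file string = @"\data\{Namespace}-{EventHub}-{PartitionId}\{xdate:yyyy}-{xdate:MM}-{xdate:dd}\{Hour}-{Minute}-{Second}.avro";

@data =
    EXTRACT Column1 string,
            Column2 double,
            xdate DateTime,
            Namespace string,
            EventHub string,
            PartitionId string,
            Hour int,
            Minute int,
            Second int
    FROM @input_file
    USING new MyExtractors.AvroExtractor();

// Keep only the last 90 days; the predicate on the virtual column
// should again be applied during file selection.
@fresh =
    SELECT Column1,
           Column2
    FROM @data
    WHERE xdate > DateTime.Now.AddDays(-90);

OUTPUT @fresh
TO "/output/fresh.csv"
USING Outputters.Csv();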