2
votes

Environment

  • Azure Data Factory/U-SQL

Scenario

  • I have ADF pipeline which runs on daily basis.And create the Folder and file in following format.Basically the Date wise folder and files are created.

  • "/Samples/Data/TestDB/{filedate:yyyy}.{filedate:MM}.{filedate:dd}/TestDetail.{filedate:yyyy}.{filedate:MM}.{filedate:dd}.csv";

  • In U-SQL I want to process the last 7 days files. I created the U-SQL script by having virtual column to select last 7 days of files.

  • I am running these changes on locally.

Code

 @userLoginData =
        EXTRACT filedate DateTime,
                UserLoginAuditDetailId Guid,
                UserId string              
        FROM    @userLoginDetailPath   
        USING   Extractors.Csv();

  @extractedLoginData = 
        SELECT 
                UserLoginAuditDetailId,
                UserId             
        FROM    @userLoginData
        WHERE   filedate.Date.ToString("d") == @sliceEndTime.AddDays(-7).Date.ToString("d"); 

Problem

I am able to run the script successfully, but the when the Job compile then job graph is showing, it is looking for the files which are more than 7 days. Please find below image.Basically it should not be looking for the file 2017.04.20. Right?

Job Graph

1
I wonder if the function calls in your WHERE clause are to complex for it to push the predicate back down to the extractor.... A bit like a field that is no longer sargable. Try changing the data type for the virtual field to a string and doing a simple comparison? - Paul Andrew
It is still picking the all the files - Mangesh Tambare

1 Answers

2
votes

You can achieve the solution by using the BETWEEN clause and comparing the your fileDate virtual column with the specified date something like,

DECLARE @ProcessStart DateTime = new DateTime(@sliceEndTime.Year, @sliceEndTime.Month,@sliceEndTime.Day).AddDays(-n); // n is the no of days 
DECLARE @ProcessEnd DateTime = new DateTime(@sliceEndTime.Year, @sliceEndTime.Month,@sliceEndTime.Day); // current day when slice runs

and then you can filter the filedate using

SELECT * FROM input WHERE    filedate  BETWEEN  @ProcessStart  AND  @ProcessEnd;

This will only process the files specified in the range of @ProcessStart and @ProcessEnd.

Let me know if it solves your issue.