How do you define a date range and then query across a complex file set in USQL?

Question

There are many great examples of USQL across single files. But, how would you replicate a very common data processing example where you want to take the current system time, subtract X number of days from that time and query a set of data based on that result? For a SQL example:

SELECT * FROM MyTable
WHERE Date >= CAST(GETDATE() AS DATE) - 30
AND Date <= CAST(GETDATE() AS DATE) - 1

In this above example, my Dates is my file location such as:

'yyyy' | 'MM' | 'DD' | Filename.csv

-- Example path
/MyDirectory/2017/12/01/SomeData.csv

Therefore, is there a way in USQL with Azure Data Lake Analytics to do similar, but with the file location instead of querying everything with "{date:yyyy}/{date:MM}/{date:dd}/" expressions?

If that's not possible, what about specifing a range at least like:

"/MyDirectory/2017/{10-12}/{1-30}/{filename:*}.csv"

I can combine all files to one directory and use the natural date fields in the data to filter with a SELECT statement after the extractor, but the point of the directory structure is to reduce un-needed reads (transactions) and only targeting specific directories needed for a query based on the date of the said file itself.

Have you tried using the {date:MM}{date:dd}{date:yyyy} and then filtering using 'date' virtual column in next SELECT statement? Your job should in fact recognize the filter and only read the needed folders... — maya-msft
I just ran into an example of using the virtual columns. I just wasn't sure if that would only read the directories/files based on say, "WHERE date >= DateTime.Parse("2017-11-01")". If it only reads what is needed, not all the YYYY data, then that would be great and the answer to my question! I can obviously generate the date with a script if no system time functions exist in USQL. — Fastidious
It should partition eliminate the input set based on filters on virtual columns, and read only what's needed... — maya-msft

David Paul Giroux David Paul Giroux · Accepted Answer · 2017-12-01T17:19:32

Maya is correct. There are examples at U-SQL Language Reference and more specifically at EXTRACT Expression (U-SQL). See the example under "Multiple directories with multiple files". Here are some modifications to that example that appear to satisfy your ask.

1) The example is missing DECLARE @dir string = "/Samples/Data/AmbulanceData/";
2) Modify the DECLARE @file_set_path2 to read DECLARE @file_set_path2 string = @dir + "{date:yyyy}/{date:MM}/{date:dd}/vehicle{vid}_{*}.csv";
3. For your filter you could use WHERE date >= DateTime.Now.AddDays(-30) AND date <= DateTime.Now.AddDays(-1)

How do you define a date range and then query across a complex file set in USQL?

1 Answers