I have U-SQL script where I need to process some data. The data is stored in blob, with ~100 files per day in this folder structure: /{year}/{month}/{day}/{hour}/filenames.tsv
Getting one day of data is easy, just put a wildcard in the end and it will pick out all the files for all the hours for the day.
However, in my script I want to read out the current day and the last 2 hours of the previous day. The naive way is with 3 extract statements in this way:
DECLARE @input1 = @"/data/2017/10/08/22/{*}.tsv";
DECLARE @input2 = @"/data/2017/10/08/23/{*}.tsv";
DECLARE @input3 = @"/data/2017/10/09/{*}.tsv";
@x1 = EXTRACT .... FROM @input1 USING Extractors.Tsv();
@x2 = EXTRACT .... FROM @input2 USING Extractors.Tsv();
@x3 = EXTRACT .... FROM @input3 USING Extractors.Tsv();
But in my case each extract line is very long and complicated (~50 columns) using the AvroExtractor, so I would really prefer to only specify the columns and extractor once instead of 3 times. Also, by having 3 inputs its not possible from the caller side to decide how many hours from the previous days that should be read.
My question is how can I define this in a convenient way, ideally using only one extract statement?