1
votes

I have U-SQL script where I need to process some data. The data is stored in blob, with ~100 files per day in this folder structure: /{year}/{month}/{day}/{hour}/filenames.tsv

Getting one day of data is easy, just put a wildcard in the end and it will pick out all the files for all the hours for the day.

However, in my script I want to read out the current day and the last 2 hours of the previous day. The naive way is with 3 extract statements in this way:

DECLARE @input1 = @"/data/2017/10/08/22/{*}.tsv";
DECLARE @input2 = @"/data/2017/10/08/23/{*}.tsv";
DECLARE @input3 = @"/data/2017/10/09/{*}.tsv";

@x1 = EXTRACT .... FROM @input1 USING Extractors.Tsv();
@x2 = EXTRACT .... FROM @input2 USING Extractors.Tsv();
@x3 = EXTRACT .... FROM @input3 USING Extractors.Tsv();

But in my case each extract line is very long and complicated (~50 columns) using the AvroExtractor, so I would really prefer to only specify the columns and extractor once instead of 3 times. Also, by having 3 inputs its not possible from the caller side to decide how many hours from the previous days that should be read.

My question is how can I define this in a convenient way, ideally using only one extract statement?

1

1 Answers

2
votes

You could wrap your logic up into a U-SQL stored procedure so it is encapsulated. Then you need only make a few calls to the proc. A simple example:

CREATE PROCEDURE IF NOT EXISTS main.getContent(@inputPath string, @outputPath string)
AS
BEGIN;

    @output =
        EXTRACT 
         ...
        FROM @inputPath
        USING Extractors.Tsv();


    OUTPUT @output
    TO @outputPath
    USING Outputters.Tsv();

END;

Then to call it (untested):

main.getContent (
    @"/data/2017/10/08/22/{*}.tsv",
    @"/output/output1.tsv"
    )

main.getContent (
    @"/data/2017/10/08/23/{*}.tsv",
    @"/output/output2.tsv"
    )

main.getContent (
    @"/data/2017/10/09/{*}.tsv",
    @"/output/output3.tsv"
    )

That might be one way to go about it?