3 votes

Here is my complete task description:

I have to extract data from multiple files using U-SQL and output it to a CSV file. Every input file contains multiple reports, delimited by marker rows ("START OF ..." and "END OF ..." act as report separators). Here is an example of the data format of a single source (input) file:

START OF DAILY ACCOUNT
some data 1
some data 2
some data 3
some data n
END OF DAILY ACCOUNT
START OF LEDGER BALANCE
some data 1
some data 2
some data 3
some data 4
some data 5
some data n
END OF LEDGER BALANCE
START OF DAILY SUMMARY REPORT
some data 1
some data 2
some data 3
some data n
END OF DAILY SUMMARY REPORT

So my question is: how can I fetch the records between the "START OF ..." and "END OF ..." rows for all the files?

I want something like this at the end:

@dailyAccountResult = [select all rows between "START OF DAILY ACCOUNT" and "END OF DAILY ACCOUNT" rows]

@ledgerBalanceResult = [select all rows between "START OF LEDGER BALANCE" and "END OF LEDGER BALANCE" rows]

@dailySummaryReportResult = [select all rows between "START OF DAILY SUMMARY REPORT" and "END OF DAILY SUMMARY REPORT" rows]

Do I need to write a custom extractor for this? If so, please suggest how.

Still unclear. At least for me. – Imad
What is the size of your file? – Jamil

2 Answers

4 votes

I think this is possible using normal U-SQL without a custom extractor. I have created a simple example based on your sample data:

// Get raw input
@input =
    EXTRACT rawData string
    FROM "/input/input36.txt"
    USING Extractors.Tsv();


// Add a row number and break out the section name;
// get all "START OF ..." and "END OF ..." blocks and pair them.
// !!WARNING: this code assumes there are no duplicate sections, i.e. there cannot be more than one DAILY ACCOUNT section, for example.
@working =
    SELECT ROW_NUMBER() OVER() AS rn,
           System.Text.RegularExpressions.Regex.Match(rawData, "(START OF|END OF) (?<sectionName>.+)").Groups["sectionName"].ToString() AS sectionName,
           *
    FROM @input;


// Work out the section boundaries
@sections =
    SELECT sectionName,
           MIN(rn) AS startRn,
           MAX(rn) AS endRn,
           COUNT( * ) AS records
    FROM @working
    WHERE sectionName != ""
    GROUP BY sectionName;


// Create the output
@output =
    SELECT s.sectionName,
           i.rn == s.startRn ? 1 : 0 AS isStartSection,
           i.rn == s.endRn ? 1 : 0 AS isEndSection,
           i.rawData
    FROM @sections AS s
         CROSS JOIN
             @working AS i
    WHERE i.rn BETWEEN s.startRn AND s.endRn;


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

My results: [screenshot of the tagged output rows]

Now that each section is tagged with its section name, you can easily assign the data to different variables and optionally include the header/footer rows, e.g.:

@dailyAccount =
    SELECT rawData
    FROM @output
    WHERE sectionName == "DAILY ACCOUNT"
          AND isStartSection == 0
          AND isEndSection == 0;
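
For example, if you want to keep the "START OF ..."/"END OF ..." marker rows for a section, just drop the two flag filters - something like this (an untested variant of the same pattern):

@ledgerBalance =
    SELECT rawData
    FROM @output
    WHERE sectionName == "LEDGER BALANCE";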

Give it a try and let me know how you get on.

1 vote

The relevant questions to ask:

  1. In a distributed processing system, will all input data (which could be TBs) be processed by a single extractor instance?
    • Definitely not! For confirmation, see the EXTRACT documentation (msdn.microsoft.com/en-us/library/azure/mt621320.aspx).
  2. Given multiple extractor instances, where can the data get split? Put another way: generically, what determines the unit of atomicity of data in U-SQL? And specifically for your case, what guarantee do you have that entire START...END sequences will be processed by one instance and not split in the middle?
    • The Data Lake Tools documentation suggests the generic unit of data atomicity is the "line" (for row-structured files) - and that this is a property of the data upload itself.
    • Per the U-SQL Programmability guide, [SqlUserDefinedExtractor(AtomicFileProcessing = true)] ensures the entire input is processed sequentially by a single instance, which is sufficient, and may be feasible in this case depending on the input size.
  3. Do Rowsets have order?

    • No! Rowsets are unordered concepts - think of them as non-deduping HashSets.

      var input = new HashSet<string>(File.ReadLines(@"In_Data")); File.WriteAllLines(@"Out_NewData", input);

      isn't expected to preserve the original line order (even if it does for some inputs, that is an implementation detail, not guaranteed semantic behaviour).
      Ditto for rowsets - the input order of lines is lost (not guaranteed) the moment the data is translated into a rowset. Attempting to use ROW_NUMBER() is therefore fruitless - there is no order left to preserve by the time ROW_NUMBER() can be invoked. The only way to use ROW_NUMBER() would be if the rowset had some key whose sort order could recreate the original order of the lines.

Because rowsets do not have order, you need a custom extractor no matter what - it is the only part of the script able to observe the order of lines in the file, provided that either:

  • it uses AtomicFileProcessing, or
  • you figure out a way to guarantee that data splits do not occur in the middle of START...END sequences. AFAIK there is no way to do this (short of preprocessing entire sequences into single lines pre-upload).

You can choose to include all your logic in the custom extractor, or simply have it add a line-number column to mimic ROW_NUMBER() and use native U-SQL for the rest of the logic; a sketch of the latter approach follows.
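
Here is a minimal sketch of such a line-numbering extractor (the class name, column names and namespace are illustrative assumptions, not tested code):

using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;
using System.IO;

// Process the whole file on a single vertex so that line order is observable
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class LineNumberExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        long lineNumber = 0;
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Emit each line together with its position in the file
                output.Set<long>("lineNumber", ++lineNumber);
                output.Set<string>("rawData", line);
                yield return output.AsReadOnly();
            }
        }
    }
}

After registering the assembly, the U-SQL side then has a reliable ordering key to drive the section-splitting logic:

@input =
    EXTRACT lineNumber long,
            rawData string
    FROM "/input/input36.txt"
    USING new YourNamespace.LineNumberExtractor();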