3 votes

Here is my complete task description:

I have to extract data from multiple files using U-SQL and output it to a CSV file. Every input file contains multiple reports, delimited by marker rows ("START OF ..." and "END OF ..." act as report separators). Here is an example of the data format of a single source (input) file:

START OF DAILY ACCOUNT
some data 1
some data 2
some data 3
some data n
END OF DAILY ACCOUNT
START OF LEDGER BALANCE
some data 1
some data 2
some data 3
some data 4
some data 5
some data n
END OF LEDGER BALANCE
START OF DAILY SUMMARY REPORT
some data 1
some data 2
some data 3
some data n
END OF DAILY SUMMARY REPORT

So my question is: how can I fetch the records between the "START OF ..." and "END OF ..." rows for all the files?

I want something like this at the end:

@dailyAccountResult = [select all rows between "START OF DAILY ACCOUNT" and "END OF DAILY ACCOUNT" rows]

@ledgerBalanceResult = [select all rows between "START OF LEDGER BALANCE" and "END OF LEDGER BALANCE" rows]

@dailySummaryReportResult = [select all rows between "START OF DAILY SUMMARY REPORT" and "END OF DAILY SUMMARY REPORT" rows]

Do I need to write a custom extractor for this? If so, please suggest how.

Still unclear. At least for me. – Imad
What is the size of your file? – Jamil

2 Answers

4 votes

I think this is possible using normal U-SQL without a custom extractor. I have created a simple example based on your sample data:

// Get raw input
@input =
    EXTRACT rawData string
    FROM "/input/input36.txt"
    USING Extractors.Tsv();


// Add a row number and break out the section name;
// get all "START OF ..." and "END OF ..." blocks and pair them.
// !!WARNING: this code assumes there are no duplicate sections, i.e. there cannot be more than one DAILY ACCOUNT section, for example.
@working =
    SELECT ROW_NUMBER() OVER() AS rn,
           System.Text.RegularExpressions.Regex.Match(rawData, "(START OF|END OF) (?<sectionName>.+)").Groups["sectionName"].ToString() AS sectionName,
           *
    FROM @input;


// Work out the section boundaries
@sections =
    SELECT sectionName,
           MIN(rn) AS startRn,
           MAX(rn) AS endRn,
           COUNT( * ) AS records
    FROM @working
    WHERE sectionName != ""
    GROUP BY sectionName;


// Create the output
@output =
    SELECT s.sectionName,
           i.rn == s.startRn ? 1 : 0 AS isStartSection,
           i.rn == s.endRn ? 1 : 0 AS isEndSection,
           i.rawData
    FROM @sections AS s
         CROSS JOIN
             @working AS i
    WHERE i.rn BETWEEN s.startRn AND s.endRn;


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

My results: [screenshot of the tagged output rows]

Now that each section is tagged with its section name, you can easily assign the data to different variables and optionally include the header/footer rows, e.g.:

@dailyAccount =
    SELECT rawData
    FROM @output
    WHERE sectionName == "DAILY ACCOUNT"
          AND isStartSection == 0
          AND isEndSection == 0;
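
For example, if you want to keep the "START OF ..."/"END OF ..." marker rows for a section, just drop the two flag filters - something like this (an untested variant of the same pattern):

@ledgerBalance =
    SELECT rawData
    FROM @output
    WHERE sectionName == "LEDGER BALANCE";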

Give it a try and let me know how you get on.

1 vote

The relevant questions to ask:

  1. In a distributed processing system, will all input data (which could be TBs) be processed by a single extractor instance?
    • Definitely not! For confirmation, see the EXTRACT documentation (msdn.microsoft.com/en-us/library/azure/mt621320.aspx).
  2. Given multiple extractor instances, where can the data get split? Put another way: generically, what determines the unit of atomicity of data in U-SQL? And specifically for your case, what guarantee do you have that entire START...END sequences will be processed by one instance and not split in the middle?
    • The Data Lake Tools documentation suggests the generic unit of data atomicity is the "line" (for row-structured files) - and that this is a property of the data upload itself.
    • Per the U-SQL Programmability guide, [SqlUserDefinedExtractor(AtomicFileProcessing = true)] ensures the entire input is processed sequentially by a single instance, which is sufficient, and may be feasible in this case depending on the input size.
  3. Do Rowsets have order?

    • No! Rowsets are unordered concepts - think of them as non-deduping HashSets.

      var input = new HashSet<string>(File.ReadLines(@"In_Data")); File.WriteAllLines(@"Out_NewData", input);

      isn't expected to preserve the original line order (even if it does for some inputs, that is an implementation detail, not guaranteed semantic behaviour).
      Ditto for rowsets - the input order of lines is lost (not guaranteed) the moment the data is translated into a rowset. Attempting to use ROW_NUMBER() is therefore fruitless - there is no order left to preserve by the time ROW_NUMBER() can be invoked. The only way to use ROW_NUMBER() would be if the rowset had some key whose sort order could recreate the original order of the lines.

Because rowsets do not have order, you need a custom extractor no matter what - it is the only part of the script able to observe the order of lines in the file, provided that either:

  • it uses AtomicFileProcessing, or
  • you figure out a way to guarantee that data splits do not occur in the middle of START...END sequences. AFAIK there is no way to do this (short of preprocessing entire sequences into single lines pre-upload).

You can choose to include all your logic in the custom extractor, or simply have it add a line-number column to mimic ROW_NUMBER() and use native U-SQL for the rest of the logic; a sketch of the latter approach follows.
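
Here is a minimal sketch of such a line-numbering extractor (the class name, column names and namespace are illustrative assumptions, not tested code):

using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;
using System.IO;

// Process the whole file on a single vertex so that line order is observable
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class LineNumberExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        long lineNumber = 0;
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Emit each line together with its position in the file
                output.Set<long>("lineNumber", ++lineNumber);
                output.Set<string>("rawData", line);
                yield return output.AsReadOnly();
            }
        }
    }
}

After registering the assembly, the U-SQL side then has a reliable ordering key to drive the section-splitting logic:

@input =
    EXTRACT lineNumber long,
            rawData string
    FROM "/input/input36.txt"
    USING new YourNamespace.LineNumberExtractor();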