1 vote

I have a CSV file which I am trying to process using Azure Data Lake Analytics U-SQL. I am fairly new to U-SQL, so please bear with me. The original file is semi-structured, which I managed to fix using the silent:true flag. Now that it is more structured, I would like to fill the empty cells with the data in the cells above them.

My data looks like this: [screenshot: CSV with empty cells]

My problem lies with the empty cells in the first four columns.

The second row has data which I would like to copy down into the empty cells below it (rows 3-5). The data from row 7 needs to be copied down to row 8, the data from row 9 to be copied down to rows 10-13 and the data from row 14 to be copied to rows 15-18.

This has to be done without changing the values in the 'Amount claimed' column.

Does anyone have any ideas on how to achieve this in U-SQL?

Thank you.

Thanks everyone for the guidance. I am not a developer, so it may take a while to figure it out, but I sincerely appreciate all the help. – Ally

4 Answers

3 votes

U-SQL is primarily a language for processing large, order-agnostic data, so this problem is not a good fit for it:

  • There is no row-ordering criterion in the input data.
  • Recreating the input row order within U-SQL can only be done by restricting extraction to a single vertex, which limits the input data size and is somewhat at odds with using a parallel large-data processing language.

Rowsets - the fundamental U-SQL building blocks - are unordered logical data containers. Thus the order of lines in the original input is lost the moment you read it into a rowset; you have to recreate the order within U-SQL using some ordering key.

Assuming there is such an ordering key,

@data =
    SELECT A,
           LAST_VALUE(Col == "" ? null : Col) OVER (ORDER BY Key ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Col
    FROM @input;

should do it, since LAST_VALUE should ignore nulls. Note: the U-SQL documentation doesn't actually specify whether nulls are ignored; they should be, per general aggregate/windowing-function conventions, but this needs to be verified.

Your data doesn't have an ordering column; to create one, you would need to:

  1. Ensure all data is processed by 1 vertex - [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
  2. Add an ordering column within a custom Extractor.

This may be too complicated for an amount of data that you could just process locally before uploading.
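For local preprocessing, the fill-down can be a few lines of Python. This is a hedged sketch, not part of the original answer: the `forward_fill` helper, the sample data, and the column count are all illustrative; in practice you would read the rows with `csv.reader` and write them back out with `csv.writer`.

```python
def forward_fill(rows, n_cols=4):
    """Copy the last non-empty value down into empty cells for the
    first n_cols columns; all other columns are left untouched."""
    last = [""] * n_cols
    filled = []
    for row in rows:
        row = list(row)
        for i in range(min(n_cols, len(row))):
            if row[i].strip() == "":
                row[i] = last[i]        # fill from the row above
            else:
                last[i] = row[i]        # remember the new value
        filled.append(row)
    return filled

# Illustrative data: the third column (think 'Amount claimed') is never modified.
data = [["A", "x", "10"],
        ["",  "",  "20"],
        ["B", "y", "30"],
        ["",  "",  "40"]]
print(forward_fill(data, n_cols=2))
# [['A', 'x', '10'], ['A', 'x', '20'], ['B', 'y', '30'], ['B', 'y', '40']]
```

This sidesteps the ordering problem entirely, because a local file read preserves line order.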

2 votes

The LAG analytic function provides access to a row at a given physical offset that comes before the current row. Use this analytic function in a SELECT expression to compare values in the current row with values in a previous row.

https://msdn.microsoft.com/en-us/library/azure/mt791650.aspx
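One caveat worth knowing: LAG uses a fixed offset, so `LAG(col, 1)` only reaches one row back and fills only single-row gaps; a run of consecutive empty rows needs a running "last seen" value instead. A small Python sketch of the difference (function names and data are illustrative, not from the answer):

```python
def lag_fill_once(values):
    """Mimics COALESCE(col, LAG(col, 1)): each empty cell looks at the
    ORIGINAL previous row only, so consecutive gaps stay empty."""
    return [v if v is not None else (values[i - 1] if i > 0 else None)
            for i, v in enumerate(values)]

def last_seen_fill(values):
    """Carries the last non-empty value forward, filling runs of any length."""
    last, out = None, []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

print(lag_fill_once(["a", None, None, "b"]))   # ['a', 'a', None, 'b']
print(last_seen_fill(["a", None, None, "b"]))  # ['a', 'a', 'a', 'b']
```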

2 votes

An alternative approach (LAST_VALUE didn't work for me):

If you have a row number or timestamp field, then there is no problem:

@tb1 =
    SELECT *
    FROM (VALUES
             (1, "Noah1"),
             (2, (string) null),
             (3, "Noah3"),
             (5, (string) null),
             (6, (string) null),
             (7, "Noah6"),
             (8, "Noah7")
         ) AS T(Timestamp, a);


@tb1 =
    SELECT Timestamp,
           [a],
           [a] != null && [a] != LEAD([a], 1) OVER(ORDER BY Timestamp ASC) AS aSwitch
    FROM @tb1;

@tb1 =
    SELECT Timestamp,
           [a],
           SUM(aSwitch ? 1 : 0) OVER(ORDER BY Timestamp ASC ROWS UNBOUNDED PRECEDING) AS aGrp
    FROM @tb1;

@tb1 =
    SELECT Timestamp,
           FIRST_VALUE([a]) OVER(PARTITION BY aGrp ORDER BY Timestamp ASC) AS aFilled
    FROM @tb1;

OUTPUT @tb1 TO "/test.csv" USING Outputters.Csv(outputHeader: true);

Result:

"Timestamp","aFilled"
1,"Noah1"
2,"Noah1"
3,"Noah3"
5,"Noah3"
6,"Noah3"
7,"Noah6"
8,"Noah7"
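The three windowing steps above (flag the last row of each value run, turn the flags into group ids with a running sum, then take the first value per group) can be mimicked in plain Python to sanity-check the logic. This is only an illustration of the algorithm, not U-SQL:

```python
def fill_by_groups(rows):
    """rows: ordered (timestamp, value) pairs; value may be None."""
    values = [v for _, v in rows]
    n = len(values)
    # Step 1 (aSwitch): value is non-null and differs from the next row's.
    switch = [values[i] is not None
              and values[i] != (values[i + 1] if i + 1 < n else None)
              for i in range(n)]
    # Step 2 (aGrp): running sum of the flags assigns a group id.
    group, groups = 0, []
    for s in switch:
        group += 1 if s else 0
        groups.append(group)
    # Step 3 (FIRST_VALUE per group): every row takes its group's first value.
    first, out = {}, []
    for (t, v), g in zip(rows, groups):
        first.setdefault(g, v)
        out.append((t, first[g]))
    return out

rows = [(1, "Noah1"), (2, None), (3, "Noah3"), (5, None),
        (6, None), (7, "Noah6"), (8, "Noah7")]
print(fill_by_groups(rows))
# [(1, 'Noah1'), (2, 'Noah1'), (3, 'Noah3'), (5, 'Noah3'),
#  (6, 'Noah3'), (7, 'Noah6'), (8, 'Noah7')]
```

The printed result matches the CSV output shown above.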

But what to do if you don't have such a field? In simple cases, you could use a dummy field:

@tb1 =
    SELECT *
    FROM (VALUES
             ("Noah1"),
             ((string) null),
             ("Noah3"),
             ((string) null),
             ((string) null),
             ("Noah6"),
             ("Noah7")
         ) AS T(a);

@tb1 =
    SELECT 1 AS Timestamp,
           [a]
    FROM @tb1;

0 votes

I think you can solve this by using U-SQL user-defined operators (UDOs). In a UDO you can iterate row by row and, whenever you get a row with empty values, copy the data from the previous row.