azure data lake u-sql pivot

votes

I am loving Azure Data Lake but lack of documentation will probably slow down the adoption. I hope somebody out there have more experinnce on U-SQL than I do.

Trying to derive from what's available under Microsoft.Analytics.Interfaces and via U-SQL interpreter with not much luck. Dynamic sql does not seem to be supported to define the schema of a row set at run time and IUpdatableRow's schema is readonly so Processor approach is not viable. And there is no out of the box PIVOT capability in U-SQL.

I also thought that maybe I can process the rowset all together and write a custom outputter to pivot but couldn't figure it out.

There is probably a really easy way to do this as it is a standard pivot operation. How would you go about reshaping a rowset from I to II for an indeterminate number of ColA and ColB values in a performant way?

I

|ColA |ColB |ColC|
|1    |A    |30  |
|1    |B    |70  |
|1    |ZA   |12  |
|2    |C    |22  |
|2    |A    |13  |

II

|ID   |A    |B    |C   |...... |ZA   |.....
|1    |30   |70   |0   |       |12   |
|2    |13   |0    |22  |...... |0    |.....

azure-data-lakeu-sql

3 Answers

votes

Note PIVOT / UNPIVOT syntax has been added to U-SQL as of March 2017.

Using the above sample data:

@t = SELECT *
     FROM(
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42)
     ) AS T(ColA, ColB, ColC);


@p =
    SELECT Column_0 AS id, Column_1 AS a
    FROM @t
      PIVOT (MAX(ColC) FOR ColB IN ("A" AS [A], "B" AS [B], "C" AS [C], "ZA" AS [ZA], "ABC" AS [ABC])
           ) AS pvt;


OUTPUT @p
TO "/output/pivot3.csv"
USING Outputters.Csv();

votes

You have several options for doing such a PIVOT.

Here is one that uses the U-SQL MAP data type (called SQL.MAP). Instead of 0 it will return null for missing values (use a null coalesce expression to turn it into 0) This will work under the following conditions:

The generated MAP stays within the row size limit of 4MB. If not, see the next solution.
You know ahead of time, what columns you have (if not, just keep the data in the map column and extract as needed).

Solution with map:

@t = SELECT *
     FROM(
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42)
     ) AS T(ColA, ColB, ColC);

@m = SELECT ColA AS [ID],
            MAP_AGG(ColB, (int?) ColC) AS m
     FROM @t
     GROUP BY ColA;

@r =
    SELECT [ID],
           m["A"]AS A,
           m["B"]AS B,
           m["C"]AS C,
           m["ZA"]AS [ZA],
           m["ABC"]AS [ABC]
    FROM @m;

OUTPUT @r
TO "/output/pivot1.csv"
USING Outputters.Csv();

And here is a solution that does use the standard SQL pivot work-around pattern (Some SQL database implementations actually used to translate the PIVOT expression into such an expression internally, and may still do it). Again, you will have to know all columns ahead of time. If that is not the case, just use the MAP datatype.

@t =
    SELECT *
    FROM(
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42)
    ) AS T(ColA, ColB, ColC);

@r =
    SELECT ColA AS [ID],
           (ColB == "A") ? ColC : 0 AS A,
           (ColB == "B") ? ColC : 0 AS B,
           (ColB == "C") ? ColC : 0 AS C,
           (ColB == "ZA") ? ColC : 0 AS [ZA],
           (ColB == "ABC") ? ColC : 0 AS [ABC]
    FROM @t;

@r =
    SELECT DISTINCT [ID],
           LAST_VALUE(A) OVER(PARTITION BY [ID] ORDER BY A) AS A,
           LAST_VALUE(B) OVER(PARTITION BY [ID] ORDER BY B) AS B,
           LAST_VALUE(C) OVER(PARTITION BY [ID] ORDER BY C) AS C,
           LAST_VALUE([ZA]) OVER(PARTITION BY [ID] ORDER BY [ZA]) AS [ZA],
           LAST_VALUE([ABC]) OVER(PARTITION BY [ID] ORDER BY [ABC]) AS [ABC]
    FROM @r;

OUTPUT @r
TO "/output/pivot2.csv"
USING Outputters.Csv();

votes

Here is one workaround that my team member come up with for a scenario where we don't know a number of columns ahead of time.

@t = SELECT *
     FROM(
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42)
     ) AS T(ColA, ColB, ColC);

@t1 =
    SELECT DISTINCT ColB
    FROM @t
ORDER BY ColB DESC
OFFSET 0 ROW;

@t1 =
    SELECT ARRAY_AGG(ColB) AS ColBArray
    FROM @t1;

@result =
    SELECT ColA,
           MAP_AGG(ColB, (int?) ColC) AS ColCMap
    FROM @t
    GROUP BY ColA;

@result =
    SELECT a.ColA,
           DPivotNS.DPivot.FillGapsAndConvert(a.ColCMap, b.ColBArray) AS Values
    FROM @result AS a
         CROSS JOIN
             @t1 AS b;

@result =
    SELECT ColA,
           ArrayColumn
    FROM
    (
    SELECT 0 AS ColA,
           ColBArray AS ArrayColumn,
           0 AS Ord
    FROM @t1
    UNION ALL
    SELECT ColA AS ColA,
           Values AS ArrayColumn,
           1 AS Ord
    FROM @result
    ) AS rs1
ORDER BY rs1.Ord
OFFSET 0 ROWS;

@result =
    SELECT ColA,
           String.Join(",", ArrayColumn) AS Values
    FROM @result;


OUTPUT @result
TO "result.csv"
USING Outputters.Csv(quoting:false);

Here is UDF for above script:

    public static SqlArray<string> FillGapsAndConvert (SqlMap<string, int?> ColCMap, SqlArray<string> ColDArray)
        {
        var list = new LinkedList<string> ();
        foreach ( string colD in ColDArray )
            {
            int? currentCount = ColCMap[colD];
            int newCount = currentCount.HasValue ? currentCount.Value : 0;
            list.AddLast (newCount.ToString ());
            }
        return new SqlArray<string> (list);
        }