1
votes

i want to group data set based on unique values in column and save them to multiple file.

My problem is same as which is already described here at link: U-SQL Output in Azure Data Lake

As i am new to USQL language , i am unable to implement the second step from answer. I am unable to figure out how to write usql query to run generated usql script from the first part of answer

1
You can't write U-SQL that runs other U-SQL, ie dynamic U-SQL is not supported. What you do is 1) open the generated script and run it yourself or 2) run it via Data Factory 3) run it via Powershell etcwBob

1 Answers

1
votes

If the number of groups is known in advance, you could write a USQL stored procedure that would take as parameter 1) the value of the group 2) the name of the file.

In the pseudo-code below, the name of the final file is driven by the underlying value of the group. The data to be split is sourced from a USQL table (referred in the pseudo-code as ).

DROP PROCEDURE IF EXISTS splitByGroups;
CREATE PROCEDURE splitByGroups(@groupValue string, @file_name_prefix string = "extract")
AS
BEGIN
     DECLARE @OUTPUT string = "/output/" + file_name_prefix + "_"+ @groupValue + ".csv";

      OUTPUT (
        SELECT *
        FROM <MyTable>
        WHERE <MyGroup> == @groupValue
      )
      TO @OUTPUT
      USING Outputters.Csv(outputHeader : true);
END;

You would then execute the stored procedure as many times as you have groups:

splitByGroups("group1", DEFAULT); 
splitByGroups("group1", DEFAULT); 

Alternatively, if you wish to analyse the multiple files offline, I would download the full file and use the shell (PowerShell or Linux Shell) to split the file.