
I am a newbie to ADL & JSON files. I am trying to load a JSON file into an ADL table.

My JSON file structure is:

{ABCD:{Time:"", Date:"", ProcessingTime:"", ProcessName:""}},
{ABCD:{Date:"", ProcessingTime:"", ProcessName:""}},
{ABCD:{ProcessingTime:"", ProcessName:""}},
{ABCD:{Time:"", Date:"", ProcessingTime:"", ProcessName:""}},

My table has all 4 columns (Time, Date, ProcessingTime, and ProcessName).

First, I tried writing it to a CSV file using U-SQL statements before writing it into a table, but the generated CSV output contained only blank records.

Any help is appreciated. Can I do this through ADF as well? I would like to have this as a scheduled job.

Below is the U-SQL code I used to write the CSV file.

CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM 
"C:/Test/Assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM 
"C:/ADL/Assemblies/Microsoft.Analytics.Samples.Formats.dll";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

DECLARE @path string = @"C:\Test\";
DECLARE @input string = @path + @"sample_data1.json";
DECLARE @to string = @path + @"output.csv";

@jsonFile =
    EXTRACT Time string,
            Date string,
            ProcessingTime string,
            ProcessName string
    FROM @input
    USING new JsonExtractor();

OUTPUT @jsonFile
TO @to
USING Outputters.Csv();

Cheers!

2 Answers

That file does not contain a valid JSON document; it looks like one JSON object per line. ADL can handle files with one JSON object per line, but each object must sit on its own line without any additional separators, so you should remove the trailing , from each line. Like this:

{"ABCD":{"Time":"", "Date":"", "ProcessingTime":"", "ProcessName":""}}
{"ABCD":{"Date":"", "ProcessingTime":"", "ProcessName":""}}
{"ABCD":{"ProcessingTime":"", "ProcessName":""}}
{"ABCD":{"Time":"", "Date":"", "ProcessingTime":"", "ProcessName":""}}

You then cannot use the JsonExtractor directly; instead, use the built-in text extractor to read each line as a raw string, and the JsonTuple method to parse it as JSON:

CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM 
"C:/Test/Assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM 
"C:/ADL/Assemblies/Microsoft.Analytics.Samples.Formats.dll";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

DECLARE @path string = @"C:\Test\";
DECLARE @input string = @path + @"sample_data1.json";
DECLARE @to string = @path + @"output.csv";

// Read each line into a single column; the '\b' delimiter does not
// occur in the data, so the whole line lands in one string.
@RawExtract =
    EXTRACT [RawString] string
    FROM @input
    USING Extractors.Text(delimiter : '\b', quoting : false);

// Parse each raw line into a JSON tuple.
@ParsedJSONLines =
    SELECT JsonFunctions.JsonTuple([RawString]) AS JSONLine
    FROM @RawExtract;

// Parse the nested "ABCD" object.
@jsonObjects =
    SELECT JsonFunctions.JsonTuple(JSONLine["ABCD"]) AS Abcd
    FROM @ParsedJSONLines;

@result =
    SELECT Abcd["Time"] AS Time,
           Abcd["Date"] AS Date,
           Abcd["ProcessingTime"] AS ProcessingTime,
           Abcd["ProcessName"] AS ProcessName
    FROM @jsonObjects;

OUTPUT @result
TO @to
USING Outputters.Csv();
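
If the end goal is a managed U-SQL table rather than a CSV file, the same rowset can also be inserted directly. This is only a sketch; the database, schema, and table names (MyDb, dbo, ProcessLog) are hypothetical placeholders:

CREATE DATABASE IF NOT EXISTS MyDb;

// Hypothetical table matching the four extracted columns;
// U-SQL tables require a clustered index and a distribution scheme.
CREATE TABLE IF NOT EXISTS MyDb.dbo.ProcessLog
(
    Time string,
    Date string,
    ProcessingTime string,
    ProcessName string,
    INDEX idx CLUSTERED (ProcessName ASC) DISTRIBUTED BY HASH (ProcessName)
);

INSERT INTO MyDb.dbo.ProcessLog
SELECT * FROM @result;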

Based on your additional clarification in the comments to Peter's reply:

First, you cannot use U-SQL directly to insert data into Azure Table storage. You would have to use Azure Data Factory (ADF) to move the cleaned/transformed files from ADLS to Azure Table.
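
For illustration, an ADF (v1) copy pipeline for that movement could look roughly like the sketch below; the dataset names and the partition-key column are hypothetical assumptions, not values from your setup:

{
    "name": "CopyAdlsToAzureTable",
    "properties": {
        "activities": [
            {
                "name": "AdlsCsvToAzureTable",
                "type": "Copy",
                "inputs": [ { "name": "AdlsCleanedOutput" } ],
                "outputs": [ { "name": "AzureTableTarget" } ],
                "typeProperties": {
                    "source": { "type": "AzureDataLakeStoreSource" },
                    "sink": {
                        "type": "AzureTableSink",
                        "azureTablePartitionKeyName": "ProcessName"
                    }
                },
                "scheduler": { "frequency": "Day", "interval": 1 }
            }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2018-01-01T00:00:00Z"
    }
}

ADF v1 also has a DataLakeAnalyticsU-SQL activity type that can run the U-SQL script itself on the same schedule, which would cover the scheduled-job part of your question.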

The problem I see with the , above is that the JSON documents themselves also contain , to separate their properties, so simple approaches such as using , as a row or column delimiter would fail. What you can do is write something like the following (replacing the EXTRACT in the above script):

@RawExtract =
    EXTRACT [RawString] string
    FROM @input
    USING Extractors.Text(delimiter : '\b', quoting : false);

// Drop the trailing comma from each row.
@RawExtract =
    SELECT RawString.TrimEnd(',') AS RawString
    FROM @RawExtract;

to drop the last character of the row (assuming it is a ,; alternatively, you could write some other C# expression that finds the position of the last comma and uses String.Substring instead of String.TrimEnd). This assumes that each JSON document fits on a single row and within the 128 kB limit of the string data type.
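
For reference, a minimal sketch of that Substring variant, assuming each row either ends with a comma or should pass through unchanged:

// Remove the final comma when present; leave other rows as-is.
@RawExtract =
    SELECT RawString.EndsWith(",")
               ? RawString.Substring(0, RawString.LastIndexOf(','))
               : RawString
           AS RawString
    FROM @RawExtract;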

Alternatively, you would have to write a custom extractor that understands your file format completely and operates on input.BaseStream, with the extractor UDO property AtomicFileProcessing set to true. There are some example extractors available at the GitHub site linked from http://usql.io that may help with that, but I would suggest trying the above suggestion first.
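
For orientation, a skeleton of such a custom extractor might look like this; it is only a sketch, the class name is made up, and the line-by-line loop is a placeholder for logic that really understands the format:

using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

// AtomicFileProcessing = true makes U-SQL hand the whole file to one
// vertex, so documents are never split across extent boundaries.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MultiLineJsonExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Placeholder: emit each line minus the trailing comma as raw
                // JSON; downstream U-SQL can parse it with JsonFunctions.JsonTuple.
                output.Set<string>(0, line.TrimEnd(','));
                yield return output.AsReadOnly();
            }
        }
    }
}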

Cheers, Michael

PS: You can have all-caps identifiers in U-SQL, but you need to quote them, e.g., AS [ABCD].