1
votes

I've tried different methods to extract data from my JSON files and convert to CSV in U-SQL but they all seem to generate empty files or just output the header row.

I was previously trying JSON tuples but since that generated empty files I am now trying to use the MultiLevelJsonExtractor.

My JSON files are structured like this:

{
   "responseHeader":{
      "status":0,
      "QTime":25,
      "params":{
         "q":"query",
         "rows":"7000",
         "wt":"json"
      }
   },
   "response":{
      "docs":[
         {
            "uri":"www.google.com",
            "date_dt":"2017-05-30T23:00:00Z",
            "title":"Google"
         },
         {
            "uri":"www.yahoo.com",
            "date_dt":"2017-03-30T23:00:00Z",
            "title":"Yahoo"
         }
      ]
   }
}

My current code:

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

DECLARE @INPUT_FILE string = @"{*}.json";
DECLARE @OUTPUT_FILE string = @"/Output/Output.csv";

@json =
EXTRACT
    uri string,
    date_dt string,
    title string,
FROM
    @INPUT_FILE
USING new MultiLevelJsonExtractor("docs[*]", true, "uri", "date_dt", "title");

OUTPUT @json
TO @OUTPUT_FILE
USING Outputters.Csv(outputHeader:true,quoting:true);

This just outputs the header row. Is there an issue with the way I am referencing multiple files? e.g {*}.json?

I'm just trying to pull out the three fields within the docs nodeset from each JSON file and output to CSV.

1

1 Answers

0
votes

How exactly do you want the .csv to look on output? I just changed the json path to response.docs[*] and got two rows for your sample json, ie

@json =
    EXTRACT uri string,
            date_dt string,
            title string
    FROM @INPUT_FILE
    USING new MultiLevelJsonExtractor("response.docs[*]", true, "uri", "date_dt", "title");

My results:

My results