1
votes

I have lots of json files in my Azure Data Lake account. They are organized as: Archive -> Folder 1 -> JSON Files.

What I want to do is extract a particular field: timestamp from each json and then then just put it in a csv file.

My issue is:

I started with this script:

CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM "correct_path/Assemblies/JSON/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM "correct_path/Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @INPUT_FILE string = @"correct_path/Tracking_3e9.json";


//Extract the different properties from the Json file using a JsonExtractor
@json =
    EXTRACT Partition string, Custom string
    FROM @INPUT_FILE
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();


OUTPUT @json
TO "correct_path/Output/simple.csv"
USING Outputters.Csv(quoting : false);
  1. I get error:

    E_STORE_USER_FILENOTFOUND: File not found or access denied

But I do have access to the file in the data explorer of the Azure Data Lake, so how can it be?

  1. I don't want to run it for each file one by one. I just want to give it all the files in a folder (like Tracking*.json) or just a bunch of folders (like Folder*) and it should go through them and put the output for each file in a single row of the output csv.

Haven't found any tutorials on this.

  1. Right now, I am reading the entire json, how to read just one field like time stamp which is a field within a particular field, like data : {timestamp:"xxx"}?

Thanks for your help.

1

1 Answers

2
votes

1) Not sure why you're running into that error without more information - are you specifically missing the input file or is it the assemblies?

2) You can use a fileset to extract data from a set of files. Just use {} to denote the wildcard character in your input string, and then save that character in a new column. So for example, your input string could be @"correct_path/{day}/{hour}/{id}.json", and then your extract statement becomes:

    EXTRACT
        column1 string,
        column2 string,
        day     int,
        hour    int,
        id      int
   FROM @input

3) You'll have to read the entire JSON in your SELECT statement, but you can refine it down to only the rows you want in future rowsets. For example:

    @refine=
        SELECT timestamp FROM @json;

     OUTPUT @refine
     ...

It sounds like some of your JSON data is nested however (like the timestamp field). You can find information on our GitHub (Using the JSON UDFs) and in this blog for how to read nested JSON data.
Hope this helps, and please let me know if you have additional questions!