
I am stuck on an Azure Data Lake Analytics challenge, and am looking for assistance.

Issue

I need to load a file from Azure Data Lake that contains about 150 million rows of JSON data, stored as a single array of JSON objects. The file is ~110 GB in size. When I try to load the file, ADLA appears to hang, showing "0%" progress for hours with no status updates to algebra.xml or the other job files.

Searching the online documentation, it appears that ADLA is limited to a single vertex when extracting a file from storage. This seems odd, but the job graph does consistently show 1 vertex for input nodes. I tried a 32 AU job just in case, and it made no difference. I also let a 1 AU job run for about 3.5 hours, and it never progressed.

Question

Do you know how to get ADLA to read very large UTF-8 JSON array files with millions of member objects?

If so, how?

And it would be wonderful if the answer uses U-SQL without the need for code-behind. I can write code-behind, but the rest of my team can't, and I'm evaluating ADLA's out-of-the-box capabilities right now.

Many Thanks,

J

P.S. Additional Details

I need to use the array-of-JSON format because of JsonExtractor, as far as I know (please let me know if that is not the case).

I need to use UTF-8 and JSON because my input data contains embedded delimiters, largely from free-text entry fields that online folks "love" to play around with to give us data folks extra things to do :)

Below is the input file format, with special characters removed, since the file is UTF-8:

[{"key":"car"},{"key":"bar"},....,{"key":"yar"}]

By load, I mean use the EXTRACT statement with the open-source sample assembly that provides JsonExtractor. Please note that JsonExtractor currently expects an array of JSON objects (at least it did the last time I looked at the source code).

Below is a code snippet demonstrating the EXTRACT approach.

// The below assumes the assembly is loaded to a database named [SOMEDB]

USE DATABASE [SOMEDB];

REFERENCE ASSEMBLY [SOMEDB].[Newtonsoft.Json]; 
REFERENCE ASSEMBLY [SOMEDB].[Microsoft.Analytics.Samples.Formats];  

USING Microsoft.Analytics.Samples.Formats.Json; 

@input_dataset = EXTRACT
                   [ThereAreManyMoreColumnsThanThis]  decimal?
                  ,[HundredsOfColumns]                string
                  ,[YouGetTheIdea]                    DateTime
  FROM '/a/file/path/where/file/resides.json' 
  USING new JsonExtractor();

// ... other stuff

1 Answer


Given the size of your JSON document, the example extractor is not designed for your use case. Here are some suggestions:

  1. Change the input format from a single array to a sequence of line-delimited JSON documents (one document per line). Assuming each line/document is less than 4 MB (the U-SQL row size limit), the extraction can be parallelized; see the first sketch after this list.

  2. Split your JSON array into smaller arrays spread across many files.

  3. Write your own custom extractor. Assuming you can identify each top-level element in your array, you could possibly even implement a parallel extractor (although that is an advanced topic, since you need to handle the first and last parts of the file specially and handle the case where the end of an element does not align with an extent boundary, similar to what the built-in text extractors do). In any case, make sure you use your JSON parser's reader interface instead of loading the whole document into memory; see the second sketch below.
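For suggestion 1, here is a minimal U-SQL sketch of the line-delimited pattern, assuming the same sample assembly and a hypothetical pre-converted input file (one JSON document per line). It uses the sample assembly's JsonFunctions.JsonTuple helper rather than JsonExtractor, so each line is parsed independently:

REFERENCE ASSEMBLY [SOMEDB].[Newtonsoft.Json];
REFERENCE ASSEMBLY [SOMEDB].[Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

// Read each line whole into a single string column. The '\b' column
// delimiter is a character that should never occur in the data.
@lines =
    EXTRACT jsonLine string
    FROM '/a/file/path/where/file_lines.json'
    USING Extractors.Text(delimiter : '\b', quoting : false);

// JsonTuple parses one document per row and returns a SqlMap of its
// top-level properties as strings.
@parsed =
    SELECT JsonFunctions.JsonTuple(jsonLine) AS doc
    FROM @lines;

@result =
    SELECT doc["key"] AS [key]
    FROM @parsed;

Because each row is self-contained, ADLA can split the file at extent boundaries, which removes the single-vertex bottleneck you are seeing.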
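For suggestion 3, here is a minimal C# sketch of such a streaming custom extractor (this is the code-behind/assembly route you were hoping to avoid, so treat it as a fallback; the class name and the single "key" column are illustrative). With AtomicFileProcessing = true it still runs on one vertex, but it drives Newtonsoft's JsonTextReader, so the 110 GB document is never materialized in memory. A parallel version would set AtomicFileProcessing = false and add the extent-boundary handling described above.

using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Streams one top-level JSON array, yielding a row per member object.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class StreamingJsonArrayExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var textReader = new StreamReader(input.BaseStream, Encoding.UTF8))
        using (var jsonReader = new JsonTextReader(textReader))
        {
            while (jsonReader.Read())
            {
                // Each StartObject token is one member of the outer array;
                // JObject.Load consumes exactly that object and no more.
                if (jsonReader.TokenType == JsonToken.StartObject)
                {
                    JObject obj = JObject.Load(jsonReader);
                    output.Set("key", (string)obj["key"]);
                    yield return output.AsReadOnly();
                }
            }
        }
    }
}

Once the assembly is registered, it drops into the script above as USING new StreamingJsonArrayExtractor(); in place of new JsonExtractor();.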