I am stuck on an Azure Data Lake Analytics challenge, and am looking for assistance.
Issue
I need to load a file from Azure Data Lake that contains about 150m rows of JSON data, stored as an Array of JSON Objects. The file is ~110G in size. When I try to load the file, ADLA appears to hang, giving "0%" progress for hours and no status updates to the algebra.xml or other files.
Searching through online documentation, it appears that ADLA is limited to "a single vertex" for extracting files from storage. This seems odd, but the job graph does consistently say 1 vertex for input nodes. I tried a 32 AU job just in case and that did nothing. I also let a 1 AU job run for about 3.5 hours, and that also did nothing (i.e. never progressed).
Question
Do you know how to enable ADLA to read UTF-8 JSON Array files of large size with millions of member objects?
If so, how?
And, it would be wonderful if the answer uses U-SQL without need for code-behinds. I can do that, but the rest of my team can't and I'm evaluating ADLA "out-of-the-box" capabilities right now.
Many Thanks,
J
ps. Additional Details
I need to use the Array of JSON format due to JsonExtractor, as far as I know (plmk if that is not the case).
I need to use UTF-8 and JSON since my input data contains delimiters, largely due to human entry text fields that online folks "love" to play around with to give us data folks extra things to do :)
The below is the input file format, with special characters removed, since the file is UTF-8 format.
[{'key':'car'},{'key':'bar'},....,{'key':'yar'}]
By load, I mean use the EXTRACT statement with the open source assembly that provides JSONExtractor(). Please Note that the JsonExtractor currently expects an Array of JSON objects (at least it did the last time I looked at the source code).
The below is a code snippet to demonstrate the extract method.
#...The below assumes the assembly is loaded to a database named [SOMEDB]
USE DATABASE [SOMEDB];
REFERENCE ASSEMBLY [SOMEDB].[Newtonsoft.Json];
REFERENCE ASSEMBLY [SOMEDB].[Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@input_dataset = EXTRACT
[ThereAreManyMoreColumnsThanThis] decimal?
,[HundredsOfColumns] string
,[YouGetTheIdea] DateTime
FROM '/a/file/path/where/file/resides.json'
USING new JsonExtractor()
#...Other stuff