What's causing JSON deserialization in USQL to hit a string limit?

Question

I have many (500+) JSON files which I'm processing in ADLA using USQL and the first thing I do is extract extract the data from each one using Microsoft.Analytics.Samples.Formats.Json JsonExtractor. Most (80%?) of the files go fine, including the largest ones, but some fail and I don't know why. Here's a minimal example of the code that fails:

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @input string="adl://abc.azuredatalakestore.net/data/whatever.json";

DECLARE @out string="adl://out.csv";

USING Microsoft.Analytics.Samples.Formats.Json;

@data =
EXTRACT SourceUrl string,
        Title string,
        Guest string,
        PublishDate DateTime,
        TranscriptionSections string,
        Categories string,
        filename string
FROM @input
USING new JsonExtractor();

OUTPUT @data
TO @out 
USING Outputters.Tsv(outputHeader : true);

Here's the error from Azure:

**Inner Error:**
ERROR
E_RUNTIME_USER_STRINGTOOBIG

MESSAGE
String size 132991 exceeds the maximum allowed size of 131072.

**Outer Error:**
DESCRIPTION
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract_Partition[0] with error: Vertex user code error.
RESOLUTION
DETAILS

Vertex SV1_Extract_Partition[0].v1 {8F874C31-C803-4C9A-9C3F-B594A62D7EAC} failed 

Error:
Vertex user code error

exitcode=CsExitCode_StillActive Errorsnippet=
ERROR
VertexFailedFast
MESSAGE
Vertex failed with a fail-fast error

Here's an example of the file I'm using:

{
"SourceUrl":"http://www.unittest.org/test.html",
"Title":"Unit Test File",
"Guest":"Unit Test Guest",
"PublishDate":"2017-05-15T00:00:00",
"TranscriptionSections":[  
    {  
    "SectionStartTime":"00:00:03",
    "Sentences":[  
        {  
        "Text":"Intro."
        },
        {  
        "Text":"Sentence one"
        },
        {  
        "Text":"Sentence two"
        }
    ]
},
{  
    "SectionStartTime":"00:04:46",
    "Sentences":[  
        {  
        "Text":"Sentence three"
        },
        {  
        "Text":"Sentence four"
        }
    ]
}
],
"Categories":null
}

After I do this first extract, I run another USQL statement to deserialize the TranscriptionSections string into more rows. Perhaps that's the error and there's a way to completely flatten the JSON file in one statement

hiren patel hiren patel · Accepted Answer · 2017-06-01T01:25:22

The error you are hitting is that one of the string column is exceeding max allowed string size. The allowed Max string size allowed is 128KB. At this point you have two alternatives 1. either write your own json extractor which enable to splits the column into two columns 2. use the byte[] datatype.

What's causing JSON deserialization in USQL to hit a string limit?

1 Answers