3 votes

Scenario

I have a source file that contains blocks of JSON on each new line.

I then have a simple U-SQL extract as follows, where [RawString] represents each new line in the file and [FileName] is derived from the @SourceFile path.

@BaseExtract = 
    EXTRACT 
        [RawString] string, 
        [FileName] string
    FROM
        @SourceFile 
    USING 
        Extractors.Text(delimiter:'\b', quoting : false);

This executes without failure for the majority of my data, and I'm able to parse the [RawString] as JSON further down in my script without any problems.
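That downstream parsing looks roughly like this (simplified; it uses the JsonFunctions helper from the Microsoft.Analytics.Samples.Formats sample assembly, and the property name is just a placeholder):

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@ParsedData =
    SELECT
        JsonFunctions.JsonTuple([RawString]) AS JsonMap,
        [FileName]
    FROM
        @BaseExtract;

@Result =
    SELECT
        JsonMap["someProperty"] AS SomeProperty,  // placeholder property name
        [FileName]
    FROM
        @ParsedData;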

However, I seem to have an extra-long row of data in a recent file that cannot be extracted.

Errors

Executing this both locally in Visual Studio and against my Data Lake Analytics service in Azure, I get the following errors.

E_RUNTIME_USER_EXTRACT_COLUMN_CONVERSION_TOO_LONG

Value too long failure when attempting to convert column data.

Can not convert string to proper type. The resulting data length is too long.

(Screenshots of the error message omitted.)

Having checked this with other tools, I can confirm that the longest line in the source file is 189,943 characters.

Questions

So, my questions for you, my friends...

  1. Has anyone else hit this limit?
  2. What is the defined character limit per line?
  3. What is the best way to work around this?
  4. Is a custom extractor going to be required?

Other Things

A couple of other thoughts...

  • As each new line in the file is a self-contained JSON block of data, I can't split the line.
  • If I manually copy the single long line into a separate file and format the JSON, U-SQL handles it as expected with the Newtonsoft.Json libraries.
  • Currently I'm using VS2015 with Data Lake Tools version 2.2.7.

Thank you in advance for your support with this.


1 Answer

4 votes

The limit for a U-SQL string value in a column is currently 128kB (see https://msdn.microsoft.com/en-us/library/azure/mt764129.aspx).

In my experience, a lot of people run into it (especially when processing JSON). There are a few ways to work around it:

  1. Find a way to rewrite the extractor to return byte[] and avoid generating a string value until you really have to. That should give you more data (up to 4MB). A sketch of such an extractor follows this list.

  2. Write a custom extractor that does all the navigation and decomposition of your specific JSON format down to the leaf nodes, thus avoiding intermediate long string values.

  3. Return SqlArray instead of string data type values and chunk the string into 128kB pieces (in UTF-8 encoding, not C#'s default UTF-16 encoding!). A chunking helper is also sketched after this list.
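To make option 1 concrete, here is a minimal sketch of such an extractor. The class and column names (RawBytesLineExtractor, RawBytes) are placeholders, and it assumes newline-delimited records processed as a whole file (AtomicFileProcessing = true), so treat it as a starting point rather than a drop-in implementation:

using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

// Sketch only: emits each newline-delimited record as a byte[] column so that
// no string value is materialised at extraction time (byte[] columns allow up to 4MB).
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class RawBytesLineExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        var current = new List<byte>();
        int b;

        while ((b = input.BaseStream.ReadByte()) != -1)
        {
            if (b == '\n')                       // end of one JSON record
            {
                output.Set<byte[]>("RawBytes", current.ToArray());
                yield return output.AsReadOnly();
                current.Clear();
            }
            else if (b != '\r')                  // tolerate CRLF line endings
            {
                current.Add((byte)b);
            }
        }

        if (current.Count > 0)                   // last record without a trailing newline
        {
            output.Set<byte[]>("RawBytes", current.ToArray());
            yield return output.AsReadOnly();
        }
    }
}

You would then register the code-behind assembly and call it in the EXTRACT statement with USING new RawBytesLineExtractor();, declaring the column as byte[] rather than string.

For option 3, a helper along these lines (again a sketch, with placeholder names) can split a long value into pieces that each stay within the 128kB limit when encoded as UTF-8 and wrap them in a SqlArray<string>. The chunking has to happen inside your own extractor or processor code, before the value is written to a column:

using System.Collections.Generic;
using System.Text;
using Microsoft.Analytics.Types.Sql;

public static class StringChunker
{
    // Sketch only: splits a long value into pieces whose UTF-8 encoding stays
    // within maxBytes, never cutting a surrogate pair in half.
    public static SqlArray<string> ChunkUtf8(string value, int maxBytes = 128 * 1024)
    {
        var chunks = new List<string>();
        var piece = new StringBuilder();
        int bytesSoFar = 0;

        for (int i = 0; i < value.Length; )
        {
            // Treat a surrogate pair as a single unit.
            int charLen = char.IsHighSurrogate(value[i]) && i + 1 < value.Length ? 2 : 1;
            int byteLen = Encoding.UTF8.GetByteCount(value.Substring(i, charLen));

            if (bytesSoFar + byteLen > maxBytes && piece.Length > 0)
            {
                chunks.Add(piece.ToString());
                piece.Clear();
                bytesSoFar = 0;
            }

            piece.Append(value, i, charLen);
            bytesSoFar += byteLen;
            i += charLen;
        }

        if (piece.Length > 0)
        {
            chunks.Add(piece.ToString());
        }

        return new SqlArray<string>(chunks);
    }
}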

We are looking into increasing the string size, but if you could file or vote up a request at http://aka.ms/adlfeedback, that would be helpful.