2 votes

I have a strange issue with a U-SQL job that processes zipped files. If I run the U-SQL on a normal CSV file it works fine, but if I gzip the file it no longer works, failing with E_RUNTIME_USER_EXTRACT_ENCODING_ERROR: Encoding error occurred after processing 0 record(s) in the vertex' input split.

The code that works is:

DECLARE @path string = "output/{ids}/{*}.csv";

@data =
    EXTRACT
        a string,
        b string,
        c string, 
        d string,
        ids string
    FROM  @path
    USING 
        Extractors.Csv(skipFirstNRows:1, silent: true);

@output = 
    SELECT *
    FROM @data 
    WHERE ids == "test";

OUTPUT @output
TO "output/res.csv"
USING Outputters.Csv(quoting : false, outputHeader: true);

This code does not work (the gzipped version of the same file):

DECLARE @path string = "output/{ids}/{*}.csv.gz";

@data =
    EXTRACT
        a string,
        b string,
        c string, 
        d string,
        ids string
    FROM  @path
    USING 
        Extractors.Csv(skipFirstNRows:1, silent: true);

@output = 
    SELECT *
    FROM @data 
    WHERE ids == "test";

OUTPUT @output
TO "output/res.csv"
USING Outputters.Csv(quoting : false, outputHeader: true);

If I remove the virtual column "ids", it works for the gzipped version:

DECLARE @path string = "output/test/{*}.csv.gz";

@data =
    EXTRACT
        a string,
        b string,
        c string, 
        d string
    FROM  @path
    USING 
        Extractors.Csv(skipFirstNRows:1, silent: true);

@output = 
    SELECT *
    FROM @data;

OUTPUT @output
TO "output/res.csv"
USING Outputters.Csv(quoting : false, outputHeader: true);
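
As an interim workaround, since I know the value I am filtering on up front, I can hard-code the folder and add the id back as a constant column (just a sketch of the idea, not a general fix):

DECLARE @path string = "output/test/{*}.csv.gz";

@data =
    EXTRACT
        a string,
        b string,
        c string,
        d string
    FROM  @path
    USING 
        Extractors.Csv(skipFirstNRows:1, silent: true);

// Re-attach the id that the virtual column would have supplied.
@output = 
    SELECT a, b, c, d, "test" AS ids
    FROM @data;

OUTPUT @output
TO "output/res.csv"
USING Outputters.Csv(quoting : false, outputHeader: true);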

Attached are the two files I am using. Does anyone have a clue as to what is going on? If I remove the virtual column ids, it works in both cases.

test.csv

test.csv.gz

I only get this error when I run against the files in Data Lake Storage. If I run against the files locally, it works fine.

The detailed error I receive is:

"internalDiagnostics":"","innerError":{"diagnosticCode":195887128,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_INVALID_CHARACTER","message":"Invalid character for UTF-8 encoding in input stream.","description":"Found invalid character for UTF-8 encoding in input.","resolution":"Correct the invalid character in the input file, or correct encoding in extractor and try again."}
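
The resolution text suggests correcting the encoding in the extractor. For reference, the built-in Csv extractor takes an encoding parameter, so it can be pinned explicitly (a minimal sketch; UTF-8 is already the default, so this only rules out the extractor's encoding setting):

@data =
    EXTRACT
        a string,
        b string,
        c string,
        d string,
        ids string
    FROM  @path
    USING 
        Extractors.Csv(encoding: Encoding.UTF8, skipFirstNRows:1, silent: true);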

Are you running locally or against your cloud ADLA account? Might be worth trying both to see if you get a different result. – wBob

I tried both options. I ran the script in Visual Studio against the files locally on disk (this works). I also tried running the script in Visual Studio against files in my ADLA account, as well as in the Azure portal (both fail). It seems to be something with the virtual columns; if I remove these, as shown in the last example, it works in all cases. – John

I can't reproduce this. The second script works perfectly well on my local and cloud ADLA account. What version of the tools are you using? I'm using 2.2.6000.1 with Visual Studio 2015. It may be worth submitting the script via the Azure portal, which would rule out the tools; finally, consider submitting a support request. – wBob

It was a bug in the runtime causing the issue! – John

I am facing the exact same issue currently, on a totally different query and data set. Not sure if it is fixed after all. – jayt.dev

2 Answers

5 votes

To add some additional information to the issue:

  1. This was a confirmed defect. We have deployed a fix for it, so you should no longer encounter the issue, with or without the @@FeaturePreviews = "FileSetV2Dot5:on" flag.

  2. SET @@FeaturePreviews = "FileSetV2Dot5:on" as above was the correct workaround, since it forces a different plan to be generated in which the defect did not exist.

  3. SET @@FeaturePreviews = "FileSetV2Dot5:on" is still turned off by default.

1 vote

I was having the exact same problem. It seems to be a bug in the new runtime for ADLA; MS is working on it. This fix worked for me:

SET @@FeaturePreviews = "FileSetV2Dot5:on";
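
For context, this is how it fits into the failing script from the question, with the SET statement at the top (a minimal sketch; the rest of the script is unchanged):

SET @@FeaturePreviews = "FileSetV2Dot5:on";

DECLARE @path string = "output/{ids}/{*}.csv.gz";

@data =
    EXTRACT
        a string,
        b string,
        c string,
        d string,
        ids string
    FROM  @path
    USING 
        Extractors.Csv(skipFirstNRows:1, silent: true);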