I have a strange issue with a U-SQL job that processes gzipped files. If I run the U-SQL script on a normal CSV file, it works fine. But if I gzip the file, it no longer works and fails with E_RUNTIME_USER_EXTRACT_ENCODING_ERROR: Encoding error occured after processing 0 record(s) in the vertex' input split.
The code that works is:
DECLARE @path string = "output/{ids}/{*}.csv";
@data =
EXTRACT
a string,
b string,
c string,
d string,
ids string
FROM @path
USING
Extractors.Csv(skipFirstNRows:1, silent: true);
@output =
SELECT *
FROM @data
WHERE ids == "test";
OUTPUT @output
TO "output/res.csv"
USING Outputters.Csv(quoting : false, outputHeader: true);
This code does not work (same script, but reading the gzipped version of the file):
DECLARE @path string = "output/{ids}/{*}.csv.gz";
@data =
EXTRACT
a string,
b string,
c string,
d string,
ids string
FROM @path
USING
Extractors.Csv(skipFirstNRows:1, silent: true);
@output =
SELECT *
FROM @data
WHERE ids == "test";
OUTPUT @output
TO "output/res.csv"
USING Outputters.Csv(quoting : false, outputHeader: true);
If I remove the virtual column "ids" and hard-code the folder name instead, it works for the gzipped version:
DECLARE @path string = "output/test/{*}.csv.gz";
@data =
EXTRACT
a string,
b string,
c string,
d string
FROM @path
USING
Extractors.Csv(skipFirstNRows:1, silent: true);
@output =
SELECT *
FROM @data;
OUTPUT @output
TO "output/res.csv"
USING Outputters.Csv(quoting : false, outputHeader: true);
Attached are the two files I am using. Does anyone have a clue as to what is going on? If I remove the virtual column ids, it works for both files.
I only get this error when I run against the files in Data Lake Storage. If I run against the files locally, it works fine.
The detailed error I receive is:
"internalDiagnostics":"","innerError":{"diagnosticCode":195887128,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_INVALID_CHARACTER","message":"Invalid character for UTF-8 encoding in input stream.","description":"Found invalid character for UTF-8 encoding in input.","resolution":"Correct the invalid character in the input file, or correct encoding in extractor and try again."
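Since the error complains about invalid UTF-8, one way to rule out a corrupt or mis-encoded archive is a quick local check of the .gz file itself. The following is only a sketch in Python, not part of the U-SQL job: the function name check_gzip_utf8 is hypothetical, and it assumes the whole file fits in memory. It checks for the gzip magic bytes, decompresses the content, and tries to decode the result as UTF-8.

```python
import gzip
import zlib

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def check_gzip_utf8(raw: bytes) -> str:
    """Classify raw file bytes as 'not-gzip', 'bad-gzip', 'bad-utf8', or 'ok'.

    Hypothetical helper for illustration; assumes the data fits in memory.
    """
    # A file that is merely named *.gz but is not compressed fails here.
    if raw[:2] != GZIP_MAGIC:
        return "not-gzip"
    try:
        decompressed = gzip.decompress(raw)
    except (OSError, EOFError, zlib.error):
        # Corrupt or truncated gzip stream.
        return "bad-gzip"
    try:
        decompressed.decode("utf-8")
    except UnicodeDecodeError:
        # Valid gzip, but the content is not valid UTF-8.
        return "bad-utf8"
    return "ok"
```

To use it, read the file in binary mode and pass the bytes in, for example check_gzip_utf8(open(path, "rb").read()) where path points at a local copy of one of the .csv.gz files. A result of "ok" would suggest the file itself is fine and that the problem lies in how the job reads it when the virtual column is present.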