I am running a Data Lake Analytics job and I get an error during extraction. My scripts use the built-in Text extractor and also my own extractor. I am trying to read data from a file that contains two columns separated by a space character. When I run the scripts locally everything works fine, but not when I run them with my DLA account. The problem occurs only for files with many thousands of rows (although only 36 MB of data); smaller files are processed correctly. I noticed that the exception is thrown when the total number of vertices for the extraction node is larger than one. I ran into this problem earlier with other "big" files (.csv, .tsv) and other extractors. Could someone tell me what is happening?
Error message:
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][0] with error: Vertex user code error. Vertex failed with a fail-fast error
Script code:
@result =
    EXTRACT s_date string,
            s_time string
    FROM @"/Samples/napis.txt"
    //USING USQLApplicationTest.ExtractorsFactory.getExtractor();
    USING Extractors.Text(delimiter:' ');

OUTPUT @result
TO @"/Out/Napis.log"
USING Outputters.Csv();
Code behind:
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (StreamReader sr = new StreamReader(input.BaseStream))
        {
            string line;
            // Read lines from the file until the end of the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                // Split the line on spaces and assign the parts to the
                // output columns in schema order.
                string[] words = line.Split(' ');
                int i = 0;
                foreach (var c in output.Schema)
                {
                    output.Set<object>(c.Name, words[i]);
                    i++;
                }
                yield return output.AsReadOnly();
            }
        }
    }
}

public static class ExtractorsFactory
{
    public static IExtractor getExtractor()
    {
        return new MyExtractor();
    }
}
Part of sample file:
...
str1 str2
str1 str2
str1 str2
str1 str2
str1 str2
...
In the job resources I found this jobError message:
"Unexpected number of columns in input stream."-"description":"Unexpected number of columns in input record at line 1.\nExpected 2 columns- processed 1 columns out of 1."-"resolution":"Check the input for errors or use \"silent\" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring \"invalid\" rows may influence job results.
But I checked the file again and I don't see any row with an incorrect number of columns. Is it possible that the error is caused by the file being split and distributed incorrectly? I read that big files can be extracted in parallel. Sorry for my poor English.
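For reference, the "silent" switch mentioned in the resolution text is a parameter of the built-in Text extractor. Below is a minimal sketch of how it could be applied to the script above, assuming the same input and output paths. As far as I understand, it would only skip the rows reported as malformed, so it may hide the problem rather than explain it.

```
@result =
    EXTRACT s_date string,
            s_time string
    FROM @"/Samples/napis.txt"
    // silent:true asks the extractor to skip rows whose column count
    // does not match the EXTRACT schema instead of failing the vertex.
    USING Extractors.Text(delimiter:' ', silent:true);

OUTPUT @result
TO @"/Out/Napis.log"
USING Outputters.Csv();
```

Comparing the row count of the output with the row count of the input file would show whether any rows were dropped this way.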