
I am running a Data Lake Analytics job, and during extraction I get an error. In my scripts I use the built-in Text extractor as well as my own extractor. I am trying to get data from a file containing two columns separated by a space character. When I run my scripts locally everything works fine, but not when I run them using my DLA account. The problem only occurs when I try to get data from files with many thousands of rows (yet only 36 MB of data); for smaller files everything works correctly. I noticed that the exception is thrown when the total number of vertices is larger than the one for the extraction node. I ran into this problem earlier while working with other "big" files (.csv, .tsv) and extractors. Could someone tell me what is happening?

Error message:

Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][0] with error: Vertex user code error. Vertex failed with a fail-fast error

Script code:

@result =
EXTRACT s_date string,
        s_time string
FROM @"/Samples/napis.txt"
//USING USQLApplicationTest.ExtractorsFactory.getExtractor();
USING Extractors.Text(delimiter:' ');

OUTPUT @result
TO @"/Out/Napis.log"
USING Outputters.Csv();

Code behind:

using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (StreamReader sr = new StreamReader(input.BaseStream))
        {
            string line;
            // Read each line of the file until the end is reached
            // and map its space-separated values onto the output schema.
            while ((line = sr.ReadLine()) != null)
            {
                string[] words = line.Split(' ');
                int i = 0;
                foreach (var c in output.Schema)
                {
                    output.Set<object>(c.Name, words[i]);
                    i++;
                }

                yield return output.AsReadOnly();
            }
        }
    }
}

public static class ExtractorsFactory
{
    public static IExtractor getExtractor()
    {
        return new MyExtractor();
    }
}

Part of the sample file:

...
str1 str2
str1 str2
str1 str2
str1 str2
str1 str2
...

In the job resources I found the jobError message:

"Unexpected number of columns in input stream."-"description":"Unexpected number of columns in input record at line 1.\nExpected 2 columns- processed 1 columns out of 1."-"resolution":"Check the input for errors or use \"silent\" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring \"invalid\" rows may influence job results.

But I checked the file again and I don't see an incorrect number of columns. Is it possible that the error is caused by an incorrect file split and distribution? I read that big files can be extracted in parallel. Sorry for my poor English.
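For reference, the "silent" switch mentioned in the jobError resolution above is a parameter of the built-in Text extractor. A minimal sketch based on the script in the question (note that this only skips over- or under-sized rows; it does not address a possible split/alignment problem):

@result =
EXTRACT s_date string,
        s_time string
FROM @"/Samples/napis.txt"
USING Extractors.Text(delimiter:' ', silent:true);

OUTPUT @result
TO @"/Out/Napis.log"
USING Outputters.Csv();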

Here is the answer to my question: [msdn] social.msdn.microsoft.com/Forums/azure/en-US/… – mieszko91

1 Answer


The same question was answered here: https://social.msdn.microsoft.com/Forums/en-US/822af591-f098-4592-b903-d0dbf7aafb2d/vertex-failure-triggered-quick-job-abort-exception-thrown-during-data-extraction?forum=AzureDataLake.

Summary:

We currently have an issue with large files where rows are not aligned with the file extent boundaries if the file is uploaded with the "wrong" tool. If you upload it as a row-oriented file through Visual Studio or via the PowerShell command, it should come out aligned (provided the row delimiter is CR or LF). If you did not use the "right" upload tool, the built-in extractor will show the behavior you report, because it currently assumes that record boundaries are aligned with the extents into which we split the file for parallel processing. We are working on a general fix.

If you see similar error messages with your custom extractor that uses AtomicFileProcessing=true and should be immune to the split, please send me your job link so I can file an incident and have the engineering team review your case.
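For reference, because AtomicFileProcessing = true tells the runtime not to split the file across extents, the custom extractor from the question's code-behind should process the whole file on a single vertex. A sketch of invoking it from the script, using the factory method already shown in the question (the commented-out USING line):

@result =
EXTRACT s_date string,
        s_time string
FROM @"/Samples/napis.txt"
USING USQLApplicationTest.ExtractorsFactory.getExtractor();

OUTPUT @result
TO @"/Out/Napis.log"
USING Outputters.Csv();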